Today during code review I learned an important lesson.
If you have written a Hadoop Reducer before, you know that each Reducer task gets many keys assigned to it based on the partitioning method. In its run() method, the framework iterates over those keys and their corresponding values and passes them to the reduce() method, so each call to reduce() handles only one key and its values.
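Conceptually, the loop looks roughly like this; it is a simplified sketch of the control flow, not the real Hadoop source, and the key/value types are just for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: the framework's run() loop calls reduce() once per key,
// handing it that key plus all of its values.
public class SketchReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKey()) {
            // One reduce() call per key.
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
        cleanup(context);
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
```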
I was trying to collect statistics for every key. My initial approach was to use a global statistics variable, so we would not need to allocate memory for it on every reduce() call. But I forgot to clear that global variable at the start of each reduce() call, so the data for every key just kept accumulating. The variable holds two ArrayLists, and with tens of thousands of keys per Reducer those lists grew huge and ate a lot of memory. The whole MapReduce job took 5 hours.
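The buggy pattern looked roughly like this; the field names and the statistics being collected are made up for illustration, not my actual code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Buggy pattern: the per-key statistics live in member fields that are
// reused across reduce() calls but never cleared, so they accumulate
// data from every key the Reducer has seen so far.
public class StatsReducer extends Reducer<Text, LongWritable, Text, Text> {

    // Shared across all keys handled by this Reducer task.
    private final List<Long> samples = new ArrayList<>();
    private final List<Long> outliers = new ArrayList<>();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // BUG: samples and outliers still hold data from all previous keys.
        for (LongWritable v : values) {
            samples.add(v.get());
            if (v.get() > 1_000_000L) {
                outliers.add(v.get());
            }
        }
        context.write(key, new Text(samples.size() + "\t" + outliers.size()));
    }
}
```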
After I found the bug and added the clear call, the job time dropped to 8 minutes.
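With the fix, the reduce() method from the sketch above just resets the lists before processing each key:

```java
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Fix: clear the reused statistics before handling this key's values.
        samples.clear();
        outliers.clear();
        for (LongWritable v : values) {
            samples.add(v.get());
            if (v.get() > 1_000_000L) {
                outliers.add(v.get());
            }
        }
        context.write(key, new Text(samples.size() + "\t" + outliers.size()));
    }
```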
Later, I made the global variable local, so a fresh one is initialized on every reduce() call and there is nothing to clear. That version took 9 minutes.
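In the same sketch, the local-variable version would look like this; the member fields go away entirely:

```java
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Local lists: allocated fresh for each key, so there is nothing to clear.
        List<Long> samples = new ArrayList<>();
        List<Long> outliers = new ArrayList<>();
        for (LongWritable v : values) {
            samples.add(v.get());
            if (v.get() > 1_000_000L) {
                outliers.add(v.get());
            }
        }
        context.write(key, new Text(samples.size() + "\t" + outliers.size()));
    }
```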
To gain that 1 minute, I spent hours waiting for results and debugging, which makes a very *good* example of Donald Knuth's quote: premature optimization is the root of all evil.