Meaning of map time or reduce time in JobHistoryServer
I want to know the exact meaning of the notations in the below picture. This picture came from job history server web UI. I definitely know the meaning of Elapsed but I am not sure about other things. Where can I find clear definition of those? Or is there anyone who knows the meaning of those?
What I want to know is map time, reduce time, shuffle time and merge time separately. And the sum of the four time should be very similar(or equal) to elapsed time. But the 'Average' keyword makes me confuse.
There are 396 map, and 1 reduce.
As you probably already know, there are three phases to a MapReduce job:
Map is the 1st phase, where each Map task is provided with an input split, which is a small portion of the total input data. The Map tasks process data from the input split & output intermediate data which needs to go to the reducers.
Shuffle phase is the next step, where the intermediate data that was generated by Map tasks is directed to the correct reducers. Reducers usually handle a subset of the total number of keys generated by the Map task. The Shuffle phase assigns keys to reducers & sends all values pertaining to a key to the assigned reducer. Sorting (or Merging) is also a part of this phase, where values of a given key are sorted and sent to the reducer. As you may realize, the shuffle phase involves transfer of data across the network from Map -> Reduce tasks.
Reduce is the last step of the MapReduce Job. The Reduce tasks process all values pertaining to a key & output their results to the desired location (HDFS/Hive/Hbase).
Now coming to the average times, you said there were 396 map tasks. Each Map task is essentially doing exactly the same processing job, but on different chunks of data. So the Average Map time is basically the average of time taken by all 396 map tasks to complete.
Average Map Time = Total time taken by all Map tasks/ Number of Map Tasks
Average Reduce Time = Total time taken by all Reduce tasks/Number of Reduce tasks
Now, why is the average time significant? It is because, most, if not all your map tasks & reduce tasks would be running in parallel (depending on your cluster capacity/ no. of slots per node, etc.). So calculating the average time of all map tasks & reduce tasks will give you good insight into the completion time of the Map or Reduce phase as a whole.
Another observation from your screenshot is that your Shuffle phase took 40 minutes. There can be several reasons for this.
You have 396 map tasks, each generating intermediate data. The shuffle phase had to pass all this data across the network to just 1 reducer, causing a lot of network traffic & hence increasing transfer time. Maybe you can optimize performance by increasing the number of reducers.
The network itself has very low bandwidth, and cannot efficiently handle large amounts of data transfer. In this case, consider deploying a combiner, which will effectively reduce the amount of data flowing through your network between the map and reduce phases.
There are also some hidden costs of execution such as job setup time, time required by job tracker to contact task trackers & assign map/reduce tasks, time taken by slave nodes to send heartbeat signals to JobTracker, time taken by NameNode to assign storage block & create Input splits, etc. which all go into the total elapsed time.
Hope this helps.
Maybe set hive.hadoop.supports.splittable.combineinputformat=true is used for you;