Hadoop map tasks reading from same node statistics
I am currently using Hadoop. I was wondering whether I can find out if the map tasks for a given job are reading their data from their own node or from other nodes. I know that HDFS is distributed among all the nodes, but are there any counters/metrics that would say, for a given job and a given map task, how much data was read from the same node the map task is running on, and how much data was read over the network?
Hadoop doesn't have a counter that tells you directly how much data was read locally and how much was read over the network.
The only thing you can do is combine several standard counters to get an approximation of the data read locally versus over the network:
DATA_LOCAL_MAPS: the number of map tasks in the job that ran on the same machine as their input data.
RACK_LOCAL_MAPS: the number of map tasks that ran on a node in the same rack as their input data.
OTHER_LOCAL_MAPS: the number of map tasks that ran on a node in a different rack than the one where their input data is located.
MAP_INPUT_BYTES: the total number of bytes consumed by all map tasks (entire job).
*(you should check the exact names of the counters for your distribution of Hadoop)
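Considering that EACH map task processes ONE input split and that the default input splits are approximately equal in size, you can estimate the total amount of locally processed data using this formula:
DATA_LOCAL_MAPS * (MAP_INPUT_BYTES / (DATA_LOCAL_MAPS + RACK_LOCAL_MAPS + OTHER_LOCAL_MAPS))
The second factor of the multiplication is the average number of input bytes per map task. Subtracting the result from MAP_INPUT_BYTES gives the corresponding estimate of the data read over the network. Keep in mind that this is only an approximation: it breaks down if the splits differ a lot in size.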
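To make the arithmetic concrete, here is a minimal Python sketch of the formula above. It does not talk to Hadoop at all: the function name estimate_local_bytes and the counter values in the usage lines are made up for illustration, and you would plug in the numbers reported for your own job.

```python
def estimate_local_bytes(data_local_maps, rack_local_maps,
                         other_local_maps, map_input_bytes):
    """Approximate (local_bytes, network_bytes) from job counters.

    Assumes every map task processes one input split and that all
    splits are roughly the same size, as stated above.
    """
    total_maps = data_local_maps + rack_local_maps + other_local_maps
    if total_maps == 0:
        # No map tasks were counted, so nothing was read.
        return 0.0, 0.0

    # Average number of input bytes handled by a single map task.
    bytes_per_map = map_input_bytes / total_maps

    # Data-local tasks are assumed to read their whole split locally.
    local_bytes = data_local_maps * bytes_per_map
    network_bytes = map_input_bytes - local_bytes
    return local_bytes, network_bytes


# Hypothetical counter values, for illustration only.
local, network = estimate_local_bytes(
    data_local_maps=80,
    rack_local_maps=15,
    other_local_maps=5,
    map_input_bytes=1000,
)
print(f"local: {local:.0f} bytes, network: {network:.0f} bytes")
```

Rack-local and off-rack tasks are lumped together as "network" here; if you care about cross-rack traffic specifically, you could apply the same per-task average to OTHER_LOCAL_MAPS alone.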