How to see the input records of a particular Hadoop task?
I am running a Hadoop job. All but 4 tasks are done, and I am wondering why those chunks are taking so much longer to process. My guess is that those input records are "hard" for my job to process. To test this locally, I would like to retrieve those input records. How can I do this?
The status column for the task says hdfs://10.4.94.75:8020/user/someuser/myfilename:154260+3
But what does it mean?
The last part of the status gives you information about the split. More specifically, myfilename:154260+3 tells you that the task with this status processed the split of "myfilename" starting at byte offset 154260 and spanning 3 bytes.
Given this piece of information, you can find the records assigned to this task by seeking to byte 154260 in the file and reading 3 bytes.
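If you want to pull exactly those bytes out of HDFS for local testing, a minimal sketch using the Hadoop FileSystem API could look like this (the class name SplitDumper is made up; the URI, offset, and length are taken from the status string above):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitDumper {
    public static void main(String[] args) throws Exception {
        // Values taken from the task status string; adjust for your cluster.
        URI uri = URI.create("hdfs://10.4.94.75:8020/user/someuser/myfilename");
        long start = 154260L;   // byte offset where the split begins
        long length = 3L;       // length of the split in bytes

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(uri, conf);

        try (FSDataInputStream in = fs.open(new Path(uri))) {
            in.seek(start);                     // jump to the split's first byte
            byte[] buf = new byte[(int) length];
            in.readFully(buf);                  // read exactly the split's bytes
            System.out.write(buf);              // dump them to stdout
        }
    }
}
```

One caveat: record readers do not necessarily treat split boundaries as record boundaries. With TextInputFormat, for example, the reader skips ahead to the first newline after the split's start offset (unless the offset is 0) and keeps reading past the split's end to finish its last line. So to recover whole records, you may need to read a little before and after the raw byte range.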