Map-side join with Hadoop Streaming

I have a file in which each line is a record. I want all records with the same value in a certain field (call it field A) to go to the same mapper. I've heard this is called a Map-Side Join, and I've also heard that it's easy if the records in the file are sorted by field A.

If it would be easier, the data could be spread across multiple files, with each file sorted on field A.

Is this right? How do I do this with streaming? I'm using Python. I assume it's just part of the command I use to start Hadoop?

Thank you in advance for any help!

Answers


What is the real justification for wanting only certain records to go to certain mappers? If what you want out of this is for the final result to be 3 output files (one with all A, another with all B, the last with all C), you can accomplish that with multiple reducers. It would help to know what you are really trying to accomplish.
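
To make the reducer suggestion concrete, here is a minimal sketch of a streaming script in Python. It assumes the records are tab-separated and that field A is the first column; both are assumptions, since the question doesn't say. The script name `group_by_a.py` and the function names are made up for illustration. The mapper prefixes each record with field A as the key, and the reducer relies on streaming delivering all lines with the same key, sorted, to the same reducer.

```python
#!/usr/bin/env python
import sys
from itertools import groupby

FIELD_A_INDEX = 0  # assumption: field A is the first column
SEP = "\t"         # assumption: records are tab-separated


def map_lines(lines):
    """Mapper: emit 'fieldA<TAB>record' so field A becomes the streaming key."""
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue  # skip blank lines
        fields = record.split(SEP)
        yield fields[FIELD_A_INDEX] + SEP + record


def reduce_lines(lines):
    """Reducer: input is sorted by key, so equal keys arrive consecutively."""
    pairs = (line.rstrip("\n").split(SEP, 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        records = [kv[1] for kv in group]
        # Placeholder logic: report how many records share this value of A.
        yield key + SEP + str(len(records))


# Only read stdin when a role is given: "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    func = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in func(sys.stdin):
        print(out)
```

With this layout you would pass something like `group_by_a.py map` to the streaming jar's `-mapper` option and `group_by_a.py reduce` to `-reducer`, and set the number of reducers with `-D mapreduce.job.reduces=N`. You can check the logic locally first with `cat input | python group_by_a.py map | sort | python group_by_a.py reduce`.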

