Map-side join with Hadoop Streaming
I have a file in which each line is a record. I want all records with the same value in a certain field (call it field A) to go to the same mapper. I've heard this is called a Map-Side Join, and I've also heard that it's easy if the records in the file are sorted by field A.
If it would be easier, the data could be spread across multiple files, with each file sorted on field A.
Is this right? How do I do this with streaming? I'm using Python. I assume it's just part of the command I use to start Hadoop?
Thank you in advance for any help!
What is the real justification for wanting only certain records to go to certain mappers? If what you want is a final result of 3 output files (one with all A, another with all B, and the last with all C), you can accomplish that with multiple reducers. We need to know what you are really trying to accomplish.
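To illustrate the reducer-based approach suggested above, here is a minimal sketch of a streaming mapper and reducer in Python. It assumes records are tab-separated and that field A is the first column; `FIELD_A_INDEX`, `map_lines`, and `reduce_lines` are hypothetical names chosen for this example, not part of Hadoop. It relies on the default Hadoop Streaming behavior of treating the text before the first tab as the key and delivering all lines with the same key to the same reducer, sorted by key.

```python
import sys
from itertools import groupby

# Assumption: records are tab-separated and field A is the first column.
FIELD_A_INDEX = 0


def map_lines(lines, key_index=FIELD_A_INDEX, sep="\t"):
    """Mapper: emit 'fieldA<TAB>original record' for every non-empty line."""
    for line in lines:
        record = line.rstrip("\n")
        if not record:
            continue
        fields = record.split(sep)
        yield fields[key_index] + sep + record


def reduce_lines(lines, sep="\t"):
    """Reducer: yield (key, [records]) for each run of consecutive lines
    that share a key. Input is expected to arrive sorted by key."""
    parsed = (line.rstrip("\n").partition(sep) for line in lines if line.strip())
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, [kv[2] for kv in group]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: 'script.py map' as the mapper, 'script.py reduce' as the reducer.
    if sys.argv[1] == "map":
        for out in map_lines(sys.stdin):
            print(out)
    elif sys.argv[1] == "reduce":
        for key, records in reduce_lines(sys.stdin):
            # Placeholder processing: report how many records share each key.
            print(key + "\t" + str(len(records)))
```

The script would be passed to the streaming jar through its `-mapper` and `-reducer` options. Whether this fits depends on the actual goal: it groups records on the reduce side, so it is not a map-side join.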