Is it possible to process multi-line records using Hadoop Streaming?
I have records like this:
Name: Alan Kay
Email: Alan.Kay@url.com
Date: 09-09-2013

Name: Marvin Minsky
Email: Marvin.Minsky@url.com
City: Boston, MA
Date: 09-10-2013

Name: Alan Turing
City: New York City, NY
Date: 09-10-2013
They're multi-line records with a varying number of lines, usually separated by a blank line. How would I convert them to the output below?
Alan Kay|Alan.Kay@url.com||09-09-2013
Marvin Minsky|Marvin.Minsky@url.com|Boston,MA|09-10-2013
Alan Turing||New York City, NY|09-10-2013
Apache Pig treats each line as a record, so it's not suited for this task. I'm aware of this blog post on processing multi-line records, but I'd prefer not to delve into Java if there's a simpler solution. Is there a way to solve this using Hadoop Streaming (or a framework like mrjob)?
There is no shortcut for this. You have to create your own InputFormat and RecordReader classes, and then specify those classes in the Hadoop Streaming command. Follow this:
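The custom RecordReader handles the framing (grouping lines into whole records across split boundaries); the per-record conversion itself is straightforward and could live in a streaming mapper. Below is a minimal Python sketch of that conversion, assuming each record reaches the mapper intact and that records are separated by blank lines. The field list and function names are illustrative, not part of any framework API:

```python
import sys

# Assumed fixed output column order, inferred from the desired output.
FIELDS = ["Name", "Email", "City", "Date"]

def record_to_row(record_lines):
    """Turn a list of 'Key: Value' lines into one pipe-delimited row."""
    fields = {}
    for line in record_lines:
        key, _, value = line.partition(": ")
        fields[key] = value
    # Missing fields (e.g. a record with no Email) become empty columns.
    return "|".join(fields.get(k, "") for k in FIELDS)

def main(stream=sys.stdin):
    buf = []
    for line in stream:
        line = line.rstrip("\n")
        if line:
            buf.append(line)
        elif buf:                  # blank line ends the current record
            print(record_to_row(buf))
            buf = []
    if buf:                        # flush the final record
        print(record_to_row(buf))

if __name__ == "__main__":
    main()
```

The caveat is exactly why the custom InputFormat is needed: Hadoop Streaming splits input by line, so a record straddling a split boundary would arrive at a mapper incomplete. Once a RecordReader delivers whole records (or the input is a single unsplit file), logic like the above produces the pipe-delimited output shown in the question.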