Is it possible to process multi-line records using Hadoop Streaming?

I have records like this:

Name: Alan Kay
Email: Alan.Kay@url.com
Date: 09-09-2013

Name: Marvin Minsky
Email: Marvin.Minsky@url.com
City: Boston, MA
Date: 09-10-2013

Name: Alan Turing
City: New York City, NY
Date: 09-10-2013

The records span multiple lines, the number of lines per record varies, and records are separated by a blank line. How would I convert this input to the output below?

Alan Kay|Alan.Kay@url.com||09-09-2013
Marvin Minsky|Marvin.Minsky@url.com|Boston, MA|09-10-2013
Alan Turing||New York City, NY|09-10-2013

Apache Pig treats each line as a record, so it's not suited for this task. I'm aware of this blog post on processing multi-line records, but I'd prefer not to delve into Java if there's a simpler solution. Is there a way to solve this using Hadoop Streaming (or a framework like mrjob)?
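For illustration, here is a minimal Python sketch of a blank-line-separated record parser that produces the target output. The field names and `|` separator are taken from the sample above; everything else is an assumption. Run locally it does the whole conversion; used as a streaming mapper it is only safe when each record is guaranteed to fall within a single input split, since the default `TextInputFormat` can cut a multi-line record in half at a split boundary.

```python
import sys

# Column order assumed from the desired output above.
FIELDS = ["Name", "Email", "City", "Date"]

def parse_records(lines):
    """Yield one pipe-delimited row per blank-line-separated record.

    Missing fields (e.g. a record with no Email) become empty columns.
    """
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: end of the current record, if any.
            if record:
                yield "|".join(record.get(f, "") for f in FIELDS)
                record = {}
            continue
        # Split "Key: Value" on the first colon only, so values
        # like "City: New York City, NY" survive intact.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    if record:  # flush the last record (no trailing blank line needed)
        yield "|".join(record.get(f, "") for f in FIELDS)

if __name__ == "__main__":
    for row in parse_records(sys.stdin):
        print(row)
```

Usage as a plain filter: `python parser.py < records.txt`. The same script can serve as the `-mapper` in a streaming job under the split caveat above.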

Answers


There is no shortcut for this. You have to write your own InputFormat and RecordReader classes, and then you can specify those classes in the Hadoop Streaming command. Follow this:

http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
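For reference, once such classes exist, wiring them into a streaming job might look like the sketch below. The class and jar names (`com.example.MultiLineInputFormat`, `myformat.jar`, `mapper.py`) are hypothetical placeholders, not anything from the linked post; only the flags themselves are standard Hadoop Streaming options.

```shell
# Hypothetical invocation: MultiLineInputFormat is the custom class you
# would write (as in the linked post), packaged into myformat.jar.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -libjars myformat.jar \
  -inputformat com.example.MultiLineInputFormat \
  -input /data/records \
  -output /data/out \
  -mapper mapper.py \
  -reducer cat \
  -file mapper.py
```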
