Structuring unstructured data with MapReduce

I have a log file. A representative snippet is given below:

<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>
<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>
<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data
garbage data
garbade data
Io exception
>
<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data
garbage data java.rmi.RemoteException
>

I am trying to build an analysis on top of it.

What I want to do:

I want to get the count of each exception per year.

For example, from the above sample data my output should be:

    java.rmi.RemoteException 2013 1
    Io exception             2013 2
    java.rmi.RemoteException 2014 1

What is my problem:

1. Hadoop processes a text file line by line, so it treats "Io exception" on line 6 as a
 record of its own, whereas it should be part of the record that starts on line 3 (and continues through line 7).

2. I can't use NLineInputFormat because the number of lines per record is not fixed.

What is the pattern and what I want:

The only pattern I see is that a record starts with "<" and ends with ">". In the
above example, line 3 doesn't end with ">", so I want Hadoop to treat all the data
that follows as part of the same line until it reaches a ">".

This is how I want Hadoop to see the sample data:

<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>
<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>
<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data garbage data garbade data Io exception>
<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data garbage data java.rmi.RemoteException>

I would be glad if anybody could share a piece of code or an idea to overcome this problem.

Thanks in advance :)

Answers


You would need to implement your own InputFormat and RecordReader. What you really need is an adaptation of StreamInputFormat, which is part of the hadoop-streaming project.
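To make the idea concrete, here is a minimal sketch of the grouping logic such a RecordReader would need. It is plain Java rather than Hadoop API code, and the class and method names (LogRecordJoiner, joinRecords) are made up for illustration. It assumes, as the question does, that a record is complete as soon as a physical line ends with ">". A real RecordReader would also have to deal with records that straddle an input split boundary, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: joins physical lines into logical records.
class LogRecordJoiner {

    // Assumption: a logical record ends on the first physical line that ends with '>'.
    static List<String> joinRecords(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            // Separate continuation lines with a space, except for a lone closing '>'.
            if (current.length() > 0 && !trimmed.equals(">")) {
                current.append(' ');
            }
            current.append(trimmed);
            if (trimmed.endsWith(">")) {
                records.add(current.toString()); // record is complete
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // keep an unterminated trailing record
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>",
            "<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data",
            "garbage data",
            "Io exception",
            ">");
        for (String record : joinRecords(lines)) {
            System.out.println(record);
        }
    }
}
```

One limitation to keep in mind: if a continuation line of garbage data happens to end with ">", the record is cut short. In that case a stricter rule is needed, for example starting a new record only when a line begins with "<" followed by a date.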

For our use case, which is multi-line XML, we use hadoop-streaming to read from a start tag to an end tag that we define. You can check the source and adapt it to your requirements.
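Once each multi-line entry reaches the mapper as a single record, the per-year count from the question is an ordinary word-count style job: emit an (exception, year) key per record and sum in the reducer. The sketch below shows only that counting logic in plain Java, without the Hadoop Mapper/Reducer classes. The class name ExceptionYearCounter and the fixed list of exception names are assumptions taken from the sample data, not anything Hadoop provides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts known exception names per year over whole records.
class ExceptionYearCounter {

    // First four-digit number inside the leading <...> field is taken as the year.
    private static final Pattern YEAR = Pattern.compile("^<[^>]*?(\\d{4})[^>]*>");

    // Assumption: the exception names of interest are known up front (from the sample).
    private static final String[] KNOWN = {"java.rmi.RemoteException", "Io exception"};

    static Map<String, Integer> count(List<String> records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            Matcher m = YEAR.matcher(record);
            if (!m.find()) {
                continue; // no timestamp field, skip the record
            }
            String year = m.group(1);
            for (String name : KNOWN) {
                if (record.contains(name)) {
                    // In a MapReduce job the mapper would emit (key, 1)
                    // and the reducer would do this summation.
                    counts.merge(name + "\t" + year, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>",
            "<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>",
            "<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data garbage data garbade data Io exception>",
            "<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data garbage data java.rmi.RemoteException>");
        count(records).forEach((key, n) -> System.out.println(key + "\t" + n));
    }
}
```

On the four sample records this gives 2 for "Io exception" in 2013 and 1 each for java.rmi.RemoteException in 2013 and 2014, which matches the output asked for in the question.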

