Pig script/command to filter files on specific strings

I am trying to write Hadoop Pig script which will take 2 files and filter based on string i.e

words.txt

google 
facebook 
twitter 
linkedin

tweets.json

{"created_time": "18:47:31 ", "text": "RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...", "user_id": 450990391, "id": 252479809098223616, "created_date": "Sun Sep 30 2012"}

SCRIPT

twitter  = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
    filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
    extracted = FOREACH filtered GENERATE 'facebook' AS pattern,id, user_id, created_time, created_date, text;
    final = GROUP extracted BY pattern;
    dump final;

OUTPUT

(facebook,{(facebook,252545104890449921,291041644,23:06:59 ,Sun Sep 30 2012,RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...)})

the output that im getting is, without loading the words.txt file i.e by filtering the tweet directly.

I need to get the output as

(facebook)(complete tweet of that facebook word contained)

i.e it should read the words.txt and as words are reading according to that it should get all the tweets from tweets.json file

Any help

Mohan.V

Answers


You could think something in direction of running multiple statements within FOREACH statement. Something like this-

final = FOREACH words  {
            a = CONCAT(CONCAT('.*',words.$0),'.*') as aaa;
            filtered = FILTER twitter BY (text MATCHES aaa);
        generate a, flatten(filtered) as output; }

Please note that this is just to give an idea and I have not tested it. I will try test as soon as I get access to the Pig environment, but this should get you started.


Need Your Help

HttpModule not working for ajax request

c# .net ajax httpmodule

I am having an issue about .net HttpModule.

How can I efficiently 'subclass' in external CSS?

html css stylesheet code-duplication

I have a box defined that works for most of my site: