Hadoop word count example - null pointer exception

I'm Hadoop beginner. My setup: RHEL7, hadoop-2.7.3

I'm trying to run the Example:_WordCount_v2.0. I just copied the source code to new eclipse project and export it to wc.jar file.

Now, I've configured hadoop Pseudo-Distributed Operation as explianed in the link. Then I start by the following:

creating input files in input directory:

echo "Hello World, Bye World!" > input/file01
echo "Hello Hadoop, Goodbye to hadoop." > input/file02

start env:

sbin/start-dfs.sh
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/<username>
bin/hdfs dfs -put input input
bin/hadoop jar ws.jar WordCount2 input output

and this is what I got:

16/09/02 13:15:01 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/09/02 13:15:01 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/09/02 13:15:01 INFO input.FileInputFormat: Total input paths to process : 2
16/09/02 13:15:01 INFO mapreduce.JobSubmitter: number of splits:2
16/09/02 13:15:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local455553963_0001
16/09/02 13:15:01 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/09/02 13:15:01 INFO mapreduce.Job: Running job: job_local455553963_0001
16/09/02 13:15:01 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/09/02 13:15:01 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/09/02 13:15:01 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/09/02 13:15:02 INFO mapred.LocalJobRunner: Waiting for map tasks
16/09/02 13:15:02 INFO mapred.LocalJobRunner: Starting task: attempt_local455553963_0001_m_000000_0
16/09/02 13:15:02 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/09/02 13:15:02 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
16/09/02 13:15:02 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/aii/input/file02:0+33
16/09/02 13:15:02 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/09/02 13:15:02 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/09/02 13:15:02 INFO mapred.MapTask: soft limit at 83886080
16/09/02 13:15:02 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/09/02 13:15:02 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/09/02 13:15:02 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/09/02 13:15:02 INFO mapred.MapTask: Starting flush of map output
16/09/02 13:15:02 INFO mapred.LocalJobRunner: Starting task: attempt_local455553963_0001_m_000001_0
16/09/02 13:15:02 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/09/02 13:15:02 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
16/09/02 13:15:02 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/aii/input/file01:0+24
16/09/02 13:15:02 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/09/02 13:15:02 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/09/02 13:15:02 INFO mapred.MapTask: soft limit at 83886080
16/09/02 13:15:02 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/09/02 13:15:02 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/09/02 13:15:02 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/09/02 13:15:02 INFO mapred.MapTask: Starting flush of map output
16/09/02 13:15:02 INFO mapred.LocalJobRunner: map task executor complete.
16/09/02 13:15:02 WARN mapred.LocalJobRunner: job_local455553963_0001
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
    at WordCount2$TokenizerMapper.setup(WordCount2.java:47)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/02 13:15:02 INFO mapreduce.Job: Job job_local455553963_0001 running in uber mode : false
16/09/02 13:15:02 INFO mapreduce.Job:  map 0% reduce 0%
16/09/02 13:15:02 INFO mapreduce.Job: Job job_local455553963_0001 failed with state FAILED due to: NA
16/09/02 13:15:02 INFO mapreduce.Job: Counters: 0

No results (output) was given. Why did I get that exception?

Thanks

EDIT:

Thanks to solutions suggested I've realized that there is a second try (in the wordCount example):

echo "\." > patterns.txt
echo "\," >> patterns.txt
echo "\!" >> patterns.txt
echo "to" >> patterns.txt

and then run:

bin/hadoop jar ws.jar WordCount2 -Dwordcount.case.sensitive=true input output -skip patterns.txt

and everything is work great!

Answers


The problem is occurring in the setup() method of the mapper. This wordcount example is a bit more advanced than usual and allows you to specify a file containing patterns the mapper will filter out. This file is added to the cache in the main() method, so that its available on every node for the mappers to open.

You can see the files being added to the cache in main():

for (int i=0; i < remainingArgs.length; ++i) {
    if ("-skip".equals(remainingArgs[i])) {
        job.addCacheFile(new Path(remainingArgs[++i]).toUri());
        job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
    } else {
        otherArgs.add(remainingArgs[i]);
    }
}

You aren't specifying the -skip option so it won't try and add anything. If a file is added you can see it sets wordcount.skip.patterns to true.

In the mappers setup() you have this code:

 @Override
 public void setup(Context context) throws IOException, InterruptedException {
     conf = context.getConfiguration();
     caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
     if (conf.getBoolean("wordcount.skip.patterns", true)) {
         URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
         for (URI patternsURI : patternsURIs) {
             Path patternsPath = new Path(patternsURI.getPath());
             String patternsFileName = patternsPath.getName().toString();
             parseSkipFile(patternsFileName);
         }
     }
}

The problem is this check conf.getBoolean("wordcount.skip.patterns", true) defaults to true if its not set, and in your case it won't be. Thus patternsURIs or something around there (i dont have the line numbers) will be null.

So you can either change wordcount.case.sensitive to default to false, set it to false in the driver (main method) or provide a skip file to fix it.


The problem may be this part of your code:

 caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
 if (conf.getBoolean("wordcount.skip.patterns", true)) {
     URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
     for (URI patternsURI : patternsURIs) {
         Path patternsPath = new Path(patternsURI.getPath());
         String patternsFileName = patternsPath.getName().toString();
         parseSkipFile(patternsFileName);
     }
 }

Here getCacheFiles() is returning null for any reason. That's why when you are trying to iterate over patternsURIs (which has nothing but null), you are getting exception.

To solve this, before starting loop check whether patternsURIs is null or not.

if(patternsURIs != null) {
    for (URI patternsURI : patternsURIs) {  
      Path patternsPath = new Path(patternsURI.getPath());
      String patternsFileName = patternsPath.getName().toString();
      parseSkipFile(patternsFileName);
    }
}

You should also check why are you getting null, if it is not expected to get null.


Need Your Help

Hide if html is the same

javascript jquery html events if-statement

I'm using Feedburner to show feeds. sometimes the feeds have the same title. In situations where this is the case I would like to show only the first title and hide all the other titles with the same