Hadoop PathFilter config is null

I've got a path filter that looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class AvroFileInclusionFilter extends Configured implements PathFilter {
  Configuration conf;

  @Override
  public void setConf(Configuration conf) {
      this.conf = conf;
  }

  @Override
  public boolean accept(Path path) {
      System.out.println("FileInclusion: " + conf.get("fileInclusion"));
      return true;
  }
}

I am explicitly setting the fileInclusion property on the configuration, but for some reason the configuration used in the path filter is not the same one I set up in my job, which looks like this:

    Job job = Job.getInstance(getConf(), "Stock Updater");

    job.getConfiguration().set("outputPath", opts.outputPath);

    String[] inputPaths = findPathsForDays(job.getConfiguration(),
            new Path(opts.inputPath), findDaysToQuery(job.getConfiguration(),
                    opts.updatefile)).toArray(new String[]{});
    job.getConfiguration().set("fileInclusion", "hello`");

    AvroKeyValueInputFormat.addInputPath(job, new Path(opts.inputPath));
    job.getConfiguration().set("mapred.input.pathFilter.class", AvroFileInclusionFilter.class.getName());

    job.setInputFormatClass(AvroKeyValueInputFormat.class);

    LazyOutputFormat.setOutputFormatClass(job, AvroKeyValueOutputFormat.class);
    AvroKeyValueOutputFormat.setOutputPath(job, new Path(opts.outputPath));

    job.addCacheFile(new Path(opts.updatefile).toUri());

    AvroKeyValueOutputFormat.setCompressOutput(job, true);
    job.getConfiguration().set(AvroJob.CONF_OUTPUT_CODEC, snappyCodec().toString());

    AvroJob.setInputKeySchema(job, DateKey.SCHEMA$);
    AvroJob.setInputValueSchema(job, StockUpdated.SCHEMA$);
    AvroJob.setMapOutputKeySchema(job, DateKey.SCHEMA$);
    AvroJob.setMapOutputValueSchema(job, StockUpdated.SCHEMA$);
    AvroJob.setOutputKeySchema(job, DateKey.SCHEMA$);
    AvroJob.setOutputValueSchema(job, StockUpdated.SCHEMA$);

    job.setMapperClass(StockUpdaterMapper.class);
    job.setReducerClass(StockUpdaterReducer.class);

    AvroMultipleOutputs.addNamedOutput(job, "output", AvroKeyValueOutputFormat.class,
            DateKey.SCHEMA$, StockUpdated.SCHEMA$);

    job.setJarByClass(getClass());

    boolean success = job.waitForCompletion(true);

conf.get("fileInclusion") always returns null and I cannot figure out why. I've been working on this for quite a while now and I'm pretty much at the end of my rope. Why is the configuration different? I'm submitting the job with both "hadoop jar" and "yarn jar".

Answers


Instead of creating the Job object by passing getConf() as the argument, try populating a Configuration first and then creating the job from it:

Configuration conf = new Configuration();
conf.set("outputPath", opts.outputPath);
conf.set("mapred.input.pathFilter.class", AvroFileInclusionFilter.class.getName());
// ... set the remaining keys on conf ...

// After setting the required key/value pairs on the Configuration,
// create the Job object by supplying that conf
Job job = Job.getInstance(conf, "Stock Updater");
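
As a side note, since AvroKeyValueInputFormat extends the new-API FileInputFormat, the filter can also be registered through FileInputFormat.setInputPathFilter instead of typing the property key by hand; that helper writes the key the new MapReduce API actually reads. A minimal sketch:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Registers the path filter on the job; FileInputFormat stores it under the
// new-API property key, so no hand-typed "mapred..."/"mapreduce..." string is needed.
FileInputFormat.setInputPathFilter(job, AvroFileInclusionFilter.class);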

The PathFilter should implement Configurable instead of extending Configured.
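
For reference, a minimal sketch of the filter written that way, keeping the class and property names from the question (the comments are illustrative, not from the original post). FileInputFormat creates the registered filter with ReflectionUtils.newInstance(filterClass, conf), which calls setConf(...) on anything that implements Configurable, so the job configuration should then be available inside accept():

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class AvroFileInclusionFilter implements PathFilter, Configurable {
  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
      // Called by ReflectionUtils.newInstance(...) with the job configuration
      this.conf = conf;
  }

  @Override
  public Configuration getConf() {
      return conf;
  }

  @Override
  public boolean accept(Path path) {
      // With the configuration injected, the property set in the driver is visible here
      System.out.println("FileInclusion: " + conf.get("fileInclusion"));
      return true;
  }
}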

