Mahout - Exception: Java Heap space

I'm trying to convert some text files to Mahout sequence files using:

mahout seqdirectory -i Lastfm-ArtistTags2007 -o seqdirectory

But all I get is an OutOfMemoryError, as shown here:

Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout/mahout-examples-0.9-job.jar
14/04/07 16:44:34 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[Lastfm-ArtistTags2007], --keyPrefix=[], --method=[mapreduce], --output=[seqdirectoryjps], --startPhase=[0], --tempDir=[temp]}
14/04/07 16:44:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/07 16:44:35 INFO input.FileInputFormat: Total input paths to process : 4
14/04/07 16:44:35 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/07 16:44:35 INFO mapred.JobClient: Running job: job_local407267609_0001
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Starting task: attempt_local407267609_0001_m_000000_0
14/04/07 16:44:35 INFO util.ProcessTree: setsid exited with exit code 0
14/04/07 16:44:35 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6ad3ad65
14/04/07 16:44:35 INFO mapred.MapTask: Processing split: Paths:/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/README.txt:0+2472,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/ArtistTags.dat:0+71652722,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/tags.txt:0+1739746,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/artists.txt:0+327051
14/04/07 16:44:35 INFO compress.CodecPool: Got brand-new compressor
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/07 16:44:35 WARN mapred.LocalJobRunner: job_local407267609_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
    at org.apache.mahout.text.WholeFileRecordReader.nextKeyValue(WholeFileRecordReader.java:118)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
14/04/07 16:44:36 INFO mapred.JobClient:  map 0% reduce 0%
14/04/07 16:44:36 INFO mapred.JobClient: Job complete: job_local407267609_0001
14/04/07 16:44:36 INFO mapred.JobClient: Counters: 0
14/04/07 16:44:36 INFO driver.MahoutDriver: Program took 1749 ms (Minutes: 0.02915)

I am using Mahout 0.9, Hadoop 1.2.1 and OpenJDK Java7u25.

Setting MAHOUT_HEAPSIZE to 4096 did not help. The text files can be found here: http://static.echonest.com/Lastfm-ArtistTags2007.tar.gz

Answers


The spawned job is currently running under the local job runner (note LocalJobRunner and the job_local... ID in your log), so execution happens only on the node where you launched it. To make the execution distributed, specify the JobTracker address by setting the property mapred.job.tracker in your mapred-site.xml.

Running in distributed mode might solve your OutOfMemoryError.

Note that the environment variable HADOOP_CONF_DIR is empty in your output. Set it with export HADOOP_CONF_DIR=/etc/hadoop/conf, and make sure the property mapred.job.tracker in /etc/hadoop/conf/mapred-site.xml points to your JobTracker.
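
The steps above can be sketched roughly as follows. This assumes the configuration directory is /etc/hadoop/conf, as in the answer; the helper function job_tracker_configured is a hypothetical name used only for illustration and is not part of Hadoop or Mahout.

```shell
# Hypothetical helper: succeeds if mapred-site.xml in the given directory
# mentions the mapred.job.tracker property, fails otherwise
# (including when the file does not exist).
job_tracker_configured() {
  grep -q 'mapred.job.tracker' "$1/mapred-site.xml" 2>/dev/null
}

# Point Hadoop (and therefore Mahout) at the cluster configuration.
# Adjust the path if your installation keeps its config elsewhere.
export HADOOP_CONF_DIR=/etc/hadoop/conf

if job_tracker_configured "$HADOOP_CONF_DIR"; then
  echo "mapred.job.tracker is set in $HADOOP_CONF_DIR/mapred-site.xml"
else
  # In Hadoop 1.x the property defaults to "local", which selects LocalJobRunner.
  echo "mapred.job.tracker not found; jobs will use the local job runner"
fi
```

If the check passes, re-running the mahout seqdirectory command from the same shell should no longer print an empty HADOOP_CONF_DIR= in its first line of output.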

