Hadoop Streaming with RVM does not find Gem
Original question (long version below). Short version: Running Hadoop Streaming with a Ruby script as the mapper, with rvm installed on all cluster nodes, does not work, because Ruby is not known to the shell launched by Hadoop (and rvm is not loaded correctly). Why?
I wanted to use wukong as a gem to create map/reduce jobs for Hadoop. The problem is that the wukong gem cannot be loaded by Hadoop (i.e., it is not found). Hadoop jobs show me the following error:
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require': cannot load such file -- wukong (LoadError)
    from /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/site_ruby/1.9.1/rubygems/custom_require.rb:36:in `require'
    from /tmp/mapr-hadoop/mapred/local/taskTracker/admin/jobcache/job_201207061102_0068/attempt_201207061102_0068_m_000000_0/work/./test.rb:6:in `<main>'
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1109)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
However, doing cat somefile | ./test.rb --map works on all cluster machines as expected. I also included some debug printing in my test file, which I can retrieve from the Hadoop logs. When running
$stderr.puts `gem list`
it yields all the gems, including wukong. Printing $LOAD_PATH also yields the exact same paths as it does in a local Ruby script (one not launched by Hadoop).
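One caveat about the debug output above: `gem list` in backticks runs a separate `gem` executable in a subshell, so it may not reflect what the Ruby process running the mapper actually sees. A minimal sketch of an in-process check, assuming RubyGems 1.8 or later, is shown here; the helper name `gem_environment_report` is made up for illustration and is not part of wukong or Hadoop.

```ruby
require 'rubygems'

# Hypothetical helper: collect the settings RubyGems uses to locate gems
# inside *this* Ruby process (not in a subshell, as `gem list` would).
def gem_environment_report(gem_name)
  {
    'ruby_version' => RUBY_VERSION,
    'PATH'         => ENV['PATH'],
    'GEM_HOME'     => ENV['GEM_HOME'],
    'GEM_PATH'     => ENV['GEM_PATH'],
    'Gem.path'     => Gem.path,
    # true if at least one installed version of the gem is visible
    'gem_found'    => !Gem::Specification.find_all_by_name(gem_name).empty?
  }
end

# Write the report to stderr so it ends up in the Hadoop task logs.
gem_environment_report('wukong').each do |key, value|
  $stderr.puts "#{key}: #{value.inspect}"
end
```

Placing this before the `require` of the gem in the mapper and comparing the task-log output with a local run should show whether the two environments really match.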
Why does the Ruby script launched by Hadoop not find the gem, which is clearly installed and working correctly?
Hadoop is launched as:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar \
  -libjars /opt/hypertable/current/lib/java/hypertable-0.9.5.6.jar,/opt/hypertable/current/lib/java/libthrift-0.8.0.jar \
  -Dmapred.child.env="PATH=$PATH:/usr/local/rvm/bin/rvm" \
  -mapper '/home/admin/wukong/test.rb --map' \
  -file /home/admin/wukong/test.rb \
  -reducer /bin/cat \
  -input /test/test.rb \
  -output /test/something2
Are you using Ruby?
Make sure that you run:
rvm use 1.9.3
and echo $GEM_PATH should then print the gem paths for that Ruby.
If it does not, the use command did not work.
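The same GEM_PATH check can also be made from inside the mapper itself, so a misconfigured task fails with a readable message in the logs. This is only a sketch: the method name `rvm_gem_path?` is hypothetical, and the default root `/usr/local/rvm` is assumed from the paths in the stack trace in the question.

```ruby
# Hypothetical helper: true when GEM_PATH is set and at least one of its
# entries lies under the rvm installation root.
# The default root is an assumption taken from the question's stack trace.
def rvm_gem_path?(env = ENV, rvm_root = '/usr/local/rvm')
  entries = env['GEM_PATH'].to_s.split(File::PATH_SEPARATOR)
  entries.any? { |entry| entry.start_with?(rvm_root) }
end

unless rvm_gem_path?
  # Goes to the Hadoop task log; the job itself is not aborted here.
  $stderr.puts "GEM_PATH=#{ENV['GEM_PATH'].inspect} does not point into rvm; " \
               "`rvm use` has probably not taken effect in this shell"
end
```

Passing the environment in as an argument keeps the check easy to try against sample values without touching the real `ENV`.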
Somehow I managed to get it working by restarting all machines in the cluster. I assume rvm was not sourced correctly. Strange, though, that a reboot was needed.
Take a look here: http://zachmoshe.com/2015/02/23/use-ruby-gems-with-hadoop-streaming.html
It shows how to run Ruby mappers with gems using Hadoop Streaming.
I know it's 3 years after you asked, but maybe it'll help someone else.