Install Spark on YARN cluster

I am looking for a guide on how to install Spark on an existing virtual YARN cluster.

I have a YARN cluster consisting of two nodes. I ran a MapReduce job, which worked perfectly, and checked the results in the logs; everything is working fine.

Now I need to add the Spark installation commands and configuration files to my Vagrantfile. I can't find a good guide; could someone give me a good link?

I used this guide for the YARN cluster:

http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#single-node-installation

Thanks in advance!

Answers


I don't know about Vagrant, but I have installed Spark on top of Hadoop 2.6 (referred to in the guide as post-YARN), and I hope this helps.

Installing Spark on an existing Hadoop cluster is really easy; you only need to install it on one machine. To do that, download the version pre-built for your Hadoop version from the official website (I guess you could also use the "without Hadoop" version, but then you need to point it at the location of the Hadoop binaries on your system). Then decompress it:

tar -xvf spark-2.0.0-bin-hadoop2.x.tgz -C /opt
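For the download step itself, the Apache archive keeps all past releases, so you can fetch one from the command line too. The exact URL and archive name below follow the archive's layout and are illustrative; substitute the Spark version and Hadoop build you actually chose:

```shell
# Fetch a pre-built Spark release from the Apache archive
# (version and Hadoop build shown here are examples only --
# pick the build matching your Hadoop installation)
wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
```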

Now you only need to set some environment variables. First, in your ~/.bashrc (or ~/.zshrc), set SPARK_HOME and optionally add it to your PATH:

export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop2.x
export PATH=$PATH:$SPARK_HOME/bin

Also, for these changes to take effect, run:

source ~/.bashrc

Second, you need to point Spark at your Hadoop configuration directories. To do this, set these two environment variables in $SPARK_HOME/conf/spark-env.sh:

export HADOOP_CONF_DIR=[your-hadoop-conf-dir usually $HADOOP_PREFIX/etc/hadoop]
export YARN_CONF_DIR=[your-yarn-conf-dir usually the same as the last variable]

If this file doesn't exist, you can copy $SPARK_HOME/conf/spark-env.sh.template and start from there.
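Putting those two steps together, a minimal sketch of the spark-env.sh setup could look like this (paths assume the layout from the guide above, with the Hadoop client configuration under $HADOOP_PREFIX/etc/hadoop; adjust to your install):

```shell
# Create spark-env.sh from the shipped template, if it doesn't exist yet
cp -n "$SPARK_HOME/conf/spark-env.sh.template" "$SPARK_HOME/conf/spark-env.sh"

# Point Spark at the Hadoop/YARN client configuration
# (directory shown assumes the setup from the linked guide)
cat >> "$SPARK_HOME/conf/spark-env.sh" <<'EOF'
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
EOF
```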

Now, to start the shell in YARN mode, you can run:

spark-shell --master yarn --deploy-mode client

(You can't run the shell in cluster deploy-mode)
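To verify the installation end to end, you can also submit the SparkPi example that ships with the distribution. The jar path below assumes a Spark 2.x pre-built layout (older 1.x builds keep it under lib/ instead), and the trailing argument is just the number of partitions for the estimate:

```shell
# Smoke-test the install by submitting the bundled SparkPi example to YARN
# (jar location assumes a Spark 2.x pre-built distribution)
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10
```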

----------- Update

I forgot to mention that with this configuration you can also submit cluster jobs, like this (thanks @JulianCienfuegos):

spark-submit --master yarn --deploy-mode cluster project-spark.py

In this mode you can't see the output in the terminal, and the command exits as soon as the job is submitted (not completed).

You can also use --deploy-mode client to see the output right in your terminal, but only do this for testing, since the job gets canceled if the command is interrupted (e.g. you press Ctrl+C or your session ends).
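Since cluster mode doesn't print the driver's output to your terminal, you can pull it from YARN afterwards, assuming log aggregation is enabled in your yarn-site.xml (the application id below is a made-up example; use the one printed when you submitted, or look it up as shown):

```shell
# Find the application id of the job you submitted
yarn application -list -appStates FINISHED

# Fetch the aggregated logs (including the driver's stdout) for that job
# (the id below is illustrative -- substitute your own)
yarn logs -applicationId application_1234567890123_0001
```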

