Spark and IPython on CentOS 7
I am experimenting with Hadoop and Spark, as the company I work for is getting ready to spin up Hadoop and wants to use Spark and other tools to do a lot of machine learning on our data. Most of that work falls to me, so I am preparing by learning on my own.
I have a machine set up as a single-node Hadoop cluster. Here is what I have:
- CentOS 7 (minimal server install, added XOrg and OpenBox for GUI)
- Python 2.7
- Hadoop 2.7.2
- Spark 2.0.0
I followed these guides to set this up:
When I attempt to run 'pyspark' I get the following:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.
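Reading the message, my guess is that the old variables should come out of the environment, wherever they were set, and be replaced with something like this (just my interpretation, using ipython as the driver):

```shell
# Guess based on the error message: drop the Spark 1.x variables
# and set the Spark 2.0 replacements instead.
unset IPYTHON IPYTHON_OPTS
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=""
```

But I'm not sure whether that belongs in my shell profile or in the pyspark script itself.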
I opened the pyspark file in vi and examined it. There is a lot going on in there, but I don't know where to start making the corrections I need. The pyspark script itself is under /opt/spark-latest/bin/, and my Hadoop installation (though I don't think it factors in here) is under /opt/hadoop/. I know there must be a change I need to make somewhere, I just don't know where to begin. Some googling turned up references to similar issues, but nothing that laid out the steps to fix this.
Can anyone give me a push in the right direction?
If you are just starting to learn Spark in a Hadoop environment, note that at the moment Spark 2.0 isn't officially supported by the major Hadoop distributions (Cloudera CDH or Hortonworks HDP). I'll assume your company isn't standing up Hadoop outside one of those distributions, since enterprise support matters.
That said, Spark 1.6 (with Hadoop 2.6) is the latest supported combination, because Spark 2.0 introduces a few breaking changes.
Now, if you use Spark 1.6, you shouldn't get those errors at all. Anaconda isn't strictly necessary (the PySpark and Scala shells should just work). If you want Jupyter notebooks, look at Apache Toree, which I've had good success setting notebooks up with. Otherwise, Apache Zeppelin is probably the recommended notebook environment for a production Hadoop cluster.
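If Toree isn't an option, plain Jupyter can also serve as the PySpark driver through the same variables named in your error message. A sketch, where the pyspark path is taken from your question and the notebook option is assumed:

```shell
# Sketch: make Jupyter the driver Python for the PySpark shell.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
# Launching pyspark then starts a notebook server; inside a notebook
# the SparkContext is available as `sc`:
# /opt/spark-latest/bin/pyspark
```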