Write images as the values of a sequence file with PySpark

I am using PySpark to write sequence files where the key is the image filename and the value is the image represented as a bytestring:

from PIL import Image           # PIL backs skimage's default I/O plugin
from StringIO import StringIO   # Python 2
from skimage import io          # provides the io.imread / io.imsave calls below

def get_image(filename):
    s = StringIO()
    im = io.imread(filename)
    io.imsave(s, im)            # encode the image into the in-memory buffer
    return [(filename, s)]

rdd = sc.parallelize(filenames)
rdd.flatMap(get_image).saveAsSequenceFile("/user/myname/output")

but PySpark throws an exception indicating that the JVM-side unpickler does not support one of the pickle opcodes:

Caused by: net.razorvine.pickle.InvalidOpcodeException: opcode not implemented: OBJ
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:224)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
    at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
    at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1298)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    ... 1 more

Answers


The OBJ opcode is emitted when pickle serializes an instance of a Python class/object. In my case, I didn't intend to write an object to the sequence file at all, so the fix was simply to correct that bug: in the code above, the value being saved is the StringIO object `s` itself rather than its byte contents (`s.getvalue()`).
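
A minimal sketch of the corrected function, using PIL (already imported in the question) to encode the image in memory; the PNG format choice is an assumption:

from StringIO import StringIO   # io.BytesIO on Python 3
from PIL import Image

def get_image(filename):
    buf = StringIO()
    # Encode the image into an in-memory buffer instead of a file on disk.
    Image.open(filename).save(buf, format='PNG')
    # Return the raw bytes, not the buffer object, so the pair pickles
    # as plain strings that Pyrolite on the JVM side can unpickle.
    return [(filename, buf.getvalue())]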

As for the overall ecosystem, the problem is that Spark uses Pyrolite 4.13, but OBJ encoding/decoding wasn't introduced into Pyrolite until version 4.17. As for what to do about this, you have a few options:

  1. Convince the Spark maintainers, through a pull request or GitHub issue, to upgrade to a later version of Pyrolite.
  2. Build your own version of Spark against that version of Pyrolite.
  3. Don't write classes/objects to your sequence files (see the sketch below).
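
If you take option 3, it helps to sanity-check what actually lands in the file. A short sketch reusing the question's output path; wrapping the value in bytearray is a suggestion so binary data maps to a BytesWritable on the JVM side rather than Text:

# Save (str, bytearray) pairs; bytearray converts to BytesWritable.
pairs = sc.parallelize(filenames).flatMap(get_image).mapValues(bytearray)
pairs.saveAsSequenceFile("/user/myname/output")

# Read the pairs back and confirm the value round-trips as raw bytes.
name, data = sc.sequenceFile("/user/myname/output").first()
print name, len(data)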
