It seems that the default serializer used by pyspark can't serialize a list
of functions.
I've seen some posts about working around this by using dill instead of
pickle for serialization.
Does anyone know what the status of that project is, or whether there's
another easy workaround?
I've pasted a sample error message below. Here, regs is a function defined
in another file, myfile.py, which is shipped to all workers via the
pyFiles argument to SparkContext:
sc = SparkContext("local", "myapp", pyFiles=["myfile.py"])
  File "runfile.py", line 45, in __init__
    regsRDD = sc.parallelize([regs]*self.n)
  File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py", line 223, in parallelize
    serializer.dump_stream(c, tempFile)
  File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 182, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 118, in dump_stream
    self._write_with_length(obj, stream)
  File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 128, in _write_with_length
    serialized = self.dumps(obj)
  File "/Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py", line 270, in dumps
    def dumps(self, obj): return cPickle.dumps(obj, 2)
cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
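
For illustration, one possible workaround is to serialize the names of the
functions rather than the function objects, and look each function up again
on the receiving side. The sketch below uses only the standard pickle module
and is not tested against Spark itself. The names reg_a, reg_b, REGISTRY,
resolve and can_pickle are hypothetical stand-ins (they are not part of
myfile.py or of PySpark), and it assumes the functions can be registered
under a string name in a module that every worker can import.

```python
import pickle


# Hypothetical stand-ins for functions that would live in myfile.py.
def reg_a(x):
    return x + 1


def reg_b(x):
    return x * 2


# Hypothetical registry: only the string keys are ever serialized.
REGISTRY = {"reg_a": reg_a, "reg_b": reg_b}


def resolve(name):
    """Look a function up by name on the receiving side."""
    return REGISTRY[name]


def can_pickle(obj):
    """Return True if the standard pickler can serialize obj."""
    try:
        pickle.dumps(obj, 2)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Serialize a list of names instead of a list of functions.
names = ["reg_a", "reg_b"] * 2
payload = pickle.dumps(names, 2)

# On the receiving side, turn the names back into callables and use them.
restored = [resolve(n) for n in pickle.loads(payload)]
results = [f(10) for f in restored]
```

With Spark, the same idea would presumably mean passing the list of names to
sc.parallelize and calling resolve inside the function given to map, so that
the function objects themselves never go through the data serializer.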
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.