Serializer that writes objects as a stream of (length, data) pairs,
where C{length} is a 32-bit integer and data is C{length} bytes.
Hence the limit on the size of a single object.
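For illustration, the framing described above amounts to roughly the following sketch. The helper name and error check here are illustrative, not the actual pyspark.serializers code; the point is that the length prefix is a signed 32-bit integer, so no single serialized object can exceed 2**31 - 1 bytes.

    import struct

    def write_with_length(obj_bytes, stream):
        # Prefix the serialized object with its length as a signed 32-bit
        # big-endian integer, then write the raw payload bytes.
        length = len(obj_bytes)
        if length > 0x7FFFFFFF:  # 2**31 - 1, the largest value the frame can hold
            raise ValueError("object too large to frame: %d bytes" % length)
        stream.write(struct.pack("!i", length))
        stream.write(obj_bytes)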
On Thu, Oct 8, 2015 at 12:56 PM, XIANDI wrote:
File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in
File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
process()
File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/hadoop/spark/python/pyspark/serializers.py", line 126, in
du
I'm always confused by partitioning. We may have many RDDs in the code. Do
we need to partition all of them? Do the RDDs get rearranged among all the
nodes whenever we repartition one? What is a wise way of doing
partitioning?
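Not from the original thread, but as a hedged illustration of the calls in question (the SparkContext setup, names, and partition counts below are made up for the example): repartition() and partitionBy() each reshuffle only the RDD they are called on; other RDDs in the job keep their existing partitioning.

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-sketch")  # illustrative app name

    rdd = sc.parallelize(range(1000), numSlices=4)

    # Reshuffle this RDD into 8 partitions; its data moves across the cluster,
    # but no other RDD is touched.
    evenly_spread = rdd.repartition(8)

    # For key-value RDDs, partitionBy controls which partition each key lands in,
    # so later joins/aggregations on the same keys can avoid another shuffle.
    pairs = rdd.map(lambda x: (x % 10, x)).partitionBy(8)

    print(evenly_spread.getNumPartitions(), pairs.getNumPartitions())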