FW: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Xiandi Zhang
Serializer that writes objects as a stream of (length, data) pairs, where C{length} is a 32-bit integer and data is C{length} bytes. Hence the limit on the size of an object. On Thu, Oct 8, 2015 at 12:56 PM, XIANDI wrote: File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in …
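In other words, every object is written as a 4-byte length prefix followed by the payload, so a single serialized object cannot exceed 2**31 - 1 bytes. A minimal sketch of that framing, loosely modeled on the serializer described above (the helper name write_with_length is illustrative, not necessarily Spark's exact API):

    import struct

    MAX_FRAME = (1 << 31) - 1  # largest value a signed 32-bit length prefix can hold (~2 GiB)

    def write_with_length(serialized, stream):
        # Write one (length, data) frame: a 4-byte big-endian length, then the bytes.
        if len(serialized) > MAX_FRAME:
            # A 32-bit length field cannot describe a larger payload,
            # hence the "can not serialize object larger than 2G" error.
            raise ValueError("can not serialize object larger than 2G")
        stream.write(struct.pack("!i", len(serialized)))
        stream.write(serialized)

Splitting the data into more records or more partitions keeps each framed object under the limit.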

ValueError: can not serialize object larger than 2G

2015-10-08 Thread XIANDI
File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main process() File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/hadoop/spark/python/pyspark/serializers.py", line 126, in du

Partitions on RDDs

2015-09-22 Thread XIANDI
I'm always confused by partitions. We may have many RDDs in the code. Do we need to partition all of them? Do the RDDs get rearranged among all the nodes whenever we do a partition? What is a wise way of doing partitions?
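For reference, a minimal PySpark sketch of the common partitioning calls touched on above, assuming a local SparkContext (the data and partition counts are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "partition-demo")
    pairs = sc.parallelize([(i % 10, i) for i in range(1000)], 4)

    # repartition(n) always triggers a full shuffle: records are rearranged
    # across the cluster into n new partitions.
    shuffled = pairs.repartition(8)

    # coalesce(n) merges existing partitions without a shuffle when shrinking,
    # so it is cheaper if you only want fewer partitions.
    merged = pairs.coalesce(2)

    # partitionBy(n) hash-partitions a pair RDD by key; later key-based
    # operations that reuse the same partitioner can avoid another shuffle.
    by_key = pairs.partitionBy(4)

    print(shuffled.getNumPartitions(), merged.getNumPartitions(),
          by_key.getNumPartitions())

You don't need to partition every RDD: narrow transformations (map, filter) inherit their parent's partitioning, and data only moves between nodes during wide operations such as repartition, join, or reduceByKey.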