FW: ValueError: can not serialize object larger than 2G

2015-10-08 Thread Xiandi Zhang
Serializer that writes objects as a stream of (length, data) pairs, where C{length} is a 32-bit integer and data is C{length} bytes. Hence the limit on the size of an object. On Thu, Oct 8, 2015 at 12:56 PM, XIANDI wrote: File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in …
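In other words, every object is written as a 4-byte length prefix followed by the payload, so a single serialized object cannot exceed 2**31 - 1 bytes. A minimal sketch of that framing, loosely modeled on the serializer described above (the helper name write_with_length is illustrative, not necessarily Spark's exact API):

    import struct

    MAX_FRAME = (1 << 31) - 1  # largest value a signed 32-bit length prefix can hold (~2 GiB)

    def write_with_length(serialized, stream):
        # Write one (length, data) frame: a 4-byte big-endian length, then the bytes.
        if len(serialized) > MAX_FRAME:
            # A 32-bit length field cannot describe a larger payload,
            # hence the "can not serialize object larger than 2G" error.
            raise ValueError("can not serialize object larger than 2G")
        stream.write(struct.pack("!i", len(serialized)))
        stream.write(serialized)

Splitting the data into more records or more partitions keeps each framed object under the limit.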

ValueError: can not serialize object larger than 2G

2015-10-08 Thread XIANDI
File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main process() File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/hadoop/spark/python/pyspark/serializers.py", line 126, in du

Partitions on RDDs

2015-09-22 Thread XIANDI
I'm always confused by partitions. We may have many RDDs in the code. Do we need to partition all of them? Do the RDDs get rearranged among all the nodes whenever we do a partition? What is a wise way of doing partitions?
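For reference, a minimal PySpark sketch of the common partitioning calls touched on above, assuming a local SparkContext (the data and partition counts are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "partition-demo")
    pairs = sc.parallelize([(i % 10, i) for i in range(1000)], 4)

    # repartition(n) always triggers a full shuffle: records are rearranged
    # across the cluster into n new partitions.
    shuffled = pairs.repartition(8)

    # coalesce(n) merges existing partitions without a shuffle when shrinking,
    # so it is cheaper if you only want fewer partitions.
    merged = pairs.coalesce(2)

    # partitionBy(n) hash-partitions a pair RDD by key; later key-based
    # operations that reuse the same partitioner can avoid another shuffle.
    by_key = pairs.partitionBy(4)

    print(shuffled.getNumPartitions(), merged.getNumPartitions(),
          by_key.getNumPartitions())

You don't need to partition every RDD: narrow transformations (map, filter) inherit their parent's partitioning, and data only moves between nodes during wide operations such as repartition, join, or reduceByKey.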