Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-29 Thread Rok Roskar
Thanks for the clarification on the partitioning. I did what you suggested and tried reading in individual part-* files -- some of them are ~1.7 GB in size and that's where it's failing. When I increase the number of partitions before writing to disk, it seems to work. Would be nice if this was som...
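
A minimal sketch of that workaround, assuming a pyspark shell where sc is already defined; the stand-in RDD, partition count, and output path are illustrative, not from the thread:

    # Repartition before saving so no single part file approaches 2 GB;
    # for the ~80 GB dataset discussed here, 2048 parts is roughly 40 MB each.
    rdd = sc.parallelize(range(800000))  # stand-in for the real 800k-element RDD
    rdd.repartition(2048).saveAsPickleFile("/tmp/pickled_rdd")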

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Davies Liu
HadoopRDD will try to split the file into partitions of 64 MB each, so you got 1916+ partitions (assuming ~100 KB per row, they are ~80 GB in total). I think there is a very small chance that one object or one batch will be bigger than 2 GB. Maybe there is a bug when it splits the pickled file; could you create a R...
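
For reference, a sketch of the read side under the same assumptions (the path and shell sc are hypothetical); pickleFile accepts a minPartitions hint, and getNumPartitions reports how many splits HadoopRDD actually produced:

    # Read the pickled RDD back; HadoopRDD chooses the real split count,
    # typically one split per ~64 MB of input.
    rdd = sc.pickleFile("/tmp/pickled_rdd", minPartitions=2000)
    print(rdd.getNumPartitions())  # e.g. the 1916+ splits reported above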

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Rok Roskar
Hi, thanks for the quick answer -- I suppose this is possible, though I don't understand how it could come about. The largest individual RDD elements are ~1 MB in size (most are smaller) and the RDD is composed of 800k of them. The file is saved in 134 parts, but is being read in using some 1916+...
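
One way to check the element-size claim above, sketched for a pyspark shell (rdd and the pickle protocol are assumptions):

    import pickle
    # Largest single element after pickling; ~1 MB per the message above.
    print(rdd.map(lambda x: len(pickle.dumps(x, protocol=2))).max())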

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread Davies Liu
Maybe it's caused by integer overflow; is it possible that one object or batch is bigger than 2 GB (after pickling)?

On Tue, Jan 27, 2015 at 7:59 AM, rok wrote:
> I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
> without problems. When I try to read it back in, it fails with: ...
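
A hedged diagnostic for this overflow hypothesis: estimate the pickled size of each partition's contents and see whether anything approaches 2**31 bytes (rdd is assumed to be the dataset in question):

    import pickle
    # Approximate pickled bytes per partition; a value near 2**31 would explain
    # a NegativeArraySizeException from a signed 32-bit length overflow.
    sizes = rdd.mapPartitions(
        lambda it: [sum(len(pickle.dumps(x, protocol=2)) for x in it)]
    ).collect()
    print(max(sizes))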