Thanks for the clarification on the partitioning.
I did what you suggested and tried reading in individual part-* files --
some of them are ~1.7 GB in size and that's where it's failing. When I
increase the number of partitions before writing to disk, it seems to work.
Would be nice if this was som
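For reference, roughly what worked for me looks like this (a minimal sketch; the RDD name and paths are placeholders, not my actual job):

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-before-save")

    # Placeholder: however the ~800k-element RDD is actually built.
    rdd = sc.pickleFile("hdfs:///path/to/original")

    # Raising the partition count before writing keeps each part-* file
    # well under the ~2 GB size where the read was failing for me.
    rdd.repartition(1024).saveAsPickleFile("hdfs:///path/to/repartitioned")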
HadoopRDD will try to split the file into partitions of about 64 MB each, so you
got 1916+ partitions (assuming ~100 KB per row, the data is ~80 GB in total).
I think there is only a very small chance that a single object or batch would be
bigger than 2 GB.
Maybe there is a bug in how it splits the pickled file; could you create
a R
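Something along these lines would show what's happening on the read side (a rough sketch; the path is a placeholder and the numbers are just the estimates above):

    # The default HadoopRDD split is roughly one partition per ~64 MB block,
    # so ~80 GB of pickled data lands in the 1000-2000 partition range.
    data = sc.pickleFile("hdfs:///path/to/dataset")   # placeholder path
    print(data.getNumPartitions())                    # 1916+ in this case

    # minPartitions can be passed to ask for more, smaller splits on read:
    data = sc.pickleFile("hdfs:///path/to/dataset", minPartitions=4000)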
Hi, thanks for the quick answer -- I suppose this is possible, though I
don't understand how it could come about. The largest individual RDD
elements are ~1 MB in size (most are smaller) and the RDD is composed of
800k of them. The file is saved in 134 parts, but is being read in using
some 1916+ partitions.
Maybe it's caused by an integer overflow; is it possible that one object
or batch is bigger than 2 GB (after pickling)?
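One way to check this directly, assuming the RDD that was saved is still around as `rdd` (names here are placeholders), and keeping in mind that saveAsPickleFile also batches elements (10 per batch by default), so a batch can be larger than any single element:

    import pickle

    # Pickled size of each element, in bytes.
    sizes = rdd.map(lambda x: len(pickle.dumps(x, protocol=2)))

    # Anything close to 2**31 (2147483648) bytes would overflow a signed
    # 32-bit length and could explain a failure like this.
    print(sizes.max())   # largest single pickled element
    print(sizes.sum())   # total pickled size, for comparison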
On Tue, Jan 27, 2015 at 7:59 AM, rok wrote:
> I've got a dataset saved with saveAsPickleFile using pyspark -- it saves
> without problems. When I try to read it back in, it fails with: