Hi Octavian,
Just out of curiosity, did you try persisting your RDD in a serialized format
("MEMORY_AND_DISK_SER" or "MEMORY_ONLY_SER")?
i.e. changing your:
"rdd.persist(MEMORY_AND_DISK)"
to
"rdd.persist(MEMORY_ONLY_SER)"
Regards
On Wed, Jun 10, 2015 at 7:27 AM, Imran Rashid wrote:
I agree with Richard. It looks like the issue here is shuffling, and
shuffle data is always written to disk, so the issue is definitely not that
all the output of flatMap has to be stored in memory.
If at all possible, I'd first suggest upgrading to a newer version of Spark
-- even in 1.2, there we
Are you sure it's memory-related? What is the disk utilization and IO
performance on the workers? The error you posted looks to be related to the
shuffle trying to obtain block data from another worker node and failing to
do so in a reasonable amount of time. It may still be memory-related, but I'm
not s
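If the fetch failures turn out to be timeout-related rather than purely about memory, one common workaround on Spark 1.x is to raise the connection/retry settings. A hedged sketch follows; the property names are from the 1.x configuration docs and the values are only illustrative, so please verify them against your Spark version:

import org.apache.spark.{SparkConf, SparkContext}

// Property names per the Spark 1.x configuration docs; values are guesses,
// not tuned recommendations.
val conf = new SparkConf()
  .setAppName("shuffle-timeout-tuning")
  // Wait longer for ACKs before a connection (and its shuffle fetch) is
  // declared dead, e.g. during long GC pauses on executors.
  .set("spark.core.connection.ack.wait.timeout", "600")
  // Retry failed shuffle block fetches a few more times before giving up.
  .set("spark.shuffle.io.maxRetries", "10")

val sc = new SparkContext(conf)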
I tried using reduceByKey, without success.
I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey.
However, I got the same error as before, namely the error described here:
http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...). I think
StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in
memory, and if there isn't sufficient space it will spill the rest to
disk.
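A rough spark-shell style sketch of that suggestion (the input path and the flatMap body are placeholders, not the actual job from this thread):

import org.apache.spark.storage.StorageLevel

// `sc` is the shell's predefined SparkContext; the path is a placeholder.
val lines = sc.textFile("hdfs:///path/to/input")

// DISK_ONLY never keeps cached partitions in executor memory, while
// MEMORY_AND_DISK caches what fits and spills the remainder to local disk.
val counts = lines
  .persist(StorageLevel.DISK_ONLY)             // or StorageLevel.MEMORY_AND_DISK
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

counts.count()                                 // triggers the job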
Thanks
Best Regards
On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea wrote: