Re: flatMap output on disk / flatMap memory overhead

2015-08-01 Thread Puneet Kapoor
Hi Octavian, just out of curiosity, did you try persisting your RDD in a serialized format, i.e. "MEMORY_AND_DISK_SER" or "MEMORY_ONLY_SER" -- changing your rdd.persist(MEMORY_AND_DISK) to rdd.persist(MEMORY_ONLY_SER)? Regards
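
For reference, a minimal Scala sketch of the change being suggested (assuming an existing rdd variable; the names here are illustrative, not taken from the original post):

    import org.apache.spark.storage.StorageLevel

    // Serialized storage levels keep each cached partition as one byte buffer,
    // trading some CPU (ser/deser) for a much smaller memory footprint.
    val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    // or, to let serialized blocks spill to disk when memory runs out:
    // val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)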

Re: flatMap output on disk / flatMap memory overhead

2015-06-09 Thread Imran Rashid
I agree with Richard. It looks like the issue here is shuffling, and shuffle data is always written to disk, so the issue is definitely not that all the output of flatMap has to be stored in memory. If at all possible, I'd first suggest upgrading to a newer version of Spark -- even in 1.2, there were ...

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Richard Marscher
Are you sure it's memory related? What is the disk utilization and I/O performance on the workers? The error you posted looks to be related to the shuffle trying to obtain block data from another worker node and failing to do so in a reasonable amount of time. It may still be memory related, but I'm not sure ...
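
(Not from the thread itself, but for context: the block-fetch timeout Richard is describing was configurable in Spark 1.x. A hedged sketch, assuming the spark.core.connection.ack.wait.timeout setting from that era; check the docs for your version before relying on it:)

    import org.apache.spark.SparkConf

    // Illustration only: give remote workers more time to answer shuffle block
    // requests before the fetch is treated as failed. The value is in seconds
    // and is a placeholder, not a recommendation.
    val conf = new SparkConf()
      .set("spark.core.connection.ack.wait.timeout", "600")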

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread octavian.ganea
I tried using reduceByKey, without success. I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey. However, I got the same error as before, namely the error described here: http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-
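
In code, the pipeline being described is roughly the following (a sketch only: rdd is assumed to be an RDD[String], and the word-count-style flatMap/reduce functions are stand-ins for the original job's logic):

    import org.apache.spark.storage.StorageLevel

    // Persist the input, expand it with flatMap, then combine values per key.
    // The shuffle triggered by reduceByKey is where the fetch error surfaces.
    val result = rdd
      .persist(StorageLevel.MEMORY_AND_DISK)
      .flatMap(line => line.split(" ").map(w => (w, 1L)))  // stand-in expansion
      .reduceByKey(_ + _)                                   // stand-in combine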

Re: flatMap output on disk / flatMap memory overhead

2015-06-01 Thread Akhil Das
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...). I think StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in memory and, if there isn't sufficient space, it will be spilled to disk. Thanks, Best Regards
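
As a sketch of the two storage levels mentioned (illustrative; the rdd variable is assumed, and an RDD's storage level can only be assigned once):

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK: keep deserialized partitions in memory, spill to local
    // disk only when they don't fit.
    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
    // DISK_ONLY: never hold cached partitions on the heap; always read them
    // back from local disk.
    // val cached = rdd.persist(StorageLevel.DISK_ONLY)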