Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Mayur Rustagi
One reason could be that Spark uses scratch disk space for intermediate calculations, so as you perform calculations that data needs to be flushed before you can leverage memory for operations. A second issue could be that large intermediate data may push more of the RDD's data onto disk (something I see in wareh…
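As background for the scratch-space point above, a minimal sketch of where that space is configured: shuffle and other intermediate data go to the directories named by the standard spark.local.dir property, independently of any RDD storage level. The app name and path below are hypothetical.

    // Scratch space for shuffles and intermediate spills is controlled by
    // spark.local.dir, not by the RDD storage level.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("caching-experiment")             // hypothetical app name
      .set("spark.local.dir", "/mnt/big-scratch")   // hypothetical path on a large volume
    val sc = new SparkContext(conf)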

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Pierre Borckmans
Hi Matei, Could you enlighten us on this please? Thanks, Pierre

On 11 Apr 2014, at 14:49, Jérémy Subtil wrote:
> Hi Xusen,
>
> I was convinced the cache() method would involve in-memory-only operations
> and has nothing to do with disks, as the underlying default cache strategy is
> MEMORY_ONLY…

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread Jérémy Subtil
Hi Xusen,

I was convinced the cache() method would involve in-memory-only operations and has nothing to do with disks, as the underlying default cache strategy is MEMORY_ONLY. Am I missing something?

2014-04-11 11:44 GMT+02:00 尹绪森:
> Hi Pierre,
>
> 1. cache() would cost time to carry stuff from disk to memory…
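For reference, cache() is indeed just shorthand for persist() with the MEMORY_ONLY storage level. A minimal Spark-shell sketch (the rdd value is assumed to already exist):

    import org.apache.spark.storage.StorageLevel

    // These two calls are equivalent: the default storage level is MEMORY_ONLY.
    rdd.cache()
    // rdd.persist(StorageLevel.MEMORY_ONLY)
    //
    // With MEMORY_ONLY, partitions that do not fit in memory are not written
    // to disk; they are simply left uncached and recomputed from the lineage
    // when they are needed again.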

Re: Behaviour of caching when dataset does not fit into memory

2014-04-11 Thread 尹绪森
Hi Pierre,

1. cache() would cost time to carry stuff from disk to memory, so please do not use cache() if your job is not an iterative one.
2. If your dataset is larger than the available memory, there will be a replacement strategy to exchange data between memory and disk.

2014-04-11 0:07 GMT+08:0…
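If you actually want overflowing partitions exchanged between memory and disk rather than recomputed, the storage level has to say so explicitly. A minimal sketch, assuming a Spark shell where sc exists; the input path is hypothetical:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK keeps as many partitions in memory as fit and spills
    // the rest to local disk, instead of dropping and recomputing them.
    val data = sc.textFile("hdfs:///path/to/big-input")   // hypothetical path
    data.persist(StorageLevel.MEMORY_AND_DISK)
    data.count()   // materializes the persisted partitions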

Behaviour of caching when dataset does not fit into memory

2014-04-10 Thread Pierre Borckmans
Hi there,

Just playing around in the Spark shell, I am now a bit confused by the performance I observe when the dataset does not fit into memory:
- I load a dataset with roughly 500 million rows
- I do a count; it takes about 20 seconds
- Now if I cache the RDD and do a count again (which will…
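A minimal Spark-shell sketch of the experiment described above (the input path is hypothetical, and the timings depend entirely on the cluster):

    // Spark shell: sc is the pre-built SparkContext.
    val rdd = sc.textFile("hdfs:///data/500M-rows.txt")   // hypothetical path

    rdd.count()    // first count: reads straight from the input (~20 s here)

    rdd.cache()    // default MEMORY_ONLY level
    rdd.count()    // this count also fills the cache with whatever fits
    rdd.count()    // later counts reuse the cached partitions and recompute
                   // the ones that did not fit in memory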