One reason could be that Spark uses scratch disk space for intermediate calculations, so as you perform calculations that data needs to be flushed before memory can be reused for other operations.
A second issue could be that large intermediate data may push more of the RDD's data onto disk (something I see in wareh
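(As a minimal Spark-shell sketch of that second point: an explicit MEMORY_AND_DISK storage level lets partitions that do not fit in memory spill to disk instead of being dropped. The dataset path and the `lines` name below are assumptions for illustration, not from this thread.)

    import org.apache.spark.storage.StorageLevel

    // Hypothetical input path; replace with the real dataset location.
    val lines = sc.textFile("hdfs:///some/large/dataset")

    // MEMORY_AND_DISK writes partitions that do not fit in memory to disk
    // instead of dropping them (which is what the MEMORY_ONLY level used
    // by cache() does).
    val persisted = lines.persist(StorageLevel.MEMORY_AND_DISK)

    persisted.count()   // first action materialises the cache and may spill
    persisted.count()   // later actions reuse the cached partitions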
Hi Matei,
Could you enlighten us on this please?
Thanks
Pierre
On 11 Apr 2014, at 14:49, Jérémy Subtil wrote:
Hi Xusen,
I was convinced the cache() method would involve only in-memory operations and have nothing to do with disk, since the underlying default cache strategy is MEMORY_ONLY. Am I missing something?
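(A quick sketch of that assumption, using a hypothetical `rdd` and path in the Spark shell: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), under which partitions that do not fit in memory are recomputed rather than written to disk.)

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///path/to/data")   // path is an assumption

    // cache() is just persist(StorageLevel.MEMORY_ONLY): partitions that do
    // not fit in memory are dropped and recomputed, never written to disk.
    rdd.cache()
    assert(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)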
2014-04-11 11:44 GMT+02:00 尹绪森 :
Hi Pierre,
1. cache() costs time to copy data from disk into memory, so please do not use cache() if your job is not an iterative one.
2. If your dataset is larger than the available memory, a replacement strategy will exchange data between memory and disk.
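(A minimal sketch of the iterative case where persisting pays off, assuming a hypothetical dataset of comma-separated doubles; MEMORY_AND_DISK is chosen so partitions that do not fit in memory are spilled rather than recomputed each pass.)

    import org.apache.spark.storage.StorageLevel

    // Hypothetical input: one comma-separated row of doubles per line.
    val points = sc.textFile("hdfs:///path/to/points")
      .map(_.split(",").map(_.toDouble))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Each pass of the loop reuses the cached partitions instead of
    // re-reading and re-parsing the input files.
    var total = 0.0
    for (i <- 1 to 10) {
      total += points.map(_.sum).reduce(_ + _)
    }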
2014-04-11 0:07 GMT+08:0
Hi there,
Just playing around in the Spark shell, I am now a bit confused by the performance I observe when the dataset does not fit into memory:
- I load a dataset with roughly 500 million rows
- I do a count; it takes about 20 seconds
- now if I cache the RDD and do a count again (which will