Re: Spark to utilize HDFS's mmap caching

2014-05-14 Thread Sandy Ryza
It's worth mentioning that leveraging HDFS caching in Spark doesn't work smoothly out of the box right now. By default, cached files in HDFS will have 3 on-disk replicas and only one of these will be an in-memory replica. In its scheduling, Spark will prefer all equally, meaning that, even when r
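
A rough way to see what equal preference costs, as a sketch only: assume (a simplification not stated in the thread) that each task is placed uniformly at random on one of the hosts holding a replica of its block. The host names and both helper functions are hypothetical and are not part of Spark or HDFS.

```python
import random
from fractions import Fraction


def expected_cache_hit_rate(replica_hosts, cached_hosts):
    """Exact chance that a task lands on a host holding the in-memory
    replica, when every replica host is treated as equally preferred."""
    replicas = set(replica_hosts)
    if not replicas:
        raise ValueError("a block needs at least one replica host")
    return Fraction(len(replicas & set(cached_hosts)), len(replicas))


def simulate_cache_hit_rate(replica_hosts, cached_hosts, tasks, seed=0):
    """Monte Carlo version of the same model: pick one replica host per
    task uniformly at random and count how often it is a cached one."""
    rng = random.Random(seed)
    hosts = sorted(set(replica_hosts))
    cached = set(cached_hosts)
    hits = sum(1 for _ in range(tasks) if rng.choice(hosts) in cached)
    return hits / tasks


# Hypothetical layout: 3 on-disk replicas, 1 of them also cached in memory.
replicas = ["host-a", "host-b", "host-c"]
cached = ["host-a"]

print(expected_cache_hit_rate(replicas, cached))
print(simulate_cache_hit_rate(replicas, cached, tasks=10_000))
```

Under this model only about one task in three reads from the in-memory replica. Caching more replicas, or making the scheduler aware of which replica is cached, would raise that fraction.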

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Chanwit Kaewkasi
Great to know that! Thank you, Matei.

Best regards,
-chanwit

--
Chanwit Kaewkasi
linkedin.com/in/chanwit

On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia wrote:
> That API is something the HDFS administrator uses outside of any application
> to tell HDFS to cache certain files or directories.

Re: Spark to utilize HDFS's mmap caching

2014-05-13 Thread Marcelo Vanzin
On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia wrote:
> That API is something the HDFS administrator uses outside of any application
> to tell HDFS to cache certain files or directories. But once you’ve done
> that, any existing HDFS client accesses them directly from the cache.

Ah, yeah, sure

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Marcelo Vanzin
Is that true? I believe the API Chanwit is talking about requires explicitly asking for files to be cached in HDFS. Spark automatically benefits from the kernel's page cache (i.e., if a block is already in the kernel's page cache, it will be read more quickly). But the explicit HDFS cache is a differen

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Matei Zaharia
That API is something the HDFS administrator uses outside of any application to tell HDFS to cache certain files or directories. But once you’ve done that, any existing HDFS client accesses them directly from the cache.

Matei

On May 12, 2014, at 11:10 AM, Marcelo Vanzin wrote:
> Is that true?
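
For illustration, the administrator-side step described above might look like the following sketch. It only builds and prints `hdfs cacheadmin` command lines and does not run them. The pool name `spark-pool`, the path `/data/logs`, and the helper `cache_commands` are hypothetical. The `-addPool`, `-addDirective`, `-path`, `-pool` and `-replication` options are the ones I recall from the HDFS centralized cache management documentation, so check them against your Hadoop version.

```python
import shlex


def cache_commands(pool, path, replication=None):
    """Return argv lists for creating a cache pool and adding a cache
    directive for `path` to it. Nothing is executed here."""
    add_pool = ["hdfs", "cacheadmin", "-addPool", pool]
    add_directive = [
        "hdfs", "cacheadmin", "-addDirective",
        "-path", path,
        "-pool", pool,
    ]
    if replication is not None:
        # Assumed option: number of in-memory (cached) replicas to keep.
        add_directive += ["-replication", str(replication)]
    return [add_pool, add_directive]


# Hypothetical pool and path, printed as shell-ready command lines.
for argv in cache_commands("spark-pool", "/data/logs"):
    print(shlex.join(argv))
```

Once a directive like this is in place, applications need no code changes; as noted above, existing HDFS clients read the cached data directly.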

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Matei Zaharia
Yes, Spark goes through the standard HDFS client and will automatically benefit from this.

Matei

On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi wrote:
> Hi all,
>
> Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
> sc.textFile() and other HDFS-related APIs?
>
> http://hadoop.apac