It's worth mentioning that leveraging HDFS caching in Spark doesn't work
smoothly out of the box right now. By default, cached files in HDFS will
have 3 on-disk replicas and only one of these will be an in-memory replica.
In its scheduling, Spark will prefer all replicas equally, meaning that, even
when a file is cached, a task is just as likely to be scheduled on a node that
only holds an on-disk replica as on the node with the in-memory copy.
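
That default is adjustable, though: a directive's cache replication is separate
from the on-disk replication, so an administrator can ask for in-memory copies
on more datanodes when caching a directory. A rough sketch using the HDFS
client API; the namenode URI, path, and pool name are made up, the pool is
assumed to exist, and hdfs cacheadmin -addDirective -path /data/events
-pool spark-pool -replication 3 is the command-line equivalent:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo

object RaiseCacheReplication {
  def main(args: Array[String]): Unit = {
    // Connect to the (hypothetical) namenode as an administrator would.
    val dfs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
      .asInstanceOf[DistributedFileSystem]
    // Cache /data/events with 3 in-memory replicas instead of the default 1,
    // so more of the replica locations can also serve reads from memory.
    val directive = new CacheDirectiveInfo.Builder()
      .setPath(new Path("/data/events"))
      .setPool("spark-pool")
      .setReplication(java.lang.Short.valueOf(3.toShort))
      .build()
    dfs.addCacheDirective(directive)
  }
}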
Great to know that! Thank you, Matei.
Best regards,
-chanwit
--
Chanwit Kaewkasi
linkedin.com/in/chanwit
On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia wrote:
> That API is something the HDFS administrator uses outside of any application
> to tell HDFS to cache certain files or directories. But once you’ve done
> that, any existing HDFS client accesses them directly from the cache.
Ah, yeah, sure
Is that true? I believe that API Chanwit is talking about requires
explicitly asking for files to be cached in HDFS.
Spark automatically benefits from the kernel's page cache (i.e. if
some block is in the kernel's page cache, it will be read more
quickly). But the explicit HDFS cache is a different mechanism.
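
One way to see the difference from a client is that the centralized cache shows
up in block locations, while the kernel page cache does not. A small sketch,
with a made-up namenode URI and file path:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ShowCachedHosts {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    val status = fs.getFileStatus(new Path("/data/events/part-00000"))
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    for (loc <- locations) {
      // getCachedHosts lists datanodes holding an in-memory (cached) copy of
      // the block; it stays empty for data that is only in the OS page cache.
      println(s"offset=${loc.getOffset} cachedOn=${loc.getCachedHosts.mkString(",")}")
    }
  }
}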
That API is something the HDFS administrator uses outside of any application to
tell HDFS to cache certain files or directories. But once you’ve done that, any
existing HDFS client accesses them directly from the cache.
Matei
On May 12, 2014, at 11:10 AM, Marcelo Vanzin wrote:
> Is that true?
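
For anyone following along, the administrator-side step Matei describes is the
cacheadmin CLI that ships with HDFS 2.3 (hdfs cacheadmin -addPool and
-addDirective). The same calls are available on the HDFS client API; a minimal
sketch, with a made-up namenode URI, pool name, and directory:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.hadoop.hdfs.protocol.{CacheDirectiveInfo, CachePoolInfo}

object CacheSetup {
  def main(args: Array[String]): Unit = {
    val dfs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
      .asInstanceOf[DistributedFileSystem]
    // Equivalent of: hdfs cacheadmin -addPool spark-pool
    dfs.addCachePool(new CachePoolInfo("spark-pool"))
    // Equivalent of: hdfs cacheadmin -addDirective -path /data/events -pool spark-pool
    dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
      .setPath(new Path("/data/events"))
      .setPool("spark-pool")
      .build())
  }
}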
Yes, Spark goes through the standard HDFS client and will automatically benefit
from this.
Matei
On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi wrote:
> Hi all,
>
> Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
> sc.textFile() and other HDFS-related APIs?
>
> http://hadoop.apac
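
Assuming the directory has been cached as sketched above, the Spark side needs
no changes at all. A minimal Spark 0.9 sketch (master, app name, and path are
made up): the standard HDFS client underneath sc.textFile reads cached blocks
on its own.

import org.apache.spark.SparkContext

object ReadCachedInput {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "hdfs-cache-example")
    // Nothing cache-specific here: the same call as for any other HDFS file.
    val lines = sc.textFile("hdfs://namenode:8020/data/events")
    println(lines.count())
    sc.stop()
  }
}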