I configured HDFS to cache the input files in HDFS's centralized cache, as follows:

hdfs cacheadmin -addPool hibench
hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench

But I didn't see much performance impact, no matter how I configured dfs.datanode.max.locked.memory.

Is it possible that Spark doesn't know the data is in the HDFS cache, and still reads it from disk instead of from the cache?

Thanks!
Jia
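P.S. For reference, here is a minimal sketch (using the same pool and path as above) of how one might check whether the blocks were actually pinned in the cache before blaming Spark; the -stats output should show bytes cached versus bytes needed:

# Show the directive for the input path and how much of it is actually cached
hdfs cacheadmin -listDirectives -stats -path /HiBench/Kmeans/Input -pool hibench

# Show pool-level limits and usage
hdfs cacheadmin -listPools -stats hibench

If the cached byte count stays at 0, the DataNode's locked-memory limit is a likely culprit: dfs.datanode.max.locked.memory in hdfs-site.xml must be large enough for the data, and it also cannot exceed the DataNode user's memlock ulimit (ulimit -l), so the setting may silently have no effect.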