Hi all, I am running Spark locally on a single node and trying to sweep the memory size for performance tuning. The machine has 8 CPUs and 16 GB of main memory, and the dataset on my local disk is about 10 GB. I have a few quick questions and would appreciate any comments.
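(For context, each run is launched with spark-submit, along the lines of spark-submit --master local[8] --conf spark.driver.memory=<size> wordcount.jar; the master setting and jar name here are only illustrative.)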
1. Spark performs in-memory computing, but without calling RDD.cache(), will anything be cached in memory at all? My guess is that, without RDD.cache(), only a small amount of data ends up in the OS buffer cache, and every iteration of the computation still has to fetch most of the data from disk. Is that right?

2. To evaluate how caching helps with iterative computation, I wrote the simple program below, which consists of one saveAsTextFile() action followed by three reduce() actions/stages. I set "spark.driver.memory" to "15g" and left everything else at the defaults, then ran three experiments.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/InputFiles")
    input.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("/OutputFiles")
    val ITERATIONS = 3
    for (i <- 1 to ITERATIONS) {
      val totalLength = input.filter(line => line.contains("the"))
        .map(s => s.length)
        .reduce((a, b) => a + b)
    }

(I) The first run, with no caching at all: the application finishes in ~12 minutes (2.6 min + 3.3 min + 3.2 min + 3.3 min).

(II) In the second run, I modified the code so that the input is cached:

    val input = sc.textFile("/InputFiles").cache()

The application finishes in ~11 minutes (5.4 min + 1.9 min + 1.9 min + 2.0 min). The Storage page in the Web UI shows 48% of the dataset is cached, which makes sense given the large Java object overhead and the default spark.storage.memoryFraction of 0.6.

(III) In the third run, with the same program as the second, I changed "spark.driver.memory" to "2g". The application finishes in just 3.6 minutes (3.0 min + 9 s + 9 s + 9 s), and the UI shows only 6% of the data is cached.

From these results, the reduce stages finish in seconds. How can that happen with only 6% of the data cached? Can anyone explain?

I am new to Spark and would appreciate any help on this. Thanks!

Jia
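P.S. One thing I have not tried yet, but which I guess might let more of the dataset fit in memory given the Java object overhead, is persisting the input in serialized form, something along these lines (untested on my side):

    import org.apache.spark.storage.StorageLevel

    val input = sc.textFile("/InputFiles").persist(StorageLevel.MEMORY_ONLY_SER)

Would that be a sensible variant to compare as well?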