Hi all, I am running Spark locally on a single node and trying to sweep the memory size for performance tuning. The machine has 8 CPUs and 16 GB of main memory, and the dataset on my local disk is about 10 GB. I have a few quick questions and would appreciate any comments.
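(For context, each run is launched with spark-submit, along the lines of spark-submit --master local[8] --conf spark.driver.memory=<size> wordcount.jar; the master setting and jar name here are only illustrative.)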
1. Spark performs in-memory computing, but without calling RDD.cache(), will anything be cached in memory at all? My guess is that, without RDD.cache(), only a small amount of data ends up in the OS buffer cache, and every iteration of the computation still has to fetch most of the data from disk. Is that right?

2. To evaluate how caching helps with iterative computation, I wrote the simple program below, which consists of one saveAsTextFile() action followed by three reduce() actions/stages. I set "spark.driver.memory" to "15g" and left everything else at the defaults, then ran three experiments.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)
    val input = sc.textFile("/InputFiles")
    input.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("/OutputFiles")
    val ITERATIONS = 3
    for (i <- 1 to ITERATIONS) {
      val totalLength = input.filter(line => line.contains("the"))
        .map(s => s.length)
        .reduce((a, b) => a + b)
    }

(I) The first run, with no caching at all: the application finishes in ~12 minutes (2.6 min + 3.3 min + 3.2 min + 3.3 min).

(II) In the second run, I modified the code so that the input is cached:

    val input = sc.textFile("/InputFiles").cache()

The application finishes in ~11 minutes (5.4 min + 1.9 min + 1.9 min + 2.0 min). The Storage page in the Web UI shows 48% of the dataset is cached, which makes sense given the large Java object overhead and the default spark.storage.memoryFraction of 0.6.

(III) In the third run, with the same program as the second, I changed "spark.driver.memory" to "2g". The application finishes in just 3.6 minutes (3.0 min + 9 s + 9 s + 9 s), and the UI shows only 6% of the data is cached.

From these results, the reduce stages finish in seconds. How can that happen with only 6% of the data cached? Can anyone explain?

I am new to Spark and would appreciate any help on this. Thanks!

Jia
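P.S. One thing I have not tried yet, but which I guess might let more of the dataset fit in memory given the Java object overhead, is persisting the input in serialized form, something along these lines (untested on my side):

    import org.apache.spark.storage.StorageLevel

    val input = sc.textFile("/InputFiles").persist(StorageLevel.MEMORY_ONLY_SER)

Would that be a sensible variant to compare as well?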