Okay, found the reason for the task delay within the executor. When some RDD partitions are in memory and some have to be read from Hadoop, the tasks have multiple locality levels (NODE_LOCAL and ANY). In this case the scheduler waits for *spark.locality.wait* (3 seconds by default). During this period the scheduler waits to launch a data-local task before giving up and launching it on a less-local node.
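For reference, a minimal sketch of how the setting can be changed when the application builds its SparkContext. The object name, the app name, and taking the logfile path from args(0) are assumptions made for illustration; the three lines on logData are the ones from the original application (cached variant), and the value "0" is the one tried in this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper object; only the spark.locality.wait setting and
// the logData/numAs/numBs lines come from the thread.
object LocalityWaitExample {
  def main(args: Array[String]): Unit = {
    // Assumption: the logfile path is passed as the first argument.
    val logFile = args(0)

    val conf = new SparkConf()
      .setAppName("LocalityWaitExample") // hypothetical app name
      // Default is 3s. "0" disables the wait, so the scheduler falls back
      // to a less-local level immediately instead of holding the task.
      .set("spark.locality.wait", "0")

    val sc = new SparkContext(conf)

    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()
  }
}
```

The same property can also be passed at submit time with --conf spark.locality.wait=0, which avoids hard-coding it in the application.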
So after setting it to 0, all tasks started in parallel. But I learned that it is better not to reduce it to 0.

On Mon, Feb 1, 2016 at 2:02 PM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> Hi All,
>
> A sample Spark application reads a logfile from Hadoop (1.2GB - 5 RDD
> partitions created, each approx. 250MB of data) and runs two jobs. Job A
> counts the lines containing "a" and Job B counts the lines containing "b".
> The application is run multiple times, each time with a different executor
> memory and with the cache() function enabled or disabled. Job A performance
> is the same in all the runs, as it has to read the entire data from disk
> the first time.
>
> Spark Cluster - standalone mode with Spark Master, single worker node (12
> cores, 16GB memory)
>
> val logData = sc.textFile(logFile, 2)
> var numAs = logData.filter(line => line.contains("a")).count()
> var numBs = logData.filter(line => line.contains("b")).count()
>
> *Job B (which has 5 tasks) results below:*
>
> *Run 1:* 1 executor with 2GB memory, 12 cores took 2 seconds [ran1 image]
>
> Since logData is not cached, Job B has to read the 1.2GB of data from
> Hadoop into memory again. All 5 tasks started in parallel, each took
> 2 sec (29ms for GC), and the overall job completed in 2 seconds.
>
> *Run 2:* 1 executor with 2GB memory, 12 cores and logData is cached took
> 4 seconds [ran2 image, ran2_cache image]
>
> val logData = sc.textFile(logFile, 2).cache()
>
> The executor does not have enough memory to cache the data and hence
> again needs to read the entire 1.2GB from Hadoop into memory. But since
> cache() is used, there are a lot of GC pauses, which slow down task
> completion. All tasks started in parallel and each completed in 4 seconds
> (more than 1 sec for GC).
>
> *Run 3:* 1 executor with 6GB memory, 12 cores and logData is cached took
> 10 seconds [ran3 image]
>
> The executor has enough memory to hold 4 RDD partitions, but the 5th
> partition has to be read from Hadoop. 4 tasks started in parallel and
> completed in 0.3 seconds without GC. But the 5th task, which has to read
> its partition from disk, started after 4 seconds and completed in
> 2 seconds. I am analysing why the 5th task is not started in parallel
> with the other tasks, or at least why it is not started immediately after
> the other tasks complete.
>
> *Run 4:* 1 executor with 16GB memory, 12 cores and logData is cached
> took 0.3 seconds [ran4 image]
>
> The executor has enough memory to cache all 5 RDD partitions. All 5
> tasks started in parallel and completed within 0.3 seconds.
>
> So Spark performs well when either the entire input data is in memory or
> none of it is. When some partitions are in memory and some have to be read
> from disk, there is a delay in scheduling the fifth task. Is this expected
> behavior or a possible bug?
>
> Thanks,
> Prabhu Joseph