Is it possible to re-run your job with spark.eventLog.enabled set to true, and send the resulting logs to the list? Those have more per-task information that can help diagnose this.
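For example, something along these lines when building the SparkContext (just a sketch; the application name and event-log directory below are placeholders, so point the directory somewhere your cluster can write to):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: enable event logging so per-task metrics are written out.
    // App name and log directory are placeholders, not from this thread.
    val conf = new SparkConf()
      .setAppName("log-parsing-job")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")
    val sc = new SparkContext(conf)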
-Kay

On Wed, Jan 21, 2015 at 1:57 AM, Fengyun RAO <raofeng...@gmail.com> wrote:

> btw: Shuffle Write (11 GB) means 11 GB per executor; for each task, it's ~40 MB.
>
> 2015-01-21 17:53 GMT+08:00 Fengyun RAO <raofeng...@gmail.com>:
>
>> I don't know how to debug a distributed application; any tools or
>> suggestions?
>>
>> But from the spark web UI, the GC time (~0.1 s) and Shuffle Write (11 GB)
>> are similar for spark 1.1 and 1.2, and there is no Shuffle Read or Spill.
>> The only difference is Duration:
>>
>> Duration     Min    25th percentile   Median   75th percentile   Max
>> spark 1.2    4 s    37 s              45 s     53 s              1.9 min
>> spark 1.1    2 s    17 s              18 s     18 s              34 s
>>
>> 2015-01-21 16:56 GMT+08:00 Sean Owen <so...@cloudera.com>:
>>
>>> I mean that if you had tasks running on 10 machines now instead of 3 for
>>> some reason, you would have more than 3 times the read load on your source
>>> of data all at once. Same if you made more executors per machine. But from
>>> your additional info it does not sound like this is the case. I think you
>>> need more debugging to pinpoint what is slower.
>>>
>>> On Jan 21, 2015 9:30 AM, "Fengyun RAO" <raofeng...@gmail.com> wrote:
>>>
>>>> Thanks, Sean.
>>>>
>>>> I don't quite understand "you have *more* partitions across *more*
>>>> workers".
>>>>
>>>> It's the same cluster and the same data, thus I think the same
>>>> partitions and the same workers.
>>>>
>>>> We switched from spark 1.1 to 1.2, and it became 3x slower.
>>>>
>>>> (We upgraded from CDH 5.2.1 to CDH 5.3, hence spark 1.1 to 1.2, and
>>>> found the problem. Then we installed a standalone spark 1.1, stopped 1.2,
>>>> and ran the same script: it was 3x faster. We stopped 1.1 and started 1.2
>>>> again: 3x slower again.)
>>>>
>>>> 2015-01-21 15:45 GMT+08:00 Sean Owen <so...@cloudera.com>:
>>>>
>>>>> I don't know of any reason to think the singleton pattern doesn't work
>>>>> or works differently. I wonder if, for example, task scheduling is
>>>>> different in 1.2 and you have more partitions across more workers and so
>>>>> are loading more copies more slowly into your singletons.
>>>>>
>>>>> On Jan 21, 2015 7:13 AM, "Fengyun RAO" <raofeng...@gmail.com> wrote:
>>>>>
>>>>>> The LogParser instance is not serializable, and thus cannot be a
>>>>>> broadcast.
>>>>>>
>>>>>> What's worse, it contains an LRU cache, which is essential to
>>>>>> performance and which we would like to share among all the tasks on the
>>>>>> same node.
>>>>>>
>>>>>> If that is the case, what's the recommended way to share a variable
>>>>>> among all the tasks within the same executor?
>>>>>>
>>>>>> 2015-01-21 15:04 GMT+08:00 Davies Liu <dav...@databricks.com>:
>>>>>>
>>>>>>> Maybe some change related to closure serialization causes LogParser
>>>>>>> to no longer be a singleton, so that it is initialized for every task.
>>>>>>>
>>>>>>> Could you change it to a Broadcast?
>>>>>>>
>>>>>>> On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO <raofeng...@gmail.com>
>>>>>>> wrote:
>>>>>>> > Currently we are migrating from spark 1.1 to spark 1.2, but found
>>>>>>> > the program 3x slower, with nothing else changed.
>>>>>>> > Note: our program on spark 1.1 has successfully processed a whole
>>>>>>> > year's data, quite stably.
>>>>>>> >
>>>>>>> > The main script is as below:
>>>>>>> >
>>>>>>> > sc.textFile(inputPath)
>>>>>>> >   .flatMap(line => LogParser.parseLine(line))
>>>>>>> >   .groupByKey(new HashPartitioner(numPartitions))
>>>>>>> >   .mapPartitionsWithIndex(...)
>>>>>>> >   .foreach(_ => {})
>>>>>>> >
>>>>>>> > where LogParser is a singleton which may take some time to
>>>>>>> > initialize and is shared across the executor.
>>>>>>> >
>>>>>>> > The flatMap stage is 3x slower.
>>>>>>> >
>>>>>>> > We tried to change spark.shuffle.manager back to hash, and
>>>>>>> > spark.shuffle.blockTransferService back to nio, but it didn't help.
>>>>>>> >
>>>>>>> > Could somebody explain possible causes, or what we should test or
>>>>>>> > change to find it out?
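For reference, the executor-level singleton pattern under discussion looks roughly like the sketch below. This is not the actual LogParser: the cache size, key/value types, and parse logic are all placeholders. The point is that a Scala `object` is initialized once per JVM, i.e. once per executor, so every task running in that executor shares the same instance and LRU cache.

    import java.util.{LinkedHashMap => JLinkedHashMap}

    // Sketch of a per-executor singleton parser with an LRU cache.
    object LogParser {
      private val MaxEntries = 10000  // placeholder cache size

      // An access-ordered LinkedHashMap acts as an LRU cache once
      // removeEldestEntry is overridden to evict past MaxEntries.
      private val cache =
        new JLinkedHashMap[String, Seq[(String, String)]](16, 0.75f, true) {
          override def removeEldestEntry(
              eldest: java.util.Map.Entry[String, Seq[(String, String)]]): Boolean =
            size() > MaxEntries
        }

      def parseLine(line: String): Seq[(String, String)] = synchronized {
        val cached = cache.get(line)
        if (cached != null) cached
        else {
          val parsed = expensiveParse(line)  // placeholder for the real, costly parse
          cache.put(line, parsed)
          parsed
        }
      }

      // placeholder standing in for the expensive parsing logic
      private def expensiveParse(line: String): Seq[(String, String)] =
        line.split("\\s+").toSeq.map(field => (field, field))
    }

Tasks then call LogParser.parseLine(line) inside flatMap exactly as in the script above; because the object lives in the executor JVM rather than being shipped with the closure, it does not need to be serializable or broadcast.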