bq. we turned it off when fixing a bug

Adam: can you point us to the JIRA for that bug?

Thanks
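(For anyone reproducing this, a quick, minimal sketch of how one could confirm
whether whole-stage codegen is actually active in a given build, assuming a
Spark 2.0 spark-shell where the session is available as "spark"; in 2.0 the
explain() output marks codegen'd operators with a leading asterisk.)

    // Read the current value of the whole-stage codegen switch
    val wholeStage = spark.conf.get("spark.sql.codegen.wholeStage")
    println(s"spark.sql.codegen.wholeStage = $wholeStage")

    // If codegen is on, operators in the plan appear as e.g. *HashAggregate
    spark.range(100).selectExpr("sum(id)").explain()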
On Fri, Jul 8, 2016 at 9:22 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:

> Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned
> vs 2.0.0 default vs 1.6.2 default comparison. For future reference, the
> defaults in Spark 2 RC2 look to be:
>
> sql.shuffle.partitions: 200
> Tungsten enabled: true
> Executor memory: 1 GB (we set to 18 GB)
> kryo buffer max: 64mb
> WholeStageCodegen: on I think, we turned it off when fixing a bug
> offHeap.enabled: false
> offHeap.size: 0
>
> Cheers,
>
>
> From: Michael Allman <mich...@videoamp.com>
> To: Adam Roberts/UK/IBM@IBMGB
> Cc: dev <dev@spark.apache.org>
> Date: 08/07/2016 17:05
> Subject: Re: Spark 2.0.0 performance; potential large Spark core regression
> ------------------------------
>
> Here are some settings we use for some very large GraphX jobs. These are
> based on using EC2 c3.8xl workers:
>
> .set("spark.sql.shuffle.partitions", "1024")
> .set("spark.sql.tungsten.enabled", "true")
> .set("spark.executor.memory", "24g")
> .set("spark.kryoserializer.buffer.max", "1g")
> .set("spark.sql.codegen.wholeStage", "true")
> .set("spark.memory.offHeap.enabled", "true")
> .set("spark.memory.offHeap.size", "25769803776") // 24 GB
>
> Some of these are in fact default configurations. Some are not.
>
> Michael
>
> On Jul 8, 2016, at 9:01 AM, Michael Allman <mich...@videoamp.com> wrote:
>
> Hi Adam,
>
> From our experience we've found the default Spark 2.0 configuration to be
> highly suboptimal. I don't know if this affects your benchmarks, but I
> would consider running some tests with tuned and alternate configurations.
>
> Michael
>
> On Jul 8, 2016, at 8:58 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
>
> Hi Michael, the two Spark configuration files aren't very exciting.
>
> spark-env.sh
> Same as the template apart from a JAVA_HOME setting
>
> spark-defaults.conf
> spark.io.compression.codec lzf
>
> config.py has the Spark home set, is running Spark standalone mode, we
> run and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66
> memory fraction, 100 trials.
>
> We can post the 1.6.2 comparison early next week, running lots of
> iterations over the weekend once we get the dedicated time again.
>
> Cheers,
>
>
> From: Michael Allman <mich...@videoamp.com>
> To: Adam Roberts/UK/IBM@IBMGB
> Cc: dev <dev@spark.apache.org>
> Date: 08/07/2016 16:44
> Subject: Re: Spark 2.0.0 performance; potential large Spark core regression
> ------------------------------
>
> Hi Adam,
>
> Do you have your spark confs and your spark-env.sh somewhere where we can
> see them? If not, can you make them available?
>
> Cheers,
>
> Michael
>
> On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
>
> Hi, we've been testing the performance of Spark 2.0 compared to previous
> releases. Unfortunately there are no Spark 2.0 compatible versions of
> HiBench and SparkPerf apart from those I'm working on (see
> https://github.com/databricks/spark-perf/issues/108).
>
> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean
> regression with a very small scale factor, so we've generated a couple of
> profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform.
> We will gather a 1.6.2 comparison and increase the scale factor.
>
> Has anybody noticed a similar problem?
>
> My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't
> interfere with Spark core functionality, so any feedback on the changes
> would be much appreciated and welcome; I'd much prefer it if my changes
> were the problem.
>
> A summary for your convenience follows (this matches what I've mentioned
> on the SparkPerf issue above).
>
> 1. spark-perf/config/config.py: SCALE_FACTOR=0.05
> No. of Workers: 1
> Executors per Worker: 1
> Executor Memory: 18G
> Driver Memory: 8G
> Serializer: kryo
>
> 2. $SPARK_HOME/conf/spark-defaults.conf: executor Java options:
> -Xdisableexplicitgc -Xcompressedrefs
>
> Main changes I made for the benchmark itself:
>
> - Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
> - MLAlgorithmTests use Vectors.fromML
> - For streaming-tests, in HdfsRecoveryTest we use wordStream.foreachRDD
>   not wordStream.foreach
> - KVDataTest uses awaitTerminationOrTimeout on a StreamingContext
>   instead of awaitTermination
> - Trivial: we use compact not compact.render for outputting json
>
> In Spark 2.0 the top five methods where we spend our time are as follows;
> the percentage is how much of the overall processing time was spent in
> that particular method:
> 1. AppendOnlyMap.changeValue 44%
> 2. SortShuffleWriter.write 19%
> 3. SizeTracker.estimateSize 7.5%
> 4. SizeEstimator.estimate 5.36%
> 5. Range.foreach 3.6%
>
> and in 1.5.2 the top five methods are:
> 1. AppendOnlyMap.changeValue 38%
> 2. ExternalSorter.insertAll 33%
> 3. Range.foreach 4%
> 4. SizeEstimator.estimate 2%
> 5. SizeEstimator.visitSingleObject 2%
>
> I see the following scores; the test name is followed by the 1.5.2 time
> and then the 2.0.0 time:
> scheduling throughput: 5.2s vs 7.08s
> agg by key: 0.72s vs 1.01s
> agg by key int: 0.93s vs 1.19s
> agg by key naive: 1.88s vs 2.02s
> sort by key: 0.64s vs 0.8s
> sort by key int: 0.59s vs 0.64s
> scala count: 0.09s vs 0.08s
> scala count w fltr: 0.31s vs 0.47s
>
> This is only running the Spark core tests (scheduling throughput through
> scala-count-w-fltr, including everything in between).
>
> Cheers,
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
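For reference, here is a sketch of wiring the tuned settings Michael suggests
above into a Spark 2.0 application. The values are taken from the thread; the
18g executor memory follows Adam's rig rather than Michael's 24g, and the app
name is just a placeholder. With spark-perf these would normally be fed in via
config.py or spark-defaults.conf rather than set in code.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.sql.shuffle.partitions", "1024")
      .set("spark.sql.tungsten.enabled", "true")
      .set("spark.executor.memory", "18g")              // Adam's setup; Michael uses 24g
      .set("spark.kryoserializer.buffer.max", "1g")
      .set("spark.sql.codegen.wholeStage", "true")
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "25769803776")  // 24 GB, as in Michael's example

    val spark = SparkSession.builder()
      .appName("spark-perf-tuned")                      // placeholder name
      .config(conf)
      .getOrCreate()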
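And a rough, self-contained sketch of the Spark 2.0 API migrations Adam lists:
foreachRDD in place of the old DStream.foreach, awaitTerminationOrTimeout in
place of the awaitTermination call the 1.x tests used, and Vectors.fromML to
bridge the new ml.linalg vector types back to mllib. The socket source, host,
port, batch interval and timeout below are illustrative only and not taken
from the spark-perf sources.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.ml.linalg.{Vectors => NewVectors}
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

    object Spark2MigrationSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("spark2-migration-sketch")
          .setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Per-batch work now goes through foreachRDD rather than DStream.foreach
        val wordStream = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        wordStream.foreachRDD { rdd =>
          println(s"words in this batch: ${rdd.count()}")
        }

        ssc.start()
        // awaitTerminationOrTimeout returns true if the context stopped
        // before the timeout elapsed
        val stopped = ssc.awaitTerminationOrTimeout(10000L)
        if (!stopped) ssc.stop(stopSparkContext = true, stopGracefully = false)

        // MLAlgorithmTests: convert a new ml.linalg vector to the old mllib type
        val mllibVec = OldVectors.fromML(NewVectors.dense(1.0, 2.0, 3.0))
        println(mllibVec)
      }
    }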