Re: Spark performance regression test suite

2016-07-08 Thread Eric Liang
Something like speed.pypy.org or the Chrome performance dashboards would be very useful. On Fri, Jul 8, 2016 at 9:50 AM Holden Karau wrote: > There are also the spark-perf …

Re: Spark performance regression test suite

2016-07-08 Thread Holden Karau
There are also the spark-perf and spark-sql-perf projects in the Databricks GitHub (although I see an open issue for Spark 2.0 support in one of them). On Friday, July 8, 2016, Ted Yu wrote: > Found a few issues: > > [SPARK-6810] Performance benchmarks for SparkR > [SPARK-2833] performance tests for linear regression …

Re: Spark performance regression test suite

2016-07-08 Thread Ted Yu
Found a few issues:
[SPARK-6810] Performance benchmarks for SparkR
[SPARK-2833] performance tests for linear regression
[SPARK-15447] Performance test for ALS in Spark 2.0
Haven't found one for Spark core. On Fri, Jul 8, 2016 at 8:58 AM, Michael Allman wrote: > Hello, > > I've seen a few messages …

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Ted Yu
bq. we turned it off when fixing a bug Adam: Can you refer to the bug JIRA? Thanks. On Fri, Jul 8, 2016 at 9:22 AM, Adam Roberts wrote: > Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned > vs 2.0.0 default vs 1.6.2 default comparison, for future reference the > defaults …

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned vs 2.0.0 default vs 1.6.2 default comparison. For future reference, the defaults in Spark 2 RC2 look to be:
sql.shuffle.partitions: 200
Tungsten enabled: true
Executor memory: 1 GB (we set this to 18 GB)
kryo buffer max: 64mb
…
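For readers wanting to reproduce the tuned run, a minimal sketch of the corresponding spark-defaults.conf entries, assuming the standard Spark property names for the settings quoted above (these lines are illustrative, not copied from Adam's actual files):

  # Illustrative overrides of the Spark 2 RC2 defaults listed above
  spark.sql.shuffle.partitions      200   # RC2 default
  spark.sql.tungsten.enabled        true  # RC2 default
  spark.executor.memory             18g   # RC2 default is 1g; the runs in this thread used 18 GB
  spark.kryoserializer.buffer.max   64m   # RC2 default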

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
Here are some settings we use for some very large GraphX jobs. These are based on using EC2 c3.8xl workers:
.set("spark.sql.shuffle.partitions", "1024")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.executor.memory", "24g")
.set("spark.kryoserializer.buffer.max", "1g")
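A minimal, self-contained sketch of those settings applied when constructing a context; the app name and master URL are placeholders, not taken from the original message:

  import org.apache.spark.{SparkConf, SparkContext}

  object GraphXJobConf {
    def main(args: Array[String]): Unit = {
      // Michael's four settings, applied programmatically (sketch only)
      val conf = new SparkConf()
        .setAppName("graphx-job")                   // placeholder
        .setMaster("spark://master:7077")           // placeholder
        .set("spark.sql.shuffle.partitions", "1024")
        .set("spark.sql.tungsten.enabled", "true")
        .set("spark.executor.memory", "24g")
        .set("spark.kryoserializer.buffer.max", "1g")
      val sc = new SparkContext(conf)
      // ... GraphX job would run here ...
      sc.stop()
    }
  }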

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
Hi Adam, In our experience, we've found the default Spark 2.0 configuration to be highly suboptimal. I don't know if this affects your benchmarks, but I would consider running some tests with tuned and alternate configurations. Michael > On Jul 8, 2016, at 8:58 AM, Adam Roberts wrote: > …

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
Hi Michael, the two Spark configuration files aren't very exciting.
spark-env.sh: same as the template apart from a JAVA_HOME setting
spark-defaults.conf: spark.io.compression.codec lzf
config.py has the Spark home set, uses Spark standalone mode, we run and prep the Spark tests only, driver 8g, …
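Pieced together from the description above, the two files would look roughly like this (the JAVA_HOME path is a placeholder; everything else is as quoted):

  # spark-env.sh -- same as the template, plus:
  export JAVA_HOME=/path/to/jdk   # placeholder path

  # spark-defaults.conf
  spark.io.compression.codec lzf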

Spark performance regression test suite

2016-07-08 Thread Michael Allman
Hello, I've seen a few messages on the mailing list regarding Spark performance concerns, especially regressions from previous versions. It got me thinking that perhaps an automated performance regression suite would be a worthwhile contribution. Is anyone working on this? Do we have a Jira issue …

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Michael Allman
Hi Adam, Do you have your Spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available? Cheers, Michael > On Jul 8, 2016, at 3:17 AM, Adam Roberts wrote: > > Hi, we've been testing the performance of Spark 2.0 compared to previous > releases, unfortunately …

Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
Hi, we've been testing the performance of Spark 2.0 compared to previous releases; unfortunately, there are no Spark 2.0-compatible versions of HiBench and SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108). With the Spark 2.0 version of SparkPerf …

Re: Understanding pyspark data flow on worker nodes

2016-07-08 Thread Adam Roberts
Hi, sharing what I discovered with PySpark too; it corroborates what Amit notices, and I'm also interested in the pipe question: https://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3c201603291521.u2tflbfo024...@d06av05.portsmouth.uk.ibm.com%3E // Start a thread to feed the process input …
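For anyone following the pipe question, a toy sketch of the general pattern the quoted comment refers to: the JVM launches a worker process, a separate thread feeds its stdin, and the calling thread reads results from its stdout. This is illustrative only, not Spark's actual PythonRDD code, and the Python one-liner is just a stand-in worker:

  import java.io.{BufferedReader, InputStreamReader, PrintWriter}

  object PipeSketch {
    def main(args: Array[String]): Unit = {
      // Stand-in "worker": uppercases each line it reads from stdin
      val proc = new ProcessBuilder("python3", "-c",
        "import sys\nfor line in sys.stdin: print(line.strip().upper())").start()

      // Start a thread to feed the process input (the pattern behind the quoted comment)
      val writer = new Thread(new Runnable {
        def run(): Unit = {
          val out = new PrintWriter(proc.getOutputStream)
          Seq("alpha", "beta").foreach(s => out.println(s))
          out.close() // closing stdin signals end-of-input to the worker
        }
      })
      writer.setDaemon(true)
      writer.start()

      // Meanwhile, consume the worker's stdout as the results
      val in = new BufferedReader(new InputStreamReader(proc.getInputStream))
      Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(println)
      writer.join()
      proc.waitFor()
    }
  }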

Re: Stopping Spark executors

2016-07-08 Thread Jacek Laskowski
Hi, Read the doc http://spark.apache.org/docs/latest/spark-standalone.html, which covers Spark Standalone, the cluster manager the OP seems to be using. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski