Re[10]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-07 Thread Сергей Романов
63 0.682888031006, 64 0.691393136978, 65 0.690823078156, 66 0.70525097847, 67 0.724694013596, 68 0.737638950348, 69 0.749594926834 ... Yong ... From: Davies Liu <dav...@databricks.com> Sent: Tuesday, September 6, 2016 2:27 PM To: Сергей Ром

Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-06 Thread Yong Zhang
…9594926834 Yong From: Davies Liu Sent: Tuesday, September 6, 2016 2:27 PM To: Сергей Романов Cc: Gavin Yue; Mich Talebzadeh; user Subject: Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation. I think the slowness is caused by

Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-06 Thread Davies Liu
I think the slowness is caused by the generated aggregate method having more than 8K of bytecode, so it is not JIT-compiled and becomes much slower. Could you try disabling DontCompileHugeMethods with: -XX:-DontCompileHugeMethods On Mon, Sep 5, 2016 at 4:21 AM, Сергей Романов wrote: > Hi, Gavin, > > Shu
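[Editor's note: a minimal sketch of how the suggested flag could be wired into the PySpark sessions used in this thread. The config keys are standard Spark settings; the app name is illustrative.]

# Sketch: pass -XX:-DontCompileHugeMethods to the executor JVMs so the JIT
# also compiles methods larger than the 8K-bytecode HugeMethodLimit.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("huge-method-jit-test")  # illustrative name
    .config("spark.executor.extraJavaOptions", "-XX:-DontCompileHugeMethods")
    .getOrCreate()
)

Note that the driver JVM reads spark.driver.extraJavaOptions only at launch, so for the driver the flag belongs on the spark-submit command line or in spark-defaults.conf rather than in SparkSession.builder.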

Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-05 Thread Сергей Романов
Hi, Gavin, Shuffling is exactly the same in both requests and is minimal. Both requests produce one shuffle task. Running time is the only difference I can see in the metrics: timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * 57)).collect,
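[Editor's note: a hedged reconstruction of the truncated benchmark above. The path and column name come from the snippet; the schema is simplified to the one summed column (the real file presumably has more), and number=1 is assumed from the analogous call later in the thread.]

# Sketch of the CSV benchmark from the message.
import timeit
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([StructField('dd_convs', DoubleType())])  # simplified assumption
df = spark.read.csv('file:///data/dump/test_csv', schema=schema)
# One timed run of summing the same column 57 times:
print(timeit.timeit(df.groupBy().sum(*(['dd_convs'] * 57)).collect, number=1))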

Re: Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Gavin Yue
Any shuffling? > On Sep 3, 2016, at 5:50 AM, Сергей Романов wrote: > Same problem happens with a CSV data file, so it's not parquet-related either. > [Spark shell welcome banner, truncated]

Re[7]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
An even simpler case:
>>> df = sc.parallelize([1] for x in xrange(760857)).toDF()
>>> for x in range(50, 70): print x, timeit.timeit(df.groupBy().sum(*(['_1'] * x)).collect, number=1)
50 1.91226291656
51 1.50933384895
52 1.582903862
53 1.90537405014
54 1.84442877769
55 1.9177978
56
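[Editor's note: one way to check the generated-code-size theory is to look at the Java source Spark emits for the plan. A sketch, assuming Spark 2.0's EXPLAIN CODEGEN syntax and a hypothetical temp-view name:]

# Sketch: dump the whole-stage generated code for a wide aggregation.
# A doAgg()/processNext() body past ~8K of bytecode would not be JIT-compiled.
df.createOrReplaceTempView("t")  # hypothetical view name
sums = ", ".join(["SUM(`_1`)"] * 60)
for row in spark.sql("EXPLAIN CODEGEN SELECT %s FROM t" % sums).collect():
    print(row[0])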

Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
Same problem happens with a CSV data file, so it's not parquet-related either. [Spark shell welcome banner: version 2.0.0] Using Python version 2.7.6 (default, Jun 22 2015 17:58:13) SparkSessi

Re[5]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
Hi, I had narrowed down my problem to a very simple case. I'm sending a 27 KB parquet file in the attachment (file:///data/dump/test2 in the example). Please, can you take a look at it? Why is there a performance drop after 57 sum columns? [Spark shell welcome banner, truncated]
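[Editor's note: a sketch of the repro the attachment is meant for, sweeping the number of summed columns across the reported cliff at 57. The path is from the message; the column name is assumed to match the earlier CSV test.]

# Sketch: time the aggregation for a growing number of summed columns.
import timeit

df = spark.read.parquet('file:///data/dump/test2')
for n in range(50, 70):
    t = timeit.timeit(df.groupBy().sum(*(['dd_convs'] * n)).collect, number=1)
    print('%d %f' % (n, t))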

Re[4]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
Hi, Mich, I don't think it is related to Hive or parquet partitioning. The same issue happens while working with a non-partitioned parquet file using the Python DataFrame API. Please take a look at the following example: $ hdfs dfs -ls /user/test   // I had copied partition dt=2016-07-28 to another standalon
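[Editor's note: to mirror this description, a sketch of the DataFrame-API check against the copied partition; paths and column names are illustrative, following the thread's naming.]

# Sketch: read the copied partition directory as a plain, non-partitioned
# parquet DataFrame, bypassing Hive entirely.
df = spark.read.parquet('hdfs:///user/test')  # copy of partition dt=2016-07-28
df.groupBy('field').sum('x29').show()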

Re: Re[2]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-02 Thread Mich Talebzadeh
Since you are using the Spark Thrift Server (which in turn uses the Hive Thrift Server), I have a suspicion that it uses the Hive optimiser, which would mean that stats do matter. However, that may be just an assumption. Have you partitioned these parquet tables? Is it worth logging into Hive and running the same

Re[2]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-02 Thread Сергей Романов
Hi, Mich, Column x29 does not seem to be special in any way. It's a newly created table and I did not calculate stats for any columns. Actually, I can sum a single column several times in one query and hit the same dramatic performance drop at some "magic" point. Setting "set spark.sql.codegen.wholeStage=f
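[Editor's note: the truncated setting above is presumably spark.sql.codegen.wholeStage; a sketch of both ways to flip it in Spark 2.0.]

# Sketch: disable whole-stage code generation to see whether the cliff moves.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
# Equivalent SQL form, matching the truncated "set ..." in the message:
spark.sql("SET spark.sql.codegen.wholeStage=false")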

Re: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-01 Thread Mich Talebzadeh
What happens if you run the following query on its own? How long does it take? SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field Have stats been updated for all columns in Hive? And what is the type of the x29 field? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/p
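[Editor's note: a sketch of the stats refresh the question points at. Table-level ANALYZE exists in Spark 2.0 for Hive tables; the column-level form is assumed to need Hive itself.]

# Sketch: refresh table statistics for the table named in the thread.
spark.sql("ANALYZE TABLE parquet_table COMPUTE STATISTICS")
# Column-level stats would be run from Hive, not Spark 2.0 SQL:
#   ANALYZE TABLE parquet_table COMPUTE STATISTICS FOR COLUMNS;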

Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-01 Thread Сергей Романов
Hi, When I run a query like "SELECT field, SUM(x1), SUM(x2)... SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field" it runs in under 2 seconds, but when I add just one more aggregate field to the query "SELECT field, SUM(x1), SUM(x2)... SUM(x28), SUM(x29) FROM parquet_table WHERE pa
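[Editor's note: a sketch reconstructing the two queries from this description so the timings can be compared directly. Table and column names follow the message; the timing harness is illustrative.]

# Sketch: build the 28-sum and 29-sum variants of the query and time each.
import timeit

def query(n):
    sums = ", ".join("SUM(x%d)" % i for i in range(1, n + 1))
    return ("SELECT field, %s FROM parquet_table "
            "WHERE partition = 1 GROUP BY field" % sums)

for n in (28, 29):
    t = timeit.timeit(lambda: spark.sql(query(n)).collect(), number=1)
    print('%d sums: %f s' % (n, t))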

Re: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-01 Thread Romanov
Can this be related to SPARK-17115? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-SQL-runs-5x-times-slower-when-adding-29th-field-to-aggregation-tp27624p27643.html