> 63 0.682888031006
> 64 0.691393136978
> 65 0.690823078156
> 66 0.70525097847
> 67 0.724694013596
> 68 0.737638950348
> 69 0.749594926834

Yong

------
From: Davies Liu <dav...@databricks.com>
Sent: Tuesday, September 6, 2016 2:27 PM
To: Сергей Романов
Cc: Gavin Yue; Mich Talebzadeh; user
Subject: Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.
I think the slowness is caused by the generated aggregate method having more
than 8K bytecodes, so it is not JIT-compiled and becomes much slower.

Could you try disabling DontCompileHugeMethods with:

-XX:-DontCompileHugeMethods
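
For anyone trying this, one way to pass that JVM flag to a PySpark application
is through the extraJavaOptions settings. This is only a sketch; in practice
the driver option usually has to be given on the spark-submit command line,
since it must be set before the driver JVM starts:

# Sketch: pass -XX:-DontCompileHugeMethods to driver and executors.
# Driver JVM options only take effect if set before the JVM starts,
# so they are normally supplied via spark-submit --conf instead.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.extraJavaOptions", "-XX:-DontCompileHugeMethods")
         .config("spark.executor.extraJavaOptions", "-XX:-DontCompileHugeMethods")
         .getOrCreate())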
On Mon, Sep 5, 2016 at 4:21 AM, Сергей Романов wrote:
Hi, Gavin,
Shuffling is exactly the same in both requests and is minimal: both requests
produce one shuffle task. Running time is the only difference I can see in the
metrics:
timeit.timeit(spark.read.csv('file:///data/dump/test_csv',
    schema=schema).groupBy().sum(*(['dd_convs'] * 57)).collect, number=1)
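
For reference, a self-contained version of that benchmark could look like the
sketch below; the single-column schema and the 56-58 column counts are my
assumptions, not something stated in the original message:

# Sketch: time the same aggregation for a few column counts around the
# reported threshold. Schema and path are placeholders.
import timeit
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([StructField("dd_convs", DoubleType())])

for n in (56, 57, 58):
    df = spark.read.csv('file:///data/dump/test_csv', schema=schema)
    agg = df.groupBy().sum(*(['dd_convs'] * n))
    print n, timeit.timeit(agg.collect, number=1)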
Any shuffling?
> On Sep 3, 2016, at 5:50 AM, Сергей Романов wrote:
>
> Same problem happens with CSV data file, so it's not parquet-related either.
And even more simple case:
>>> df = sc.parallelize([1] for x in xrange(760857)).toDF()
>>> for x in range(50, 70): print x, timeit.timeit(df.groupBy().sum(*(['_1'] * x)).collect, number=1)
50 1.91226291656
51 1.50933384895
52 1.582903862
53 1.90537405014
54 1.84442877769
55 1.9177978
56
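
To connect this with the JIT explanation above, one could also look at how
large the generated whole-stage code gets as columns are added. This is only a
sketch and assumes EXPLAIN CODEGEN is available in this build:

# Sketch: dump the generated whole-stage Java source for a 57-column sum
# and eyeball its size (the 8K bytecode limit applies per compiled method).
df = sc.parallelize([1] for x in xrange(1000)).toDF()
df.createOrReplaceTempView("t")
sums = ", ".join(["SUM(_1)"] * 57)
plan = spark.sql("EXPLAIN CODEGEN SELECT " + sums + " FROM t").collect()
print plan[0][0][:2000]  # beginning of the generated source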
Same problem happens with CSV data file, so it's not parquet-related either.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
Hi,
I have narrowed my problem down to a very simple case and I'm sending a 27 KB
parquet file as an attachment (file:///data/dump/test2 in the example).
Please, can you take a look at it? Why is there a performance drop after 57
sum columns?
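
A minimal way to reproduce the comparison on the attached file might look like
the sketch below; the column name 'dd_convs' is taken from the earlier CSV
example and may not match what is actually in test2:

# Sketch: sum one column a varying number of times over the attached
# parquet file and watch for the jump after 57 columns.
import timeit

df = spark.read.parquet('file:///data/dump/test2')
for n in (56, 57, 58, 59):
    agg = df.groupBy().sum(*(['dd_convs'] * n))
    print n, timeit.timeit(agg.collect, number=1)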
Hi, Mich,
I don't think it is related to Hive or parquet partitioning. The same issue
happens while working with a non-partitioned parquet file using the Python
DataFrame API. Please take a look at the following example:

$ hdfs dfs -ls /user/test  // I had copied partition dt=2016-07-28 to another
standalone ...
Since you are using Spark Thrift Server (which in turn uses Hive Thrift
Server), I have a suspicion that it uses the Hive optimiser, which would mean
that stats do matter. However, that may just be an assumption.

Have you partitioned these parquet tables?

Is it worth logging in to Hive and running the same query there?
Hi, Mich,
Column x29 does not seem to be special in any way. It's a newly created table
and I did not calculate stats for any columns. Actually, I can sum a single
column several times in a query and hit a landslide performance drop at some
"magic" point. Setting "set spark.sql.codegen.wholeStage=false" ...
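
As a rough way to check that setting, one can disable whole-stage code
generation for the session and re-run the timing. This is only a sketch: df is
assumed to be a DataFrame over the table in question, and the column name x29
and the repeat count of 60 are placeholders:

# Sketch: compare the same aggregation with whole-stage codegen on and off.
import timeit

for flag in ("true", "false"):
    spark.conf.set("spark.sql.codegen.wholeStage", flag)
    agg = df.groupBy().sum(*(['x29'] * 60))  # same column summed many times
    print flag, timeit.timeit(agg.collect, number=1)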
What happens if you run the following query on its own? How long does it take?

SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field

Have stats been updated for all columns in Hive? And what is the type of the
x29 field?
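
If it helps, a quick way to time that single query from the same PySpark
session is sketched below (it assumes the table is registered and visible to
the session):

# Sketch: time the single-aggregate query on its own.
import time

q = ("SELECT field, SUM(x29) FROM parquet_table "
     "WHERE partition = 1 GROUP BY field")
start = time.time()
spark.sql(q).collect()
print "%.3f seconds" % (time.time() - start)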
HTH
Dr Mich Talebzadeh
Hi,
When I run a query like "SELECT field, SUM(x1), SUM(x2)... SUM(x28) FROM
parquet_table WHERE partition = 1 GROUP BY field" it runs in under 2 seconds,
but when I add just one more aggregate field, "SELECT field, SUM(x1),
SUM(x2)... SUM(x28), SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY
field", it runs about 5x slower.
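
A small sketch of how one might time the two variants back to back (table
name, column names x1..x29, and the partition filter are taken from the query
above):

# Sketch: build and time the 28-column and 29-column aggregation queries.
import time

def run(n):
    cols = ", ".join("SUM(x%d)" % i for i in range(1, n + 1))
    q = ("SELECT field, %s FROM parquet_table "
         "WHERE partition = 1 GROUP BY field" % cols)
    start = time.time()
    spark.sql(q).collect()
    return time.time() - start

print "28 sums: %.2f s, 29 sums: %.2f s" % (run(28), run(29))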
Can this be related to SPARK-17115 ?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-SQL-runs-5x-times-slower-when-adding-29th-field-to-aggregation-tp27624p27643.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.