Hi,
Has anyone measured the performance of Spark 2.0 vs Spark 1.6?

I did some tests on a Parquet file with many nested columns (about 30 GB
in 400 partitions), and Spark 2.0 is sometimes 2x slower.

I tested the following queries (a rough reproduction sketch follows the list):
1) select count(*) where id > some_id
Here predicate pushdown (PPD) works and performance is similar (about 1 sec).

2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min
Is it normal that neither version does PPD here?

3) Filter plus flatMap from Python (PySpark):
df.where('id > some_id').rdd \
    .flatMap(lambda r: [r.id] if not r.id % 100000 else []) \
    .collect()
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)
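
For reference, this is roughly how I run all three (df, some_id and the path
are placeholders for my real data; on 1.6 I use sqlContext, on 2.0 the
SparkSession read call is the same):

df = sqlContext.read.parquet('/path/to/nested.parquet')  # ~30 GB, 400 partitions

# 1) filter on a top-level column -- PPD works, ~1 sec on both versions
df.where('id > some_id').count()

# 2) filter on a nested column -- no PPD, minutes on both versions
df.where('nested_column.id > some_id').count()

# 3) the Python filter + flatMap from point 3
df.where('id > some_id').rdd \
    .flatMap(lambda r: [r.id] if not r.id % 100000 else []) \
    .collect()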

I used BasicProfiler for this task, and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec
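
For completeness, profiling was enabled roughly like this (BasicProfiler is
the default once spark.python.profile is on):

from pyspark import SparkConf, SparkContext

conf = SparkConf().set('spark.python.profile', 'true')
sc = SparkContext(conf=conf)
# ... run query 3 here ...
sc.show_profiles()  # the cumulative times above come from this output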

Should I expect such a drop in performance?

BTW: why did DataFrame lose its map and flatMap methods in Spark 2.0?
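For now I'm going through the RDD explicitly, which I assume is the intended
replacement:

# Spark 1.6 (PySpark) allowed df.flatMap(...) directly; in 2.0 I do:
df.rdd.flatMap(lambda r: [r.id]).collect()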

I don't know how to prepare sample data that shows the problem.
Any ideas? Or is there public data with many nested columns?
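
The best I can think of is generating something synthetic along these lines
(a tiny sketch -- the column names and sizes are made up, and my real file
has far more nested fields):

from pyspark.sql import Row

rows = sc.parallelize(range(1000000), 400) \
    .map(lambda i: Row(id=i, nested_column=Row(id=i, value=str(i))))
sqlContext.createDataFrame(rows).write.parquet('/tmp/nested_sample.parquet')

though I doubt something that small will show the difference.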

I'd like to create a JIRA for it, but the Apache server is down at the moment.

Regards,
-- 
Maciek Bryński
