Hi,

Has anyone measured the performance of Spark 2.0 vs Spark 1.6? I did some tests on a Parquet file with many nested columns (about 30 GB in 400 partitions), and Spark 2.0 is sometimes 2x slower.
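For reference, the relevant part of the schema looks roughly like this (a simplified sketch; only the two fields used in the queries below are shown, the real file has many more nested fields):

from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([
    StructField("id", LongType()),
    StructField("nested_column", StructType([
        StructField("id", LongType()),
        # ... many more nested fields in the real file
    ])),
])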
I tested the following queries:

1) select count(*) where id > some_id

Here PPD (Parquet predicate pushdown) works and performance is similar in both versions (about 1 sec).

2) select count(*) where nested_column.id > some_id

Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Is it normal that neither version did PPD on the nested column?

3) Spark used from Python:

df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()

Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec

Should I expect such a drop in performance?

BTW: why did DataFrame lose the map and flatMap methods in Spark 2.0?

I don't know how to prepare sample data that shows the problem (a rough sketch of what I have in mind is in the P.S. below). Any ideas? Or is there public data with many nested columns?

I'd like to create a JIRA for this, but the Apache server is down at the moment.

Regards,
--
Maciek Bryński
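P.S. The kind of generator I have in mind would look something like this (an untested sketch assuming the pyspark shell's sc/sqlContext; the row count, the extra 'payload' field and the output path are made up, and I don't know whether such synthetic data reproduces the slowdown):

from pyspark.sql import Row

def make_row(i):
    # one top-level id plus a struct carrying the same id and some filler data
    return Row(id=i, nested_column=Row(id=i, payload='x' * 100))

df = sqlContext.createDataFrame(sc.parallelize(range(1, 1000001), 400).map(make_row))
df.write.mode('overwrite').parquet('/tmp/nested_test')

# query 2 from above, run against the generated file:
sqlContext.read.parquet('/tmp/nested_test').where('nested_column.id > 500000').count()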