Hi Maciej, In Spark, projection pushdown is currently limited to top-level columns (StructFields). VideoAmp has very large parquet-based tables (many billions of records accumulated per day) with deeply nested schema (four or five levels), and we've spent a considerable amount of time optimizing query performance on these tables.
We have a patch internally that extends Spark to support projection pushdown for arbitrarily nested fields. This has resulted in a *huge* performance improvement for many of our queries, like 10x to 100x in some cases. I'm still putting the finishing touches on our port of this patch to Spark master and 2.0. We haven't done any specific benchmarking between versions, but I will do that when our patch is complete. We hope to contribute this functionality to the Spark project at some point in the near future, but it is not ready yet. I'm sorry I don't have any concrete advice for you, but I hope this helps shed some light on the current support in Spark for projection pushdown. Michael > On Jun 29, 2016, at 1:48 PM, Maciej Bryński <mac...@brynski.pl> wrote: > > Hi, > Did anyone measure performance of Spark 2.0 vs Spark 1.6 ? > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes 2x slower. > > I tested following queries: > 1) select count(*) where id > some_id > In this query we have PPD and performance is similar. (about 1 sec) > > 2) select count(*) where nested_column.id > some_id > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Is it normal that both version didn't do PPD ? > > 3) Spark connected with python > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % > 100000 else []).collect() > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > > Should I expect such a drop in performance ? > > BTW: why in Spark 2.0 Dataframe lost map and flatmap method ? > > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > > I'd like to create Jira for it but Apache server is down at the moment. > > Regards, > -- > Maciek Bryński > > --------------------------------------------------------------------- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org