[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383785#comment-15383785 ]
Maciej Bryński edited comment on SPARK-16321 at 7/19/16 8:27 AM:
-----------------------------------------------------------------

Bingo. I think I found the answer.

I ran the following query:
{code}
o.where("id > 9000000").where('nested_column.id % 100000 = 0').select("id").collect()
{code}
The second where is there to force Spark to read the whole nested_column structure.

SQL query plans:

*Spark 1.6*
Time: *32 sec*
{code}
== Physical Plan ==
Project [id#0]
+- Filter (((nested_column#1.id % 100000) = 0) && (id#0 > 9000000))
   +- Scan ParquetRelation[id#0,nested_column#1] InputPaths: file:some_file, PushedFilters: [GreaterThan(id,9000000)]
{code}

*Spark 2.0*
Time: *65 sec*
{code}
== Physical Plan ==
Project [id#0]
+- Filter ((isnotnull(id#0) && (id#0 > 9000000)) && ((nested_column#1.id % 100000) = 0))
   +- Scan parquet [id#0,nested_column#1] Format: ParquetFormat, InputPaths: file:some_file, PushedFilters: [IsNotNull(id), GreaterThan(id,9000000)], ReadSchema: struct<id:int,nested_column:struct<id:int...
{code}

As you can see, there is a difference in the pushed predicates:
Spark 1.6: GreaterThan(id,9000000)
Spark 2.0: IsNotNull(id), GreaterThan(id,9000000)


> Pyspark 2.0 performance drop vs pyspark 1.6
> -------------------------------------------
>
>                 Key: SPARK-16321
>                 URL: https://issues.apache.org/jira/browse/SPARK-16321
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Maciej Bryński
>         Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, spark2_nofilterpushdown.nps, spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, visualvm_spark2_G1GC.png
>
> I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions), and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and the cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
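
For reference, a minimal PySpark sketch of how the comparison above could be reproduced on Spark 2.0. The parquet path and session setup are placeholders, not taken from the ticket, and toggling spark.sql.parquet.filterPushdown is just one way to check whether the extra pushed IsNotNull predicate accounts for the slowdown (in the spirit of the attached spark2_nofilterpushdown.nps profile).

{code}
# Rough reproduction sketch for Spark 2.0; path and session setup are
# assumptions, not from the ticket.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SPARK-16321-repro").getOrCreate()

# Parquet data with a top-level `id` column and a wide nested struct
# `nested_column` that itself contains an `id` field.
o = spark.read.parquet("/path/to/nested_parquet")  # placeholder path

q = (o.where("id > 9000000")
      .where("nested_column.id % 100000 = 0")  # forces reading the whole struct
      .select("id"))

# Print the physical plan; on 2.0 the PushedFilters list should contain
# both IsNotNull(id) and GreaterThan(id,9000000).
q.explain(True)

# One way to isolate the effect of Parquet predicate pushdown: disable it
# and time the same collect() again.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
q.collect()
{code}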