Maciej Bryński created SPARK-16321: --------------------------------------
Summary: Pyspark 2.0 performance drop vs pyspark 1.6
Key: SPARK-16321
URL: https://issues.apache.org/jira/browse/SPARK-16321
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Reporter: Maciej Bryński

I ran some tests on a Parquet file with many nested columns (about 30 GB in 400 partitions), and Spark 2.0 is roughly 2x slower.

{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id % 100000 else []).collect()
{code}

Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task, and the cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec

Should I expect such a drop in performance? I don't know how to prepare sample data to show the problem. Any ideas? Or public data with many nested columns?
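
Since the report asks how to prepare sample data, below is a minimal sketch of one way to generate a synthetic Parquet file with nested columns and rerun the same filter + flatMap with a wall-clock timer. The schema, row counts, filter threshold, and output path are illustrative assumptions (not the reporter's real data), and {{sc}} / {{sqlctx}} refer to an existing SparkContext and SQLContext as in the report.

{code}
import time
from pyspark.sql import Row

path = '/tmp/nested_bench'   # hypothetical output location
some_id = 5000000            # hypothetical filter threshold
n_rows = 10 * 1000 * 1000    # assumed size; tune up to approach the 30 GB case
n_parts = 400

def make_rows(part):
    # Generate rows on the executors: a top-level id plus a nested struct
    # and an array of structs, to mimic a "many nested columns" layout.
    per_part = n_rows // n_parts
    base = part * per_part
    for i in range(base, base + per_part):
        yield Row(id=i,
                  meta=Row(a=i % 7, b='x' * 20, c=float(i)),
                  events=[Row(ts=i + j, v=j * 0.5) for j in range(5)])

# Write the synthetic nested data as Parquet in 400 partitions.
rows = sc.parallelize(range(n_parts), n_parts).flatMap(make_rows)
sqlctx.createDataFrame(rows).write.mode('overwrite').parquet(path)

# Same query shape as in the report, timed end to end.
df = sqlctx.read.parquet(path)
start = time.time()
result = (df.where('id > %d' % some_id)
            .rdd
            .flatMap(lambda r: [r.id] if not r.id % 100000 else [])
            .collect())
print('collected %d ids in %.1f sec' % (len(result), time.time() - start))
{code}

Running the same script unchanged against a 1.6 and a 2.0 cluster would give a like-for-like comparison on data anyone can regenerate, without needing the original 30 GB file.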
