I personally added the following to my SparkSession in 3.1.1 and the result was exactly the same as before (local master). 3.1.1 is still 4-5 times slower than 3.0.2, at least for that piece of code. I will investigate further to see how it behaves with other workloads, especially anything that does not use .transform or Spark ML related functions, but the small code I provided, run on any dataset big enough to take about a minute to finish, will show you the 4-5x difference going from 3.0.2 to 3.1.1:

.config("spark.sql.adaptive.coalescePartitions.enabled", "false")
.config("spark.sql.adaptive.enabled", "false")
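For reference, a minimal sketch of how these options can be set when building the session (the app name is just illustrative; local[*] mirrors the local master mentioned above):

from pyspark.sql import SparkSession

# Sketch: build a local SparkSession with AQE and partition coalescing
# disabled, matching the two configs above.
spark = (
    SparkSession.builder
        .appName("aqe-comparison")        # illustrative name
        .master("local[*]")
        .config("spark.sql.adaptive.enabled", "false")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "false")
        .getOrCreate()
)

print(spark.version)                                  # e.g. 3.1.1
print(spark.conf.get("spark.sql.adaptive.enabled"))   # false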
.config("spark.sql.adaptive.coalescePartitions.enabled", "false") .config("spark.sql.adaptive.enabled", "false") > On 8 Apr 2021, at 16:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > spark 3.1.1 > > I enabled the parameter > > spark_session.conf.set("spark.sql.adaptive.enabled", "true") > > to see it effects > > in yarn cluster mode, i.e spark-submit --master yarn --deploy-mode client > > with 4 executors it crashed the cluster. > > I then reduced the number of executors to 2 and this time it ran OK but the > performance is worse > > I assume it adds some overhead? > > > > view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > Disclaimer: Use it at your own risk. Any and all responsibility for any loss, > damage or destruction of data or any other property which may arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > > > On Thu, 8 Apr 2021 at 15:05, Maziyar Panahi <maziyar.pan...@iscpif.fr > <mailto:maziyar.pan...@iscpif.fr>> wrote: > Thanks Sean, > > I have already tried adding that and the result is absolutely the same. > > The reason that config cannot be the reason (at least not alone) it's because > my comparison is between Spark 3.0.2 and Spark 3.1.1. This config has been > set to true the beginning of 3.0.0 and hasn't changed: > > - > https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution > > <https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution> > - > https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution > > <https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution> > - > https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution > > <https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution> > - > https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution > > <https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution> > > So it can't be a good thing for 3.0.2 and a bad thing for 3.1.1, > unfortunately the issue is some where else. > >> On 8 Apr 2021, at 15:54, Sean Owen <sro...@gmail.com >> <mailto:sro...@gmail.com>> wrote: >> >> Right, you already established a few times that the difference is the number >> of partitions. Russell answered with what is almost surely the correct >> answer, that it's AQE. In toy cases it isn't always a win. >> Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up >> more realistic workloads in general. 
>>
>> On Thu, Apr 8, 2021 at 8:52 AM maziyar <maziyar.pan...@iscpif.fr> wrote:
>> So this is what I have in my Spark UI for 3.0.2 and 3.1.1:
>>
>> For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"): finished in 10 seconds.
>> For pyspark==3.1.1 (same stage "showString at NativeMethodAccessorImpl.java:0"): finished in 39 seconds.
>>
>> As you can see, everything is literally the same between 3.0.2 and 3.1.1 (number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle Write), except that 3.0.2 runs all 12 tasks together while 3.1.1 finishes 10/12 first and the remaining 2 do the actual processing I shared previously.
>>
>> [Spark UI screenshots: 3.1.1 vs 3.0.2]
>>
>> PS: I have just run the same test in Databricks with 1 worker on:
>> - 8.1 (includes Apache Spark 3.1.1, Scala 2.12)
>> - 7.6 (includes Apache Spark 3.0.1, Scala 2.12)
>>
>> There is still a difference of over 20 seconds, which is a big bump when the whole process finishes within a minute. Not sure what it is, but until further notice I will advise our users not to use Spark/PySpark 3.1.1 locally or in Databricks. (There are other optimizations, so maybe it's not noticeable elsewhere, but this is such simple code and it can become a bottleneck quickly in larger pipelines.)
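For anyone who wants to reproduce the comparison above, a minimal sketch (the range() data, the 12 initial partitions, and the "bucket" column are only stand-ins for the real pipeline) that prints the Spark version, the runtime AQE settings, and the post-shuffle partition count, which is where the two versions were reported to differ:

from pyspark.sql import SparkSession

# Illustrative only: stand-in data instead of the real dataset/pipeline.
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(10_000_000).repartition(12)   # 12 upstream tasks, as in the UI comparison above
result = df.groupBy((df.id % 100).alias("bucket")).count()

print(spark.version)
print(spark.conf.get("spark.sql.adaptive.enabled"))
print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))
print(result.rdd.getNumPartitions())   # with AQE on, post-shuffle partitions may be coalesced

Running the same script against separate pyspark==3.0.2 and pyspark==3.1.1 environments makes it easy to see whether the settings and the resulting partition counts differ between the two versions.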