Hi Rao,
Yes, I have created this ticket: https://issues.apache.org/jira/browse/SPARK-35066
It's not assigned to anybody, so I don't have any ETA on the fix or possible workarounds.

Best,
Maziyar

> On 18 May 2021, at 07:42, Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com> wrote:
>
> Hi Maziyar, Mich
>
> Do we have any ticket to track this? Any idea if this is going to be fixed in 3.1.2?
>
> Thanks and Regards,
> Abhishek
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Sent: Friday, April 9, 2021 2:11 PM
> To: Maziyar Panahi <maziyar.pan...@iscpif.fr>
> Cc: User <user@spark.apache.org>
> Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x
>
> Hi,
>
> Regarding your point:
>
> ".... I won't be able to defend this request by telling Spark users the previous major release was and still is more stable than the latest major release ..."
>
> With the benefit of hindsight, version 3.1.1 was released recently, and the definition of "stable" (from a practical point of view) does not come into it yet. That is perhaps the reason why some vendors like Cloudera are a few releases away from the latest version. In production, what matters most is predictability and stability. You are not doing anything wrong by rolling it back and awaiting further clarification and resolution of the error.
>
> HTH
>
> view my LinkedIn profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Fri, 9 Apr 2021 at 08:58, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
> Thanks Mich, I will ask all of our users to use pyspark 3.0.x and will change all the notebooks/scripts to switch back from 3.1.1 to 3.0.2.
>
> That being said, I won't be able to defend this request by telling Spark users the previous major release was and still is more stable than the latest major release, something that made everything default to 3.1.1 (pyspark, downloads, etc.).
>
> I'll see if I can open a ticket for this as well.
>
> On 8 Apr 2021, at 17:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Well, the normal course of action (considering the laws of diminishing returns) is that your mileage varies:
>
> Spark 3.0.1 is pretty stable and good enough. Unless there is an overriding reason why you have to use 3.1.1, you can set it aside and try it when you have other use cases. For now I guess you can carry on with 3.0.1 as BAU.
>
> HTH
>
> On Thu, 8 Apr 2021 at 16:19, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
> I personally added the following to my SparkSession in 3.1.1 and the result was exactly the same as before (local master).
> 3.1.1 is still 4-5 times slower than 3.0.2, at least for that piece of code. I will do more investigation to see how it does with other stuff, especially anything without .transform or Spark ML related functions, but the small code I provided, on any dataset big enough to take a minute to finish, will show you the difference going from 3.0.2 to 3.1.1 by a magnitude of 4-5:
>
> .config("spark.sql.adaptive.coalescePartitions.enabled", "false")
> .config("spark.sql.adaptive.enabled", "false")
>
> On 8 Apr 2021, at 16:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> spark 3.1.1
>
> I enabled the parameter
>
> spark_session.conf.set("spark.sql.adaptive.enabled", "true")
>
> to see its effects
>
> in yarn cluster mode, i.e. spark-submit --master yarn --deploy-mode client
>
> with 4 executors, it crashed the cluster.
>
> I then reduced the number of executors to 2 and this time it ran OK, but the performance was worse.
>
> I assume it adds some overhead?
>
> On Thu, 8 Apr 2021 at 15:05, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
> Thanks Sean,
>
> I have already tried adding that and the result is absolutely the same.
>
> That config cannot be the reason (at least not alone), because my comparison is between Spark 3.0.2 and Spark 3.1.1. This config has been set to true since the beginning of 3.0.0 and hasn't changed:
>
> - https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution
>
> So it can't be a good thing for 3.0.2 and a bad thing for 3.1.1; unfortunately, the issue is somewhere else.
>
> On 8 Apr 2021, at 15:54, Sean Owen <sro...@gmail.com> wrote:
>
> Right, you already established a few times that the difference is the number of partitions. Russell answered with what is almost surely the correct answer: that it's AQE. In toy cases it isn't always a win. Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up more realistic workloads in general.
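For reference, a minimal PySpark sketch of the two flags quoted above, set at session creation. The local master, app name, and the runtime conf.set call are illustrative assumptions, not details taken from the thread:

    # Build a session with Adaptive Query Execution fully disabled,
    # mirroring the two .config() lines quoted above.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                  # assumed local master, as in the thread
        .appName("aqe-off-comparison")       # hypothetical app name
        .config("spark.sql.adaptive.enabled", "false")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "false")
        .getOrCreate()
    )

    # The main flag can also be toggled on an existing session at runtime,
    # as in the spark_session.conf.set(...) line from the thread:
    spark.conf.set("spark.sql.adaptive.enabled", "false")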
>
> On Thu, Apr 8, 2021 at 8:52 AM maziyar <maziyar.pan...@iscpif.fr> wrote:
> So this is what I have in my Spark UI for 3.0.2 and 3.1.1:
>
> For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"): finished in 10 seconds.
> For pyspark==3.1.1 (same stage "showString at NativeMethodAccessorImpl.java:0"): finished the same stage in 39 seconds.
>
> As you can see, everything is literally the same between 3.0.2 and 3.1.1: number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle Write. The exception is that 3.0.2 runs all 12 tasks together, while 3.1.1 finishes 10/12 and the other 2 are the processing of the actual task, which I shared previously:
>
> [Spark UI screenshots: 3.1.1 and 3.0.2]
>
> PS: I have just made the same test in Databricks with 1 worker:
>
> - 8.1 (includes Apache Spark 3.1.1, Scala 2.12)
> - 7.6 (includes Apache Spark 3.0.1, Scala 2.12)
>
> There is still a difference of over 20 seconds, which is a big bump when the whole process finishes within a minute. Not sure what it is, but until further notice I will advise our users not to use Spark/PySpark 3.1.1 locally or in Databricks. (There are other optimizations, so maybe it's not noticeable, but this is such simple code and it can quickly become a bottleneck in larger pipelines.)
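As an aside, a rough sketch of the kind of side-by-side timing described above: run the same action unchanged under pyspark==3.0.2 and pyspark==3.1.1 (in separate environments) and compare wall-clock time. The generated data and the groupBy are placeholders; the thread used a real dataset and Spark ML .transform calls.

    # Rough timing harness: run unchanged under pyspark==3.0.2 and
    # pyspark==3.1.1, then compare the elapsed times printed at the end.
    import time
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")                  # assumed local master
        .appName("version-timing")           # hypothetical app name
        .getOrCreate()
    )

    df = spark.range(0, 5_000_000)           # placeholder data, not the thread's dataset
    start = time.time()
    df.groupBy((F.col("id") % 12).alias("bucket")).count().show()  # any shuffle-producing action
    print("Spark %s took %.1f seconds" % (spark.version, time.time() - start))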