Hi Rao,

Yes, I have created this ticket: 
https://issues.apache.org/jira/browse/SPARK-35066 

It's not assigned to anybody, so I don't have an ETA on the fix or possible 
workarounds.

Best
Maziyar

> On 18 May 2021, at 07:42, Rao, Abhishek (Nokia - IN/Bangalore) 
> <abhishek....@nokia.com> wrote:
> 
> Hi Maziyar, Mich
>  
> Do we have any ticket to track this? Any idea if this is going to be fixed in 
> 3.1.2?
>  
> Thanks and Regards,
> Abhishek
>  
> From: Mich Talebzadeh <mich.talebza...@gmail.com> 
> Sent: Friday, April 9, 2021 2:11 PM
> To: Maziyar Panahi <maziyar.pan...@iscpif.fr>
> Cc: User <user@spark.apache.org>
> Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x
>  
>  
> Hi,
>  
> Regarding your point:
>  
> .... I won't be able to defend this request by telling Spark users the 
> previous major release was and still is more stable than the latest major 
> release ...
>  
> With the benefit of hindsight, version 3.1.1 was released only recently, so 
> the question of stability (from a practical point of view) does not come into 
> it yet. That is perhaps the reason why some vendors like Cloudera stay a few 
> releases behind the latest version. In production, what matters most is 
> predictability and stability. You are not doing anything wrong by rolling it 
> back and awaiting further clarification and resolution of the error.
>  
> HTH
> 
> 
>  
>    view my Linkedin profile 
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
>  
>  
> On Fri, 9 Apr 2021 at 08:58, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
> Thanks Mich, I will ask all of our users to use pyspark 3.0.x and will change 
> all the notebooks/scripts to switch back from 3.1.1 to 3.0.2. 
>  
> That being said, I won't be able to defend this request by telling Spark 
> users the previous major release was and still is more stable than the latest 
> major release, the one that everything now defaults to (pyspark, downloads, 
> etc.).
>  
> I'll see if I can open a ticket for this as well.
> 
> 
> On 8 Apr 2021, at 17:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>  
> Well, the normal course of action (considering the law of diminishing 
> returns) is that your mileage may vary:
>  
> Spark 3.0.1 is pretty stable and good enough. Unless there is an overriding 
> reason why you have to use 3.1.1, you can set it aside and try it when you 
> have other use cases. For now I guess you can carry on with 3.0.1 as BAU.
>  
> HTH
>  
>  
>  
>  
>  
> On Thu, 8 Apr 2021 at 16:19, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
> I personally added the following to my SparkSession in 3.1.1 and the result 
> was exactly the same as before (local master). 3.1.1 is still 4-5 times 
> slower than 3.0.2, at least for that piece of code. I will do more 
> investigation to see how it behaves with other workloads, especially anything 
> without .transform or Spark ML related functions, but the small code I 
> provided, on any dataset big enough to take a minute to finish, will show you 
> a 4-5x slowdown going from 3.0.2 to 3.1.1:
>  
> .config("spark.sql.adaptive.coalescePartitions.enabled", "false")
> .config("spark.sql.adaptive.enabled", "false")
>  
> 
> 
> On 8 Apr 2021, at 16:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>  
> Spark 3.1.1
>  
> I enabled the parameter
>  
> spark_session.conf.set("spark.sql.adaptive.enabled", "true")
>  
> to see its effects
>  
> on the YARN cluster, i.e. spark-submit --master yarn --deploy-mode client 
>  
> with 4 executors it crashed the cluster.
>  
> I then reduced the number of executors to 2 and this time it ran OK, but the 
> performance was worse.
>  
> I assume it adds some overhead?
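> 
> For reference, the submit command was roughly along these lines (the script 
> name and executor count are indicative, not my exact settings):
> 
> spark-submit --master yarn --deploy-mode client \
>   --num-executors 2 \
>   --conf spark.sql.adaptive.enabled=true \
>   my_app.py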
>  
>  
>  
>  
>  
>  
> On Thu, 8 Apr 2021 at 15:05, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
> Thanks Sean, 
>  
> I have already tried adding that and the result is absolutely the same.
>  
> The reason that config cannot be the cause (at least not alone) is that my 
> comparison is between Spark 3.0.2 and Spark 3.1.1, and this config has been 
> set the same way since the beginning of 3.0.0 and hasn't changed:
>  
> - https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution
>  
> So it can't be a good thing for 3.0.2 and a bad thing for 3.1.1; 
> unfortunately, the issue is somewhere else.
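> 
> A quick sanity check, run under each version's pyspark, confirms the 
> effective value (just a sketch):
> 
> from pyspark.sql import SparkSession
> 
> spark = SparkSession.builder.getOrCreate()
> print(spark.version)                                 # 3.0.2 vs 3.1.1
> print(spark.conf.get("spark.sql.adaptive.enabled"))  # effective setting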
> 
> 
> On 8 Apr 2021, at 15:54, Sean Owen <sro...@gmail.com> wrote:
>  
> Right, you already established a few times that the difference is the number 
> of partitions. Russell answered with what is almost surely the correct 
> answer, that it's AQE. In toy cases it isn't always a win. 
> Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up 
> more realistic workloads in general.
>  
> On Thu, Apr 8, 2021 at 8:52 AM maziyar <maziyar.pan...@iscpif.fr> wrote:
> So this is what I have in my Spark UI for 3.0.2 and 3.1.1:
>  
> For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"): 
> finished in 10 seconds.
> For pyspark==3.1.1 (same stage "showString at 
> NativeMethodAccessorImpl.java:0"): finished in 39 seconds.
>  
> As you can see, everything is literally the same between 3.0.2 and 3.1.1 
> (number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle 
> Write), except that 3.0.2 runs all 12 tasks together while 3.1.1 finishes 
> 10/12 first and the other 2 are the processing of the actual task which I 
> shared previously:
>  
> [Spark UI screenshots: 3.1.1 and 3.0.2]
>  
> PS: I have just made the same test in Databricks with 1 worker, on runtime 
> 8.1 (includes Apache Spark 3.1.1, Scala 2.12) and runtime 7.6 (includes 
> Apache Spark 3.0.1, Scala 2.12). There is still a difference of over 20 
> seconds, which is a big bump when the whole process finishes within a minute. 
> Not sure what it is, but until further notice I will advise our users not to 
> use Spark/PySpark 3.1.1 locally or in Databricks. (There are other 
> optimizations, so maybe it's not always noticeable, but this is such simple 
> code, and it can become a bottleneck quickly in larger pipelines.)
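> 
> For what it's worth, the timings above are plain wall-clock measurements 
> around the same action; a sketch of the harness (df stands in for the dataset 
> I described, not included here):
> 
> import time
> 
> start = time.time()
> df.show()  # triggers the "showString" stage visible in the Spark UI
> print(f"elapsed: {time.time() - start:.1f}s")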
