Re: Why is Spark 3.0.x faster than Spark 3.1.x

Maziyar Panahi Thu, 08 Apr 2021 08:19:44 -0700

I personally added the following settings to my SparkSession in 3.1.1 and the 
result was exactly the same as before (local master). 3.1.1 is still 4-5 times 
slower than 3.0.2, at least for that piece of code. I will investigate further 
to see how it does with other workloads, especially anything without .transform 
or Spark ML related functions, but the small code I provided, run on any dataset 
that is big enough to take a minute to finish, will show you the slowdown going 
from 3.0.2 to 3.1.1, by a factor of 4-5:


.config("spark.sql.adaptive.coalescePartitions.enabled", "false")
.config("spark.sql.adaptive.enabled", "false")


> On 8 Apr 2021, at 16:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> spark 3.1.1
> 
> I enabled the parameter
> 
> spark_session.conf.set("spark.sql.adaptive.enabled", "true")
> 
> to see its effects
> 
> on YARN, i.e. spark-submit --master yarn --deploy-mode client 
> 
> with 4 executors it crashed the cluster.
> 
> I then reduced the number of executors to 2 and this time it ran OK, but the 
> performance was worse.
> 
> I assume it adds some overhead?
> 
> 
> On Thu, 8 Apr 2021 at 15:05, Maziyar Panahi <maziyar.pan...@iscpif.fr> wrote:
> Thanks Sean, 
> 
> I have already tried adding that and the result is absolutely the same.
> 
> That config cannot be the cause (at least not alone), because my comparison 
> is between Spark 3.0.2 and Spark 3.1.1. This config has been set to true 
> since the beginning of 3.0.0 and hasn't changed:
> 
> - https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution
> 
> So it can't be a good thing for 3.0.2 and a bad thing for 3.1.1; 
> unfortunately, the issue is somewhere else.
> 
>> On 8 Apr 2021, at 15:54, Sean Owen <sro...@gmail.com> wrote:
>> 
>> Right, you already established a few times that the difference is the number 
>> of partitions. Russell answered with what is almost surely the correct 
>> answer, that it's AQE. In toy cases it isn't always a win. 
>> Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up 
>> more realistic workloads in general.
>> 
>> On Thu, Apr 8, 2021 at 8:52 AM maziyar <maziyar.pan...@iscpif.fr> wrote:
>> So this is what I have in my Spark UI for 3.0.2 and 3.1.1:
>> 
>> For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"): 
>> finished in 10 seconds.
>> 
>> For pyspark==3.1.1 (same stage "showString at 
>> NativeMethodAccessorImpl.java:0"): finished the same stage in 39 seconds.
>> 
>> As you can see, everything is literally the same between 3.0.2 and 3.1.1 
>> (number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle 
>> Write), except that 3.0.2 runs all 12 tasks together while 3.1.1 finishes 
>> 10/12 and the other 2 are the processing of the actual task which I shared 
>> previously.
>> 
>> [Spark UI screenshots for 3.1.1 and 3.0.2 not preserved]
>> 
>> PS: I have just run the same test in Databricks with 1 worker, on 8.1 
>> (includes Apache Spark 3.1.1, Scala 2.12) and 7.6 (includes Apache Spark 
>> 3.0.1, Scala 2.12). There is still a difference of over 20 seconds, which is 
>> a big bump when the whole process takes about a minute. I am not sure what 
>> it is, but until further notice I will advise our users not to use 
>> Spark/PySpark 3.1.1 locally or in Databricks. (There are other 
>> optimizations, so maybe it's not noticeable, but this is such simple code 
>> and it can quickly become a bottleneck in larger pipelines.) 
>
