Thanks Sean, 

I have already tried adding that and the result is absolutely the same.

That config cannot be the cause (at least not alone), because my comparison is 
between Spark 3.0.2 and Spark 3.1.1, and this config has been set to true since 
the beginning of 3.0.0 and hasn't changed:

- https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution
- https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution
- https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution
- https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution

So it can't be a good thing for 3.0.2 and a bad thing for 3.1.1; unfortunately, 
the issue is somewhere else.
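
For reference, this is roughly what I ran to pin the config explicitly and 
confirm the effective value on both versions (a minimal sketch; the app name is 
just a placeholder):

    from pyspark.sql import SparkSession

    # Pin AQE explicitly so both versions run with identical settings,
    # instead of relying on each version's documented default.
    spark = (
        SparkSession.builder
        .appName("aqe-comparison")  # placeholder name
        .config("spark.sql.adaptive.enabled", "false")
        .getOrCreate()
    )

    # Print the value each version actually uses at runtime.
    print(spark.version, spark.conf.get("spark.sql.adaptive.enabled"))

Both versions report the same effective value, and the timing gap remains.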

> On 8 Apr 2021, at 15:54, Sean Owen <sro...@gmail.com> wrote:
> 
> Right, you already established a few times that the difference is the number 
> of partitions. Russell answered with what is almost surely the correct 
> answer, that it's AQE. In toy cases it isn't always a win. 
> Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up 
> more realistic workloads in general.
> 
> On Thu, Apr 8, 2021 at 8:52 AM maziyar <maziyar.pan...@iscpif.fr> wrote:
> So this is what I have in my Spark UI for 3.0.2 and 3.1.1:
> 
> For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"): 
> finished in 10 seconds.
> For pyspark==3.1.1 (same stage "showString at NativeMethodAccessorImpl.java:0"): 
> finished in 39 seconds.
> 
> As you can see, everything is literally the same between 3.0.2 and 3.1.1 
> (number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle 
> Write), except that 3.0.2 runs all 12 tasks together while 3.1.1 finishes 
> 10/12 and the other 2 are the processing of the actual task, which I shared 
> previously:
> 
> [screenshots: 3.1.1 and 3.0.2]
> 
> PS: I have just made the same test in Databricks with 1 worker:
> 
> - 8.1 (includes Apache Spark 3.1.1, Scala 2.12) [screenshot]
> - 7.6 (includes Apache Spark 3.0.1, Scala 2.12) [screenshot]
> 
> There is still a difference of over 20 seconds, which, when the whole process 
> finishes within a minute, is a big bump. Not sure what it is, but until 
> further notice I will advise our users not to use Spark/PySpark 3.1.1 locally 
> or in Databricks. (There are other optimizations and maybe it's not always 
> noticeable, but this is such simple code that it can become a bottleneck 
> quickly in larger pipelines.)
