I added the following to my SparkSession in 3.1.1, and the result was exactly
the same as before (local master). 3.1.1 is still 4-5 times slower than 3.0.2,
at least for that piece of code. I will investigate further to see how it
behaves with other workloads, especially anything without .transform or
Spark ML related functions, but the small code I provided, run on any dataset
big enough to take a minute to finish, will show a 4-5x slowdown going
from 3.0.2 to 3.1.1:
.config("spark.sql.adaptive.coalescePartitions.enabled", "false")
.config("spark.sql.adaptive.enabled", "false")
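
To make the comparison repeatable, the two settings above can be applied in a
small script that times the same action under each PySpark version. This is
only a minimal sketch: the helper names (build_session, time_action,
run_comparison) and the placeholder workload are hypothetical and are not the
code shared earlier in the thread; only the two config keys come from this
message.

```python
import time
from statistics import median

# The two settings quoted above; both switch adaptive query execution off.
AQE_OFF = {
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.adaptive.coalescePartitions.enabled": "false",
}


def build_session(app_name, overrides):
    """Create a local SparkSession with the given config overrides applied."""
    # Imported lazily so the timing helper works even without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in overrides.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def time_action(action, repeats=3):
    """Run `action` `repeats` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return median(timings)


def run_comparison():
    """Time a placeholder workload with AQE disabled (hypothetical example)."""
    spark = build_session("aqe-off-timing", AQE_OFF)
    # Placeholder workload: replace with the real pipeline and its .show() call.
    df = spark.range(0, 10_000_000).selectExpr("id % 100 AS k").groupBy("k").count()
    seconds = time_action(lambda: df.show(5))
    spark.stop()
    return seconds
```

Running run_comparison() once in a pyspark==3.0.2 environment and once in a
pyspark==3.1.1 environment gives two medians that can be compared directly.
The median is used so that a single slow warm-up run does not dominate.
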
> On 8 Apr 2021, at 16:47, Mich Talebzadeh wrote:
>
> spark 3.1.1
>
> I enabled the parameter
>
> spark_session.conf.set("spark.sql.adaptive.enabled", "true")
>
> to see its effects
>
> on a YARN cluster, i.e. spark-submit --master yarn --deploy-mode client
>
> with 4 executors it crashed the cluster.
>
> I then reduced the number of executors to 2, and this time it ran OK, but the
> performance was worse.
>
> I assume it adds some overhead?
>
> On Thu, 8 Apr 2021 at 15:05, Maziyar Panahi wrote:
> Thanks Sean,
>
> I have already tried adding that and the result is absolutely the same.
>
> That config cannot be the cause (at least not alone), because my comparison
> is between Spark 3.0.2 and Spark 3.1.1. Its default has been the same since
> 3.0.0 (false, according to the docs linked below) and hasn't changed:
>
> - https://spark.apache.org/docs/3.1.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.2/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html#adaptive-query-execution
> - https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html#adaptive-query-execution
>
> So it can't be a good thing for 3.0.2 and a bad thing for 3.1.1.
> Unfortunately, the issue is somewhere else.
>
>> On 8 Apr 2021, at 15:54, Sean Owen wrote:
>>
>> Right, you already established a few times that the difference is the number
>> of partitions. Russell answered with what is almost surely the correct
>> answer, that it's AQE. In toy cases it isn't always a win.
>> Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up
>> more realistic workloads in general.
>>
>> On Thu, Apr 8, 2021 at 8:52 AM maziyar wrote:
>> So this is what I have in my Spark UI for 3.0.2 and 3.1.1:
>>
>> - pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"):
>>   finished in 10 seconds
>> - pyspark==3.1.1 (same stage "showString at
>>   NativeMethodAccessorImpl.java:0"): finished in 39 seconds
>>
>> As you can see, everything is literally the same between 3.0.2 and 3.1.1
>> (number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle
>> Write), except that 3.0.2 runs all 12 tasks together, while 3.1.1 finishes
>> 10/12 and the other 2 do the processing of the actual task, which I shared
>> previously:
>>
>> [Spark UI screenshots: 3.1.1, 3.0.2]
>>
>> PS: I have just run the same test in Databricks with 1 worker:
>>
>> - 8.1 (includes Apache Spark 3.1.1, Scala 2.12)
>> - 7.6 (includes Apache Spark 3.0.1, Scala 2.12)
>>
>> There is still a difference of over 20 seconds, which is a big bump when
>> the whole process takes about a minute. I'm not sure what it is, but until
>> further notice I will advise our users not to use Spark/PySpark 3.1.1
>> locally or in Databricks. (There are other optimizations, so maybe it's not
>> always noticeable, but this is very simple code, and it can quickly become
>> a bottleneck in larger pipelines.)
>