So this is what I have in my Spark UI for 3.0.2 and 3.1.1:

For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"):
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png>
Finished in 10 seconds.

For pyspark==3.1.1 (same stage "showString at NativeMethodAccessorImpl.java:0"):
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png>
Finished the same stage in 39 seconds.

As you can see, everything is literally the same between 3.0.2 and 3.1.1 (number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle Write), except that 3.0.2 runs all 12 tasks together, while 3.1.1 finishes 10/12 quickly and the remaining 2 do the actual processing, as I shared previously:

3.1.1
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png>
3.0.2
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png>
PS: I have just made the same test in Databricks with 1 worker.

8.1 (includes Apache Spark 3.1.1, Scala 2.12):
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png>
7.6 (includes Apache Spark 3.0.1, Scala 2.12):
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/Screenshot_2021-04-08_at_15.png>
There is still a difference of over 20 seconds, which is a big bump when the whole process finishes within a minute. I am not sure what causes it, but until further notice I will advise our users not to use Spark/PySpark 3.1.1 locally or in Databricks. (There may be other optimizations that make it unnoticeable elsewhere, but this is such simple code, and it could quickly become a bottleneck in larger pipelines.)
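In case anyone wants to reproduce the comparison outside the Spark UI, here is a minimal wall-clock timing sketch. It is plain Python with no Spark dependency; `time_action` and the lambda workload are my own hypothetical names. The idea is to drop the same script into two virtualenvs, one with pyspark==3.0.2 and one with pyspark==3.1.1, and wrap the action under test (e.g. `lambda: df.show()`) instead of the dummy workload:

```python
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable several times and return the best
    wall-clock duration, to reduce noise from cold starts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Placeholder workload; replace with the Spark action under test,
    # e.g. `lambda: df.show()`, and run once per pyspark version.
    elapsed = time_action(lambda: sum(range(1_000_000)))
    print(f"best of {3}: {elapsed:.3f}s")
```

Best-of-N is used instead of an average because scheduler and JVM warm-up noise only ever makes runs slower, never faster.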



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/