Hi, I have a simple piece of code. When I run it on Spark/PySpark 3.1.1 (tried both Scala and Python) it finishes in about 5 minutes. The same code, same data, and same SparkSession configs on Spark/PySpark 3.0.2 finish in about a minute, so roughly 5x faster on Spark 3.0.2.
My SparkSession in both tests:

val spark: SparkSession = SparkSession
  .builder()
  .appName("test")
  .master("local[*]")
  .config("spark.driver.memory", "16G")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.kryoserializer.buffer.max", "200M")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

Environments in which I compared Spark 3.0.x vs. Spark 3.1.x:

- IntelliJ
- spark-shell
- pyspark shell
- pure Python with pyspark==3.0.2 and pyspark==3.1.1

The code to reproduce the issue and the initial report:
https://github.com/JohnSnowLabs/spark-nlp/issues/2739#issuecomment-815635930

In Spark 3.1.1 (Spark UI), 2 of the 12 tasks end up doing most of the processing:
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png

In Spark 3.0.2, all of the tasks are processed in parallel at the same time:
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png

So in Spark 3.0.x the repartition to 12 is respected and all tasks run in parallel at the same time, while in Spark 3.1.x (possibly my own fault for not setting, or not disabling, some config) that parallelism is lost.

Is there a config that is set automatically (or a new conf introduced in Spark 3.1.x) that could cause this unbalanced partitioning? I checked the migration guide from 3.0.x to 3.1.x but couldn't find anything related:
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31
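In case it helps, here is a minimal diagnostic sketch I can run on both versions (it is not part of the linked repro; the DataFrame here is just a placeholder standing in for the one built in that code). It prints the partition count and rows per partition after repartition(12), and dumps the effective SQL confs so the output from 3.0.2 and 3.1.1 can be diffed for a changed default:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-check")
  .master("local[*]")
  .getOrCreate()

// Placeholder DataFrame standing in for the one from the linked repro.
val df = spark.range(0, 1000000).toDF("id").repartition(12)

// Confirm the repartition count is actually respected.
println(s"numPartitions = ${df.rdd.getNumPartitions}")

// Rows per partition: heavily skewed counts would match 2 tasks doing most of the work.
df.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }

// Dump every SQL conf with its effective value; run on 3.0.2 and 3.1.1 and diff the output.
spark.sql("SET -v")
  .select("key", "value")
  .collect()
  .foreach(r => println(s"${r.getString(0)}=${r.getString(1)}"))

A diff of the SET -v output between the two sessions should at least surface any conf whose default changed between the versions, even if I don't know yet which one is responsible.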