Hi, I have a simple piece of code. When I run it on Spark/PySpark 3.1.1 (tried both Scala and Python) it finishes in about 5 minutes. The same code, same data, and same SparkSession configs on Spark/PySpark 3.0.2 finish in about a minute, so roughly 5x faster on Spark 3.0.2.
My SparkSession in both tests:

val spark: SparkSession = SparkSession
  .builder()
  .appName("test")
  .master("local[*]")
  .config("spark.driver.memory", "16G")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.kryoserializer.buffer.max", "200M")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

Environments in which I compared Spark 3.0.x vs. Spark 3.1.x:

- IntelliJ
- spark-shell
- pyspark shell
- pure Python with pyspark==3.0.2 and pyspark==3.1.1

The code to reproduce the issue and the initial report:
https://github.com/JohnSnowLabs/spark-nlp/issues/2739#issuecomment-815635930

In Spark 3.1.1 (Spark UI), 2 of the 12 tasks end up doing most of the processing:
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png

In Spark 3.0.2, all of the tasks are processed in parallel at the same time:
http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png

So in Spark 3.0.x the repartition to 12 is respected and all tasks run in parallel at the same time, while in Spark 3.1.x (possibly my own fault for not setting, or not disabling, some config) that parallelism is lost.

Is there a config that is set automatically (or a new conf introduced in Spark 3.1.x) that could cause this unbalanced partitioning? I checked the migration guide from 3.0.x to 3.1.x but couldn't find anything related:
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31
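In case it helps, here is a minimal diagnostic sketch I can run on both versions (it is not part of the linked repro; the DataFrame here is just a placeholder standing in for the one built in that code). It prints the partition count and rows per partition after repartition(12), and dumps the effective SQL confs so the output from 3.0.2 and 3.1.1 can be diffed for a changed default:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-check")
  .master("local[*]")
  .getOrCreate()

// Placeholder DataFrame standing in for the one from the linked repro.
val df = spark.range(0, 1000000).toDF("id").repartition(12)

// Confirm the repartition count is actually respected.
println(s"numPartitions = ${df.rdd.getNumPartitions}")

// Rows per partition: heavily skewed counts would match 2 tasks doing most of the work.
df.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }

// Dump every SQL conf with its effective value; run on 3.0.2 and 3.1.1 and diff the output.
spark.sql("SET -v")
  .select("key", "value")
  .collect()
  .foreach(r => println(s"${r.getString(0)}=${r.getString(1)}"))

A diff of the SET -v output between the two sessions should at least surface any conf whose default changed between the versions, even if I don't know yet which one is responsible.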