Hi all,

First of all, let me say that I am pretty new to Spark, so this could be entirely my fault somehow. I noticed this when I was running a job on an Amazon EMR cluster with Spark 2.4.4: it finished slower than when I had run it locally (on Spark 2.4.1). I checked the event logs, and the one from the newer version had more stages.

I then did a comparison in the same environment: I created two versions of the same cluster, with the only difference being the EMR release and hence, presumably, the Spark version. The first was emr-5.24.1 with Spark 2.4.2, and the second emr-5.28.0 with Spark 2.4.4. Sure enough, the same thing happened: the run on the newer version had more stages and took almost twice as long to finish. So I am pretty much at a loss here. Could it be that it is not Spark itself, but some difference introduced between the EMR releases? At the moment I can't think of any other explanation besides it being a bug.
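In case it helps anyone take a look, the stage counts can be pulled straight out of the event logs, since they are just JSON lines with an "Event" field. A rough sketch of how to count them (the file names below are only placeholders for the two downloaded logs):

import json
from collections import Counter

def count_events(path):
    """Count the listener event types in an (uncompressed) Spark event log."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["Event"]] += 1
    return counts

# Placeholder file names for the two downloaded event logs.
for name in ["eventlog-spark-2.4.2", "eventlog-spark-2.4.4"]:
    c = count_events(name)
    print(name, "completed stages:", c["SparkListenerStageCompleted"])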
Here are the two event logs: https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing

My code is here: https://github.com/kgskgs/stars-spark3d

I ran it on the clusters like this (after uploading the scripts to S3):

spark-submit --deploy-mode cluster --py-files s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100 --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/

So yes, I was considering submitting a bug report, but the guide says it's better to ask here first. Any ideas on what's going on? Maybe I am missing something?

Regards,
Kalin
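P.S. If it would help to check whether the EMR release changes any Spark defaults, I could dump the effective configuration on each cluster and diff the two files. Something like this should do it (a quick sketch, run e.g. from the pyspark shell on each cluster; getConf().getAll() and spark.version are standard PySpark, the output file name is just a placeholder):

from pyspark.sql import SparkSession

# Dump the effective Spark configuration so the two clusters can be diffed.
spark = SparkSession.builder.appName("dump-conf").getOrCreate()
with open("spark-conf-%s.txt" % spark.version, "w") as f:
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        f.write("%s=%s\n" % (key, value))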