Hi all,

First of all, let me say that I am pretty new to Spark, so this could be entirely my fault somehow. I noticed this when I was running a job on an Amazon EMR cluster with Spark 2.4.4: it finished slower than when I had run it locally (on Spark 2.4.1). I checked the event logs, and the one from the newer version had more stages.

I then did a comparison in the same environment: I created two versions of the same cluster, with the only difference being the EMR release and hence, presumably, the Spark version. The first was emr-5.24.1 with Spark 2.4.2, and the second emr-5.28.0 with Spark 2.4.4. Sure enough, the same thing happened: the run on the newer version had more stages and took almost twice as long to finish. So I am pretty much at a loss here. Could it be that it is not Spark itself, but some difference introduced between the EMR releases? At the moment I can't think of any other explanation besides it being a bug.
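In case it helps anyone take a look, the stage counts can be pulled straight out of the event logs, since they are just JSON lines with an "Event" field. A rough sketch of how to count them (the file names below are only placeholders for the two downloaded logs):

import json
from collections import Counter

def count_events(path):
    """Count the listener event types in an (uncompressed) Spark event log."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["Event"]] += 1
    return counts

# Placeholder file names for the two downloaded event logs.
for name in ["eventlog-spark-2.4.2", "eventlog-spark-2.4.4"]:
    c = count_events(name)
    print(name, "completed stages:", c["SparkListenerStageCompleted"])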
Here are the two event logs: https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing

My code is here: https://github.com/kgskgs/stars-spark3d

I ran it on the clusters like this (after uploading the scripts to S3):

spark-submit --deploy-mode cluster --py-files s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100 --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/

So yes, I was considering submitting a bug report, but the guide says it's better to ask here first. Any ideas on what's going on? Maybe I am missing something?

Regards,
Kalin
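P.S. If it would help to check whether the EMR release changes any Spark defaults, I could dump the effective configuration on each cluster and diff the two files. Something like this should do it (a quick sketch, run e.g. from the pyspark shell on each cluster; getConf().getAll() and spark.version are standard PySpark, the output file name is just a placeholder):

from pyspark.sql import SparkSession

# Dump the effective Spark configuration so the two clusters can be diffed.
spark = SparkSession.builder.appName("dump-conf").getOrCreate()
with open("spark-conf-%s.txt" % spark.version, "w") as f:
    for key, value in sorted(spark.sparkContext.getConf().getAll()):
        f.write("%s=%s\n" % (key, value))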