OK got it
Someone asked a similar, though not shuffle-related, question in the Spark
Slack channel. Here is a simple Python script that creates shuffle files in
shuffle_directory = "/tmp/spark_shuffles", simulates work in a loop, and
periodically cleans up shuffle files older than 1 second.
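A minimal sketch of such a script, assuming the 1-second age threshold and the /tmp/spark_shuffles directory named above:

import glob
import os
import time

shuffle_directory = "/tmp/spark_shuffles"
max_age_seconds = 1  # assumed cleanup threshold from the description above
os.makedirs(shuffle_directory, exist_ok=True)

def cleanup_old_shuffle_files(directory, max_age):
    # Delete files whose modification time is older than max_age seconds.
    now = time.time()
    for path in glob.glob(os.path.join(directory, "*.data")):
        if now - os.path.getmtime(path) > max_age:
            os.remove(path)

# Simulate a long-lived app: each iteration writes a fake shuffle file,
# sleeps long enough for it to age past the threshold, then purges it.
for i in range(5):
    with open(os.path.join(shuffle_directory, f"shuffle_{i}.data"), "w") as f:
        f.write("fake shuffle payload")
    time.sleep(2)
    cleanup_old_shuffle_files(shuffle_directory, max_age_seconds)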
Thanks for the suggestions, Mich, Jörn, and Adam.
The rationale for a long-lived app with a loop, versus submitting multiple
YARN applications, is mainly simplicity. The plan is to run the app on a
multi-tenant EMR cluster alongside other YARN apps. Implementing the loop
outside the Spark app would work, but it undercuts that simplicity.
Hi,
What do you propose, or what do you think will help, when these Spark jobs
are independent of each other? Once a job/iteration is complete, there is
no need to retain its shuffle files. You have a number of options to
consider, starting with Spark configuration parameters and so forth:
https://spa
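One configuration-side option along these lines is spark.cleaner.periodicGC.interval: Spark deletes shuffle files through the driver's ContextCleaner once the lineage objects referencing them are garbage collected, so shortening the GC interval tightens the cleanup cycle for a long-lived app. A rough PySpark sketch (the app name, interval, and loop body are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("long-lived-loop")  # hypothetical app name
    .config("spark.cleaner.periodicGC.interval", "5min")  # default is 30min
    .getOrCreate()
)

for batch in range(10):
    df = spark.range(1_000_000)
    df.groupBy((df.id % 100).alias("k")).count().collect()  # forces a shuffle
    # df is rebound each iteration, so the previous iteration's lineage
    # becomes garbage and the ContextCleaner can drop its shuffle files
    # after the next periodic GC.

spark.stop()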
You can try shuffling to S3 using the Cloud Shuffle Storage Plugin for S3
(https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/)
- the plugin's performance is sufficient for many Spark jobs, and it also
works on EMR. Then you can use S3 lifecycle policies to expire the old
shuffle data automatically.
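A hedged sketch of the lifecycle-policy side using boto3; the bucket name and prefix are hypothetical, and the prefix should point at wherever the plugin is configured to write its shuffle data:

import boto3

s3 = boto3.client("s3")

# Expire shuffle objects one day after creation (one day is the smallest
# granularity S3 expiration supports).
s3.put_bucket_lifecycle_configuration(
    Bucket="my-emr-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-spark-shuffle",
                "Filter": {"Prefix": "spark-shuffle/"},  # hypothetical prefix
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)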
If you're using dynamic allocation, it could be caused by executors with
shuffle data being deallocated before the shuffle is cleaned up. Once that
happens, those shuffle files never get cleaned up until the YARN application
ends. This was a big issue for us, so I added support for deleting shuffle
files of deallocated executors.
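For reference, Spark 3.3+ ships a setting matching this description, spark.shuffle.service.removeShuffle, which asks the external shuffle service to delete shuffle blocks for deallocated executors once the shuffle is no longer needed (whether it is the same change described above isn't stated here). A minimal sketch of enabling it:

from pyspark.sql import SparkSession

# With dynamic allocation plus the external shuffle service, this lets the
# shuffle service clean up blocks instead of holding them until the YARN
# application ends.
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.service.removeShuffle", "true")  # Spark 3.3+, default false
    .getOrCreate()
)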