You can try shuffling to S3 using the Cloud Shuffle Storage Plugin for S3
(https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/)
- the performance of the new plugin is sufficient for many Spark jobs (it
also works on EMR). Then you can use S3 lifecycle policies to expire the
leftover shuffle data automatically.
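A rough PySpark sketch of wiring it up, plus a boto3 lifecycle rule for the
shuffle prefix - the plugin class name and property keys are the ones from the
AWS blog post, so treat them as assumptions and check them against the version
you deploy:

    from pyspark.sql import SparkSession
    import boto3

    # Route Spark's shuffle/spill files to S3 via the cloud shuffle plugin
    # (the plugin jar must already be on the driver/executor classpath).
    spark = (
        SparkSession.builder
        .appName("shuffle-to-s3")
        .config("spark.shuffle.sort.io.plugin.class",
                "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")  # class name as given in the blog post
        .config("spark.shuffle.storage.path", "s3://my-bucket/spark-shuffle/")  # hypothetical bucket/prefix
        .getOrCreate()
    )

    # Expire leftover shuffle objects after a few days so the bucket doesn't
    # grow without bound when jobs die before cleaning up after themselves.
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket="my-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-spark-shuffle",
                "Filter": {"Prefix": "spark-shuffle/"},
                "Status": "Enabled",
                "Expiration": {"Days": 3},
            }]
        },
    )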
If you're using dynamic allocation, it could be caused by executors with
shuffle data being deallocated before the shuffle is cleaned up. Once that
happens, those shuffle files never get cleaned up until the YARN application
ends. This was a big issue for us, so I added support for deleting shuffle
files left behind by released executors.
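If you're on a recent stock Spark, there are also configs aimed at this - a
minimal PySpark sketch, assuming Spark 3.3+ with the external shuffle service
(double-check the property names against your version's docs):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("dynamic-allocation-shuffle-cleanup")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.shuffle.service.enabled", "true")
        # Ask the shuffle service to delete a released executor's shuffle
        # blocks instead of keeping them until the YARN application ends
        # (added around Spark 3.3, so check your version).
        .config("spark.shuffle.service.removeShuffle", "true")
        # Alternative angle: avoid releasing executors that still hold
        # shuffle data needed by active jobs in the first place.
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .getOrCreate()
    )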