Our use case is as follows:
We repartition six months' worth of data for each client on clientId and
recordcreationdate, so that the job writes one file per partition. Our
output partitioning is on clientId and recordcreationdate.
The job fills up the disk after it processes, say, 30 tenants out of 50. I
am looking for a way to clean up the shuffle files left behind by tenants
that have already finished.
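For reference, a minimal sketch of that kind of job (the paths, app name,
and session setup are assumptions for illustration; only the clientId and
recordcreationdate columns come from this thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("RepartitionPerTenant") // hypothetical app name
      .getOrCreate()

    // Hypothetical input path holding one client's six months of data.
    val df = spark.read.parquet("/data/input")

    // Repartition on the same columns the output is partitioned by, so
    // each (clientId, recordcreationdate) partition is written as a
    // single file. This shuffle is what leaves files on local disk
    // until they are cleaned up.
    df.repartition(col("clientId"), col("recordcreationdate"))
      .write
      .partitionBy("clientId", "recordcreationdate")
      .mode("overwrite")
      .parquet("/data/output")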
There's also a second, newer mechanism that uses a TTL for cleanup of
shuffle files (a config sketch follows below).
Can you share more about your use case?
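If the TTL mechanism meant here is the standalone worker cleanup (an
assumption on my part; the thread doesn't name it), it's controlled by
these settings in spark-defaults.conf on the workers, and note that it
only removes directories of stopped applications:

    # Standalone mode only: periodically delete old application work
    # directories (including their shuffle files) on each worker.
    # Only directories of stopped applications are removed.
    spark.worker.cleanup.enabled     true
    spark.worker.cleanup.interval    1800    # seconds between cleanup runs
    spark.worker.cleanup.appDataTtl  604800  # keep app data for 7 days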
On Mon, Sep 14, 2020 at 1:33 PM Edward Mitchell wrote:
We've also had some similar disk fill issues.
For Java/Scala RDDs, shuffle file cleanup is done as part of JVM garbage
collection. I've noticed that if RDDs maintain references in the code and
cannot be garbage collected, then the intermediate shuffle files hang
around.
Best way to handle this is by structuring the code so that RDD references
go out of scope as soon as they're no longer needed, which lets them be
garbage collected and their shuffle files cleaned up.
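A rough sketch of that approach (the tenant loop and paths are
hypothetical; spark.cleaner.periodicGC.interval is a real setting that
controls how often Spark's ContextCleaner triggers a GC):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("LongRunningJob") // hypothetical app name
      // How often the ContextCleaner triggers a JVM GC so that
      // unreachable RDD state (including shuffle files) gets cleaned up.
      .config("spark.cleaner.periodicGC.interval", "15min")
      .getOrCreate()

    // Hypothetical tenant list and per-tenant write, for illustration.
    val tenants = Seq("tenant01", "tenant02")

    tenants.foreach { tenant =>
      // Keep per-tenant DataFrame references local to this scope so they
      // become unreachable, and GC-eligible, once the tenant is finished.
      val df = spark.read.parquet(s"/data/input/$tenant")
      df.repartition(df("clientId"), df("recordcreationdate"))
        .write
        .partitionBy("clientId", "recordcreationdate")
        .mode("overwrite")
        .parquet(s"/data/output/$tenant")

      // Suggest a GC between tenants; Spark removes shuffle files for
      // RDDs whose driver-side objects have been collected. Note that
      // System.gc() is only a hint, not guaranteed to run immediately.
      System.gc()
    }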
Hi,
I have a long-running application, and Spark seems to fill up the disk
with shuffle files. Eventually the job fails after running out of disk
space. Is there a way for me to clean up the shuffle files?
Thanks
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/