Re: /tmp fills up to 100GB when using a window function

2017-12-20 Thread Vadim Semenov
Ah, yes, I missed that part. It's `spark.local.dir`. From the configuration docs:

> spark.local.dir (default: /tmp) — Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on
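The property quoted above can be set in `spark-defaults.conf` (or passed with `--conf` on `spark-submit`); a minimal sketch, with hypothetical scratch paths that would have to exist on every worker node:

```properties
# spark-defaults.conf
# Hypothetical paths; each directory must exist on every worker node.
# Note: on YARN this is superseded by the NodeManager's local dirs, and
# standalone workers honor the SPARK_LOCAL_DIRS environment variable.
spark.local.dir  /data/scratch1,/data/scratch2
```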

Re: /tmp fills up to 100GB when using a window function

2017-12-20 Thread Gourav Sengupta
I do think that there is an option to set the temporary shuffle location to a particular directory. While working with EMR I set it to /mnt1/. Let me know in case you are not able to find it.

On Mon, Dec 18, 2017 at 8:10 PM, Mihai Iacob wrote:
> This code generates files under /tmp...blockmgr...
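Gourav's EMR suggestion, sketched as a hedged spark-submit invocation (the job name and path are hypothetical; on EMR the instance-store volumes are typically mounted at /mnt, /mnt1, ...):

```shell
# Redirect Spark scratch space to an instance-store volume
# instead of the root volume's /tmp.
spark-submit \
  --conf spark.local.dir=/mnt1/spark-scratch \
  my_job.py
```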

Re: /tmp fills up to 100GB when using a window function

2017-12-19 Thread Vadim Semenov
On Dec 19, 2017, at 10:08 AM, Mihai Iacob wrote:
> When does spark remove them?
>
> Regards,
> Mihai Iacob
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics

Re: /tmp fills up to 100GB when using a window function

2017-12-19 Thread Mihai Iacob
When does spark remove them?

Regards,
Mihai Iacob
DSX Local - Security, IBM Analytics

Re: /tmp fills up to 100GB when using a window function

2017-12-19 Thread Vadim Semenov
Spark doesn't remove intermediate shuffle files if they're part of the same job.

On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob wrote:
> This code generates files under /tmp...blockmgr... which do not get
> cleaned up after the job finishes.
>
> Anything wrong with the code below? or are there any

/tmp fills up to 100GB when using a window function

2017-12-18 Thread Mihai Iacob
This code generates files under /tmp...blockmgr... which do not get cleaned up after the job finishes.

Anything wrong with the code below? Or are there any known issues with spark not cleaning up /tmp files?

    window = Window.\
        partitionBy('***', 'date_str').\
        orderBy(