You could use it, but keep in mind that it is a blunt instrument: it
forcefully deletes everything older than the TTL. So if you are using a
broadcast variable across streaming batches, that broadcast data will get
deleted as well, and jobs will start failing. You can work around that by
rebroadcasting periodically and using the newly broadcast object rather
than the old one.
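The periodic-rebroadcast workaround can be sketched, independently of Spark,
as a refresh-on-expiry holder. This is only an illustration: the class name
`RefreshingBroadcast` and the `broadcast_fn`/`load_fn` parameters are made up
for the example; in a real job `broadcast_fn` would be `sc.broadcast` and
`load_fn` whatever reloads your data.

```python
import time

class RefreshingBroadcast:
    """Hold a broadcast-like handle and rebroadcast it once it grows older
    than refresh_secs, so a TTL-based cleaner never reaps a live copy.

    broadcast_fn stands in for SparkContext.broadcast; load_fn produces
    the data to broadcast. Both are illustrative parameter names."""

    def __init__(self, broadcast_fn, load_fn, refresh_secs):
        self._broadcast_fn = broadcast_fn
        self._load_fn = load_fn
        self._refresh_secs = refresh_secs
        self._handle = None
        self._born = 0.0

    def get(self):
        now = time.time()
        if self._handle is None or now - self._born >= self._refresh_secs:
            old = self._handle
            # Rebroadcast fresh data and remember when we did it.
            self._handle = self._broadcast_fn(self._load_fn())
            self._born = now
            if old is not None:
                old.unpersist()  # explicitly drop the stale copy
        return self._handle
```

Inside each batch you would call `holder.get().value` instead of capturing one
broadcast handle for the lifetime of the job; set `refresh_secs` comfortably
below the TTL so the live handle is always younger than the cleaner's cutoff.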

On Thu, Oct 1, 2015 at 5:59 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:

> We have limited disk space. So, can we set spark.cleaner.ttl to clean up
> the files? Or is there some other setting that cleans up old temp files?
>
> On Mon, Sep 28, 2015 at 7:02 PM, Shixiong Zhu <zsxw...@gmail.com> wrote:
>
>> These files are created by shuffles and are just temporary files. They are
>> not needed for checkpointing and are stored only in your local temp
>> directory ("/tmp" by default). You can set `spark.local.dir` to point
>> somewhere else if you find "/tmp" doesn't have enough space.
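For example, a spark-defaults.conf fragment along those lines (the paths are
illustrative; pick volumes that actually have headroom on your nodes):

```
# Point shuffle/scratch files at roomy volumes instead of /tmp.
# Multiple comma-separated directories spread the I/O across disks.
spark.local.dir    /data1/spark-tmp,/data2/spark-tmp

# The brute-force reaper discussed earlier in this thread: delete metadata
# older than this (here 1 hour), including live broadcasts -- use with care.
spark.cleaner.ttl  3600
```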
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-09-29 1:04 GMT+08:00 swetha <swethakasire...@gmail.com>:
>>
>>>
>>> Hi,
>>>
>>> I see a lot of data from my streaming job filling up local disk on my
>>> nodes, as shown below. I have my checkpoint directory set to HDFS, but I
>>> still see the following data filling my local nodes. Is there any way to
>>> have this data stored in HDFS instead of locally?
>>>
>>> -rw-r--r--  1        520 Sep 17 18:43 shuffle_23119_5_0.index
>>> -rw-r--r--  1  180564255 Sep 17 18:43 shuffle_23129_2_0.data
>>> -rw-r--r--  1  364850277 Sep 17 18:45 shuffle_23145_8_0.data
>>> -rw-r--r--  1  267583750 Sep 17 18:46 shuffle_23105_4_0.data
>>> -rw-r--r--  1  136178819 Sep 17 18:48 shuffle_23123_8_0.data
>>> -rw-r--r--  1  159931184 Sep 17 18:48 shuffle_23167_8_0.data
>>> -rw-r--r--  1        520 Sep 17 18:49 shuffle_23315_7_0.index
>>> -rw-r--r--  1        520 Sep 17 18:50 shuffle_23319_3_0.index
>>> -rw-r--r--  1   92240350 Sep 17 18:51 shuffle_23305_2_0.data
>>> -rw-r--r--  1   40380158 Sep 17 18:51 shuffle_23323_6_0.data
>>> -rw-r--r--  1  369653284 Sep 17 18:52 shuffle_23103_6_0.data
>>> -rw-r--r--  1  371932812 Sep 17 18:52 shuffle_23125_6_0.data
>>> -rw-r--r--  1   19857974 Sep 17 18:53 shuffle_23291_19_0.data
>>> -rw-r--r--  1   55342005 Sep 17 18:53 shuffle_23305_8_0.data
>>> -rw-r--r--  1   92920590 Sep 17 18:53 shuffle_23303_4_0.data
>>>
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-job-filling-a-lot-of-data-in-local-spark-nodes-tp24846.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>