Hi Konstantinos,

Typically the data you are seeing comes from records being spilled to disk 
during groupBy/join operations, when the size of one data set (or of both, in 
the join case) exceeds what will fit in memory.

And yes, these files can get big, e.g. as big as the sum of your input data 
sizes.

If you split your data flow (one data set being processed by multiple 
operators), then the summed temp size can be multiplied.
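For example, here’s a minimal sketch (hypothetical paths, types, and class 
name) of a pipeline where one data set feeds two grouping operators, each of 
which can spill its own sorted copy of the input:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class SplitPipeline {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical input: (key, value) records.
            DataSet<Tuple2<String, Long>> events = env
                    .readCsvFile("file:///data/events.csv")
                    .types(String.class, Long.class);

            // Each aggregation sorts the full input independently, so each
            // can spill its own copy of the data to the temp directories.
            DataSet<Tuple2<String, Long>> sums = events.groupBy(0).sum(1);
            DataSet<Tuple2<String, Long>> maxes = events.groupBy(0).max(1);

            sums.writeAsCsv("file:///data/out/sums");
            maxes.writeAsCsv("file:///data/out/maxes");
            env.execute("split-pipeline");
        }
    }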

You can specify multiple disks to use as temp directories (comma-separated list 
in Flink config), so that’s one way to avoid a single disk becoming too full.
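For example, in flink-conf.yaml (the key is io.tmp.dirs in recent Flink 
versions, while older versions used taskmanager.tmp.dirs; check the docs for 
your version — the paths below are placeholders):

    # Spread spill files across several disks.
    io.tmp.dirs: /disk1/flink-tmp,/disk2/flink-tmp,/disk3/flink-tmp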

You can take a single workflow and break it into multiple pieces that you run 
sequentially, as that can reduce the high-water mark for total spilled files.

You can write intermediate results to a file, instead of relying on spills. 
Though if you use HDFS, and HDFS is using the same disks in your cluster, that 
obviously won’t help, and in fact can be worse due to data replication.
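As a rough sketch (paths, filtering logic, and job structure are all 
hypothetical), you can materialize the intermediate result in one execution and 
pick it up in the next; this also illustrates the earlier point about running a 
workflow as sequential pieces:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.core.fs.FileSystem;

    public class TwoPhaseJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Phase 1: compute the intermediate result and write it to local
            // disk (or any file system not sharing disks with Flink's temp dirs).
            DataSet<String> input = env.readTextFile("file:///data/input");
            DataSet<String> intermediate = input.filter(line -> !line.isEmpty());
            intermediate.writeAsText("file:///data/intermediate",
                    FileSystem.WriteMode.OVERWRITE);
            env.execute("phase-1");

            // Phase 2: read the materialized result back and continue.
            DataSet<String> resumed = env.readTextFile("file:///data/intermediate");
            resumed.writeAsText("file:///data/output",
                    FileSystem.WriteMode.OVERWRITE);
            env.execute("phase-2");
        }
    }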

As far as auto-deletion goes, I don’t think Flink supports this. In our case, 
after a job has run we run a shell script (via ssh) on the slaves to remove 
temp files.
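Something along these lines (the host list file and the temp file pattern are 
assumptions about your setup):

    #!/bin/sh
    # Remove Flink spill directories on every worker after a job finishes.
    # slaves.txt holds one worker hostname per line (hypothetical file).
    while read -r host; do
        ssh "$host" 'rm -rf /tmp/flink-io-*'
    done < slaves.txt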

— Ken

PS - note that logging can also chew up a lot of space if you set the log level 
to DEBUG, due to HTTP wire traffic.
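For example, in conf/log4j.properties (assuming Flink’s default log4j setup), 
keeping the root logger at INFO avoids that:

    # Keep root logging at INFO so DEBUG-level HTTP wire traffic isn't written.
    log4j.rootLogger=INFO, file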

> On Jul 10, 2019, at 3:51 AM, Papadopoulos, Konstantinos 
> <konstantinos.papadopou...@iriworldwide.com> wrote:
> 
> Hi all,
>  
> We are developing several batch processing applications using the DataSet API 
> of Apache Flink.
> For the time being, we are facing an issue with one of our production 
> environments, since its disk usage has increased enormously. After a quick 
> investigation, we concluded that the /tmp/flink-io-{} directory (under the 
> parent directory of the Apache Flink deployment) contains files of more than 
> 1TB, and we need to delete them regularly in order to return our system to 
> its proper functionality. At first sight, there is no significant impact when 
> deleting these temp files. So, I need your help to answer the following 
> questions:
> What kind of data is stored in the aforementioned directory?
> Why do the respective files have such an enormous size?
> How can we limit the size of the data written to the respective directory?
> Is there any way to delete such files automatically when no longer needed?
>  
> Thanks in advance for your help,
> Konstantinos

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
