I don't think the segment-size will help here.
If I understand the code correctly, we have a fixed number of
segments (# = memory / segment size), and once all segments are full we
spill _all_ current in-memory segments to disk into a single file,
reusing that file for future spills until spilling stops.
So we do not create one file per segment, but at most one file whenever
we run out of memory; since that file is reused, there should be at most
one file per subtask at any point in time.
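
In other words, the pattern looks roughly like this (a simplified
sketch with made-up names, not Flink's actual classes):

    // Rough sketch of the mechanism described above; names are
    // illustrative only and do not match Flink's real code.
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;

    class SpillingSketch {
        private final List<ByteBuffer> segments = new ArrayList<>();
        private RandomAccessFile spillFile; // at most one per buffer/subtask

        SpillingSketch(long memoryBytes, int segmentSize) {
            // fixed number of segments: memory / segment size
            for (long i = 0; i < memoryBytes / segmentSize; i++) {
                segments.add(ByteBuffer.allocate(segmentSize));
            }
        }

        void onAllSegmentsFull() throws IOException {
            if (spillFile == null) {
                // created once on the first spill, then reused
                spillFile = new RandomAccessFile("spill.tmp", "rw");
            }
            FileChannel ch = spillFile.getChannel();
            for (ByteBuffer seg : segments) { // spill ALL in-memory segments
                seg.flip();
                ch.write(seg);
                seg.clear(); // segment can be filled again
            }
        }
    }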
As such, increasing the segment size shouldn't have any effect: you'd
have fewer segments, but the overall memory usage stays the same, so
spilling should occur just as often. More memory probably won't help
either, given that with such a data volume we will end up having to
spill anyway (I suppose).
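
To put numbers on the segment-size point: with, say, 1 GB of sort
memory, 32 KB segments give you 32,768 of them and 128 KB segments give
you 8,192; either way the buffer holds the same 1 GB before it must
spill, so spills (and file creations) happen at the same rate.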
Are there possibly other sources of files being created (anything in
the user code)?
Could it be something annoying like the buffers rapidly switching
between spilling and reading from memory, creating a new file on each
spill and overwhelming the OS?
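
One way to check: count the TaskManager's open descriptors over time.
An untested, Linux-only sketch that could be called from user code
running in the TM (e.g. in a RichFunction's open()):

    // Counts this JVM's open file descriptors by listing /proc/self/fd.
    // Linux-only; returns -1 if the proc filesystem is unavailable.
    import java.io.File;

    class FdCount {
        static int openFileDescriptors() {
            File[] fds = new File("/proc/self/fd").listFiles();
            return fds == null ? -1 : fds.length;
        }

        public static void main(String[] args) {
            System.out.println("open fds: " + openFileDescriptors());
        }
    }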
On 9/17/2020 11:06 PM, Ken Krugler wrote:
Hi all,
When I do a leftOuterJoin(stream, JoinHint.REPARTITION_SORT_MERGE), I’m running
into an IOException caused by too many open files.
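Roughly, the call looks like this (placeholder data and types, not my
actual job):

    import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class OuterJoinSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Tuple2<String, Integer>> left =
                    env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));
            DataSet<Tuple2<String, Integer>> right =
                    env.fromElements(Tuple2.of("a", 10));

            // REPARTITION_SORT_MERGE repartitions and sorts both inputs,
            // which is where the sort buffers spill to disk.
            left.leftOuterJoin(right, JoinHint.REPARTITION_SORT_MERGE)
                    .where(0)
                    .equalTo(0)
                    .with((l, r) -> Tuple2.of(l.f0, r == null ? -1 : r.f1))
                    .returns(Types.TUPLE(Types.STRING, Types.INT))
                    .print();
        }
    }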
The slaves in my YARN cluster (each with 48 slots and 320 GB of memory)
are currently set up with an open-file limit of 32767, so I really don't
want to crank that up much higher.
From perusing the code, I assume the issue is that SpillingBuffer.nextSegment()
can open a writer per segment.
So if I'm joining a lot of data (e.g. several TBs), I can wind up with
> 32K segments open for writing.
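(For scale: 1 TB at 32 KB per segment is already ~33.5 million segments,
so even a small fraction of them holding open writers would blow past
that limit.)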
Does that make sense? And is the best current solution (short of using
more, smaller servers to reduce the per-server load) to increase the
taskmanager.memory.segment-size configuration setting from 32K to
something larger, so that I'd have fewer active segments?
If I do that, are there any known side effects I should worry about?
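
For reference, I assume that would be a one-line change in
flink-conf.yaml (the value here is just an example, not a recommendation):

    taskmanager.memory.segment-size: 128kb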
Thanks,
— Ken
PS - a CoGroup is happening at the same time, so it's possible that's also
hurting (or is maybe the real source of the problem), but the leftOuterJoin
failed first.
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr