I don't think increasing the segment size will help here.

If I understand the code correctly, we have a fixed number of segments (count = memory / segment size), and once all segments are full we spill _all_ segments currently in memory to disk into a single file, re-using that file for future spilling until spilling stops.

So we do not create one file per segment, but one file per occurrence of memory running out, and there should be at most one file per subtask at any point.

As such, increasing the segment size shouldn't have any effect; you get fewer segments, but the overall memory usage stays the same, and spilling should occur just as often. More memory probably won't help either, since with such a data volume we will end up having to spill anyway (I suppose).
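
To make that concrete, below is a tiny toy model (plain Java, not the actual SpillingBuffer code, and the class/file names are made up) of the behaviour I'm describing: a fixed pool of in-memory segments, plus a single spill file per buffer instance that is created the first time memory runs out and re-used afterwards.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Toy illustration only; NOT the Flink SpillingBuffer implementation.
    class ToySpillingBuffer {
        private final int maxSegmentsInMemory;      // = available memory / segment size
        private final List<byte[]> inMemory = new ArrayList<>();
        private FileOutputStream spillFile;         // opened at most once => one file per buffer

        ToySpillingBuffer(int maxSegmentsInMemory) {
            this.maxSegmentsInMemory = maxSegmentsInMemory;
        }

        void addSegment(byte[] segment) throws IOException {
            if (spillFile == null && inMemory.size() < maxSegmentsInMemory) {
                inMemory.add(segment);              // still fits into the fixed pool
                return;
            }
            if (spillFile == null) {                // memory exhausted for the first time:
                spillFile = new FileOutputStream("toy-spill.tmp");
                for (byte[] s : inMemory) {
                    spillFile.write(s);             // spill everything accumulated so far
                }
                inMemory.clear();
            }
            spillFile.write(segment);               // later segments go straight into the same file
        }

        void close() throws IOException {
            if (spillFile != null) {
                spillFile.close();
            }
        }
    }

Note that no matter how many segments are added, at most one file is ever opened per buffer instance; changing the segment size only changes how long it takes to get there.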

Are there possibly other sources of files being created (anything in the user code)? Could it be something annoying like the buffers rapidly switching between spilling and reading from memory, creating a new file on each spill and overwhelming the OS?
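
If it helps with narrowing that down, one way to check whether the descriptors really do pile up inside the task manager JVMs is to log the process's open-FD count from somewhere in the user code, e.g. a rich function's open(). A minimal sketch (relies on the HotSpot/OpenJDK-specific UnixOperatingSystemMXBean, so it won't work on every JVM):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdCount {
        // Returns the number of file descriptors currently open in this JVM process,
        // or -1 if the JVM/OS does not expose that information.
        public static long openFileDescriptors() {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            if (os instanceof UnixOperatingSystemMXBean) {
                return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
            }
            return -1;
        }
    }

Logging that value periodically should show whether the count creeps up towards the 32767 limit, and roughly where in the job it happens.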

On 9/17/2020 11:06 PM, Ken Krugler wrote:
Hi all,

When I do a leftOuterJoin(stream, JoinHint.REPARTITION_SORT_MERGE), I’m running 
into an IOException caused by too many open files.

The slaves in my YARN cluster (each with 48 slots and 320GB memory) are 
currently set up with an open-file limit of 32767, so I really don't want to crank 
this up much higher.

In perusing the code, I assume the issue is that SpillingBuffer.nextSegment() 
can open a writer per segment.

So if I have a lot of data (e.g. several TBs) that I'm joining, I can wind up with 
more than 32K segments that are open for writing.

Does that make sense? And is the best solution currently (without, say, using 
more, smaller servers to reduce the per-server load) to increase the 
taskmanager.memory.segment-size configuration setting from 32K to something 
larger, so I'd have fewer active segments?
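
(To be explicit, by increasing that setting I mean something like the following line in flink-conf.yaml, where 128kb is just an illustrative value, not a recommendation:)

    taskmanager.memory.segment-size: 128kb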

If I do that, any known side-effects I should worry about?

Thanks,

— Ken

PS - a CoGroup is happening at the same time, so it’s possible that’s also 
hurting (or maybe the real source of the problem), but the leftOuterJoin failed 
first.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


