zhijiang created FLINK-10367:
--------------------------------

             Summary: Avoid recursion stack overflow during releasing 
SingleInputGate
                 Key: FLINK-10367
                 URL: https://issues.apache.org/jira/browse/FLINK-10367
             Project: Flink
          Issue Type: Improvement
          Components: Network
    Affects Versions: 1.6.0, 1.5.3, 1.5.2, 1.5.1, 1.5.0
            Reporter: zhijiang
            Assignee: zhijiang


When a task fails or is canceled, {{SingleInputGate#releaseAllResources}} is 
invoked before the task exits.

{{SingleInputGate#releaseAllResources}} first loops over all input channels to 
release them, then destroys the {{BufferPool}}. Each 
{{RemoteInputChannel#releaseAllResources}} call returns its floating buffers to 
the {{BufferPool}}, which assigns each recycled buffer to one of the other 
registered listeners (each a {{RemoteInputChannel}}).

This process can recurse. If the notified listener has already been released, 
it immediately recycles the buffer back to the {{BufferPool}}, which then takes 
the next listener and notifies it of the available buffer. These steps can 
repeat, each one nested inside the previous call.

If many input channels are registered as listeners in the {{BufferPool}}, the 
recursion becomes deep enough to cause a {{StackOverflow}} error. In one of our 
test jobs, 10,000 input channels were enough to trigger this error.
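
The recycle/notify cycle described above can be illustrated with a toy model. 
This is only a sketch under assumptions: {{ToyBufferPool}}, {{ToyChannel}} and 
{{ToyRecursionDemo}} are hypothetical names, not Flink classes, and the model 
keeps only the one behavior that matters here (a released listener hands the 
buffer straight back to the pool). It counts how deeply the recycle call nests.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy stand-in for a buffer pool; NOT Flink's real BufferPool. */
class ToyBufferPool {
    /** Hypothetical listener interface, for illustration only. */
    interface Listener {
        void onBufferAvailable(ToyBufferPool pool);
    }

    private final Queue<Listener> listeners = new ArrayDeque<>();
    private int depth = 0; // current nesting of recycle() calls
    int maxDepth = 0;      // deepest nesting observed so far

    void register(Listener listener) {
        listeners.add(listener);
    }

    /** A buffer comes back: offer it to the next waiting listener. */
    void recycle() {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        Listener next = listeners.poll();
        if (next != null) {
            next.onBufferAvailable(this); // may call recycle() again
        }
        depth--;
    }
}

/** Toy stand-in for an input channel that was released but is still registered. */
class ToyChannel implements ToyBufferPool.Listener {
    boolean released = false;

    @Override
    public void onBufferAvailable(ToyBufferPool pool) {
        if (released) {
            pool.recycle(); // already released: give the buffer straight back
        }
        // otherwise the channel would keep the buffer (omitted in this sketch)
    }
}

class ToyRecursionDemo {
    /** Registers n released channels, recycles one buffer, returns the call depth. */
    static int maxDepthFor(int n) {
        ToyBufferPool pool = new ToyBufferPool();
        for (int i = 0; i < n; i++) {
            ToyChannel channel = new ToyChannel();
            channel.released = true;
            pool.register(channel);
        }
        pool.recycle();
        return pool.maxDepth;
    }
}
```

In this model a single recycled buffer nests one call per released listener, 
so the depth grows linearly with the number of listeners. With enough of them, 
the same pattern would exhaust the thread stack.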

I see two ways to solve this potential problem:
 # When an input channel is released, it notifies the {{BufferPool}} to 
unregister it as a listener; otherwise the two are left inconsistent.
 # {{SingleInputGate}} destroys the {{BufferPool}} first, then loops to 
release all of its input channels. Destroying the pool removes all of its 
listeners, so the input channels have no further interaction with them during 
{{RemoteInputChannel#releaseAllResources}}.
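
The effect of the release order in the second option can be sketched with a 
small model. Everything here is an assumption made for illustration: 
{{SketchPool}}, {{SketchChannel}} and {{SketchGate}} are hypothetical names, 
not Flink classes, and the last channel is simply assumed to hold one floating 
buffer. The sketch compares the recycle call depth for both orders.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy pool; destroy() drops all listeners. NOT Flink's real BufferPool. */
class SketchPool {
    private final Queue<SketchChannel> listeners = new ArrayDeque<>();
    private boolean destroyed = false;
    private int depth = 0;
    int maxDepth = 0; // deepest nesting of recycle() calls observed

    void register(SketchChannel channel) {
        listeners.add(channel);
    }

    void destroy() {
        destroyed = true;
        listeners.clear();
    }

    void recycle() {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        if (!destroyed) {
            SketchChannel next = listeners.poll();
            if (next != null) {
                next.onBufferAvailable(this);
            }
        }
        depth--;
    }
}

/** Toy channel that may hold one floating buffer. */
class SketchChannel {
    private boolean released = false;
    private boolean holdsFloatingBuffer;

    SketchChannel(boolean holdsFloatingBuffer) {
        this.holdsFloatingBuffer = holdsFloatingBuffer;
    }

    void releaseAllResources(SketchPool pool) {
        released = true;
        if (holdsFloatingBuffer) {
            holdsFloatingBuffer = false;
            pool.recycle(); // return the floating buffer to the pool
        }
    }

    void onBufferAvailable(SketchPool pool) {
        if (released) {
            pool.recycle(); // released listener hands the buffer back
        } else {
            holdsFloatingBuffer = true;
        }
    }
}

class SketchGate {
    /** Releases n channels in the chosen order and returns the recycle depth. */
    static int releaseAndMeasure(int n, boolean destroyPoolFirst) {
        SketchPool pool = new SketchPool();
        List<SketchChannel> channels = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Assumption: only the last channel holds a floating buffer.
            SketchChannel channel = new SketchChannel(i == n - 1);
            channels.add(channel);
            pool.register(channel);
        }
        if (destroyPoolFirst) {
            pool.destroy(); // proposed order: pool first
        }
        for (SketchChannel channel : channels) {
            channel.releaseAllResources(pool);
        }
        if (!destroyPoolFirst) {
            pool.destroy(); // current order: pool last
        }
        return pool.maxDepth;
    }
}
```

In this model, {{SketchGate.releaseAndMeasure(50, false)}} nests 51 recycle 
calls, while {{SketchGate.releaseAndMeasure(50, true)}} stays at depth 1 
because the destroyed pool no longer notifies any listener.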

I prefer the second way, because it avoids adding another interface method 
for removing a buffer listener. In addition, the internal data structure in 
{{BufferPool}} currently cannot remove a listener directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
