Piotr Nowojski created FLINK-13203:
--------------------------------------

             Summary: [proper fix] Deadlock occurs when requiring exclusive 
buffer for RemoteInputChannel
                 Key: FLINK-13203
                 URL: https://issues.apache.org/jira/browse/FLINK-13203
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.9.0
            Reporter: Piotr Nowojski


The issue is during requesting exclusive buffers with a timeout. Since 
currently the number of maximum buffers and the number of required buffers are 
not the same for local buffer pools, there may be cases that the local buffer 
pools of the upstream tasks occupy all the buffers while the downstream tasks 
fail to acquire exclusive buffers to make progress. As for 1.9 in 
https://issues.apache.org/jira/browse/FLINK-12852 deadlock was avoided by 
adding a timeout to try to failover the current execution when the timeout 
occurs and tips users to increase the number of buffers in the exception 
message.

In the discussion under the https://issues.apache.org/jira/browse/FLINK-12852 
there were numerous proper solutions discussed and as for now there is no 
consensus how to fix it:

1. Only allocate the minimum per producer, which is one buffer per channel. 
This would be needed to keep the requirement similar to what we have at the 
moment, but it is much less than we recommend for the credit-based network data 
exchange (2* channels + floating)

2a. Coordinate the deployment sink-to-source such that receivers always have 
their buffers first. This will be complex to implement and coordinate and break 
with many assumptions about tasks being independent (coordination wise) on the 
TaskManagers. Giving that assumption up will be a pretty big step and cause 
lot's of complexity in the future.
It will also increase deployment delays. Low deployment delays should be a 
design goal in my opinion, as it will enable other features more easily, like 
low-disruption upgrades, etc.

2b. Assign extra buffers only once all of the tasks are RUNNING. This is a 
simplified version of 2a, without tracking the tasks sink-to-source.

3. Make buffers always revokable, by spilling.
This is tricky to implement very efficiently, especially because there is the 
logic that slices buffers for early sends for the low-latency streaming stuff
the spilling request will come from an asynchronous call. That will probably 
stay like that even with the mailbox, because the main thread will be 
frequently blocked on buffer allocation when this request comes.

4. We allocate the recommended number for good throughput (2*numChannels + 
floating) per consumer and per producer.
No dynamic rebalancing any more. This would increase the number of required 
network buffers in certain high-parallelism scenarios quite a bit with the 
default config. Users can down-configure this by setting the per-channel 
buffers lower. But it would break user setups and require them to adjust the 
config when upgrading.

5. We make the network resource per slot and ask the scheduler to attach 
information about how many producers and how many consumers will be in the 
slot, worst case. We use that to pre-compute how many excess buffers the 
producers may take.
This will also break with some assumptions and lead us to the point that we 
have to pre-compute network buffers in the same way as managed memory. Seeing 
how much pain it is with the managed memory, this seems not so great.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to