Anton Kalashnikov created FLINK-23041:
-----------------------------------------

             Summary: Change local alignment timeout back to the global time out
                 Key: FLINK-23041
                 URL: https://issues.apache.org/jira/browse/FLINK-23041
             Project: Flink
          Issue Type: Bug
            Reporter: Anton Kalashnikov
            Assignee: Anton Kalashnikov


Local alignment timeouts are very confusing and especially without timeout on 
the outputs, they can significantly delay timeouting to UC.

Problematic case is when all CBs are received with long delay because of the 
back pressure, but they arrive at the same time. Alignment time can be low 
(milliseconds), while start delay is ~1 minute. In that case checkpoint doesn't 
timeout to UC and is passing the responsibility to timeout down the stream.

 

So it is not so transparant for the user why and when AC switches to UC. As 
mentioned before, the start delay is not correlated with the alignment timeout 
because it doesn't take into account time in output buffer. the alignment time 
is not fully correlated with the alignment timeout because the alignment time 
doesn't take into account the barrier announcement.

 

Based on this, there is the proposal to change the semantic of alignmentTimeout 
configuration to such meaning:

*The time between the starting of checkpoint(on the checkpont coordinator) and 
the time when the checkpoint barrier will be received by task.*

By this definition, we will have kind of global timeout which says that if the 
AC isn't finished for alignmentTimeout time it will be switched to UC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to