I should probably clarify that this is intermittent, and that it is a different subtask ID each time it happens.
On Thu, May 5, 2022 at 4:25 PM Ammon Diether <adiet...@gmail.com> wrote:

> Flink Stateful Functions 3.2.0 (Flink 1.14.3)
> All Java embedded code.
> Parallelism 32
> Standard Stateful Functions Tasks: router -> functions -> feedback
>
> The Router reads from Kinesis and routes to stateful functions. For some
> reason, one and only one of the router subtasks will have a start
> delay of around 60 seconds to 120 seconds. All the other router subtasks
> will be 307ms. During the 120 seconds, all the routers will stop routing
> (looks like backpressure); after the checkpoint is complete, the routers
> will surge read and catch up.
>
> I also get these warnings in some of the taskmanager logs.
>
>> 2022-05-05 13:43:14,118 WARN
>> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl
>> [] - Time from receiving all checkpoint barriers/RPC to executing it
>> exceeded threshold: 132017ms
>
> I am guessing now: it sure seems that one of the router subtasks gets
> behind; the checkpoint barrier gets sent to the subtask, but it takes
> forever for the subtask to process through it.
>
> Any thoughts/insights/suggestions would be appreciated.
>
> [image: image.png]
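
Since the slow subtask changes from one occurrence to the next, it may help to pull the reported delay out of every taskmanager log and line it up against the checkpoint start-delay numbers. Below is a minimal sketch of that, not anything shipped with Flink: the class name `CheckpointDelayScanner` and the method `delayMillis` are hypothetical, and it assumes the warning appears on a single line in the raw log (the line breaks in the quote above come from mail wrapping).

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Flink): extracts the delay reported by the
// SubtaskCheckpointCoordinatorImpl warning quoted in this thread.
public class CheckpointDelayScanner {

    // Assumption: the message text matches the warning quoted above and is
    // not wrapped across lines in the actual taskmanager log file.
    private static final Pattern DELAY = Pattern.compile(
            "Time from receiving all checkpoint barriers/RPC to executing it "
                    + "exceeded threshold: (\\d+)ms");

    // Returns the delay in milliseconds, or empty if the line is not this warning.
    public static OptionalLong delayMillis(String logLine) {
        Matcher m = DELAY.matcher(logLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        String line = "2022-05-05 13:43:14,118 WARN "
                + "org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl "
                + "[] - Time from receiving all checkpoint barriers/RPC to executing it "
                + "exceeded threshold: 132017ms";
        delayMillis(line).ifPresent(
                ms -> System.out.println("barrier-to-execution delay: " + ms + " ms"));
    }
}
```

Running `delayMillis` over each line of each taskmanager log and keeping the timestamp alongside the result should show whether the long delays always coincide with the slow checkpoints, and on which host they land each time.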