Setup:
Flink Stateful Functions 3.2.0 (Flink 1.14.3)
All Java embedded code
Parallelism 32
Standard Stateful Functions tasks: router -> functions -> feedback
The router reads from Kinesis and routes to stateful functions. For some reason, one and only one of the router subtasks has a checkpoint start delay of around 60 to 120 seconds. All the other router subtasks have a start delay of 307ms.

During those 60 to 120 seconds, all the routers stop routing (it looks like backpressure). After the checkpoint completes, the routers read in a surge and catch up.

I also get these warnings in some of the taskmanager logs:

2022-05-05 13:43:14,118 WARN org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 132017ms

I am guessing now: it looks like one of the router subtasks falls behind. The checkpoint barrier is sent to that subtask, but the subtask takes a very long time to process it.

Any thoughts, insights, or suggestions would be appreciated.

[image: image.png]
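To make the symptom concrete, here is a minimal sketch of how the slow subtask could be singled out from the per-subtask start delays of one checkpoint. It is not taken from the job itself: the class name StartDelayStraggler, the method findStraggler, the factor of 10, and the hard-coded delays (including the subtask index 17) are all hypothetical and only mirror the numbers described above. In practice the values would come from the checkpoint details in the Flink web UI.

```java
import java.util.Arrays;

/** Hypothetical helper: flags a subtask whose checkpoint start delay is far above the median. */
final class StartDelayStraggler {

    /**
     * Returns the index of the subtask with the largest start delay if that
     * delay exceeds factor * median, or -1 if no subtask stands out.
     */
    static int findStraggler(long[] startDelaysMs, double factor) {
        if (startDelaysMs.length == 0) {
            return -1;
        }
        // Sort a copy so the caller's array keeps its subtask ordering.
        long[] sorted = startDelaysMs.clone();
        Arrays.sort(sorted);
        // Upper median; good enough for spotting a single outlier.
        long median = sorted[sorted.length / 2];

        // Find the subtask with the largest start delay.
        int worst = 0;
        for (int i = 1; i < startDelaysMs.length; i++) {
            if (startDelaysMs[i] > startDelaysMs[worst]) {
                worst = i;
            }
        }
        return startDelaysMs[worst] > factor * median ? worst : -1;
    }

    public static void main(String[] args) {
        // Illustrative values only: 32 subtasks at 307 ms,
        // with one (index 17, chosen arbitrarily) at 120 s.
        long[] delays = new long[32];
        Arrays.fill(delays, 307L);
        delays[17] = 120_000L;
        System.out.println("straggler subtask index: " + findStraggler(delays, 10.0));
    }
}
```

Knowing which subtask index is slow makes it possible to check whether it is always the same subtask, and which taskmanager and Kinesis shard it maps to.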