Flink Stateful Functions 3.2.0  (Flink 1.14.3)
All functions are embedded Java code.
Parallelism 32
Standard Stateful Functions Tasks:  router -> functions -> feedback

The router reads from Kinesis and routes to the stateful functions.  For some
reason, during a checkpoint exactly one of the router subtasks has a start
delay of roughly 60 to 120 seconds, while all the other router subtasks
show about 307 ms.  During that delay all the routers stop routing (it
looks like backpressure); once the checkpoint completes, the routers
surge-read and catch up.

I also get these warnings in some of the taskmanager logs.

2022-05-05 13:43:14,118 WARN org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 132017ms
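
To see which taskmanagers report this warning and how large the delay is, the logs can be scanned with something like the stand-alone sketch below. It uses only the Java standard library. The class name CheckpointDelayScan and the method extractDelayMs are hypothetical names chosen for illustration, and the pattern only assumes that the message ends with "exceeded threshold: <n>ms" as in the warning above.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CheckpointDelayScan {
    // Matches only the tail of the SubtaskCheckpointCoordinatorImpl warning,
    // so it works whether or not the log line has been wrapped.
    private static final Pattern DELAY =
            Pattern.compile("exceeded threshold: (\\d+)ms");

    /** Returns the reported delay in milliseconds, or empty if the line is not this warning. */
    public static OptionalLong extractDelayMs(String logLine) {
        Matcher m = DELAY.matcher(logLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Sample input: the warning from my logs plus one unrelated line.
        List<String> lines = List.of(
                "2022-05-05 13:43:14,118 WARN org.apache.flink.streaming.runtime.tasks."
                        + "SubtaskCheckpointCoordinatorImpl [] - Time from receiving all "
                        + "checkpoint barriers/RPC to executing it exceeded threshold: 132017ms",
                "some unrelated log line");

        for (String line : lines) {
            // Print the delay in seconds for every line that carries the warning.
            extractDelayMs(line).ifPresent(ms ->
                    System.out.printf("checkpoint execution delayed by %.1f s%n", ms / 1000.0));
        }
    }
}
```

The 132017 ms in the warning is about 132 seconds, which is in the same range as the start delay I see on the slow router subtask.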

My guess at this point: one of the router subtasks falls behind, the
checkpoint barrier is delivered to that subtask, but the subtask takes a
very long time to get around to processing it.

Any thoughts/insights/suggestions would be appreciated.

[image: image.png]
