Hi Abhi,

If your case can be reproduced reliably, have you tried taking a thread dump of the TM that hosts the problematic operator? The thread dump may give us more clues about where the operator is getting stuck.
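In case it is useful, here is a minimal sketch of pulling the dump over the REST API and keeping only the threads that belong to the stuck operator. It assumes the `/taskmanagers/<id>/thread-dump` endpoint and the `threadInfos` / `threadName` / `stringifiedThreadInfo` response fields as I remember them from the REST API docs, so please check them against the 1.16 documentation. The address, TM id and operator name in the comments are placeholders. Running `jstack <pid>` on the TM host works just as well.

```python
import json
import urllib.request


def fetch_thread_dump(rest_url, taskmanager_id):
    """Fetch one TaskManager's thread dump through the JobManager REST API.

    Assumption: the endpoint path below matches your Flink version.
    """
    url = f"{rest_url}/taskmanagers/{taskmanager_id}/thread-dump"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def threads_matching(dump, needle):
    """Return (thread name, stack trace) pairs whose name contains `needle`.

    Task threads are usually named after the operator, so a case-insensitive
    substring match on the operator name is normally enough.
    """
    matches = []
    for info in dump.get("threadInfos", []):
        name = info.get("threadName", "")
        if needle.lower() in name.lower():
            matches.append((name, info.get("stringifiedThreadInfo", "")))
    return matches


# Hypothetical usage (placeholder address, TM id and operator name):
#   dump = fetch_thread_dump("http://localhost:8081", "<taskmanager-id>")
#   for name, stack in threads_matching(dump, "MyStuckOperator"):
#       print(name)
#       print(stack)
```

Taking two or three dumps a minute or so apart and comparing the stack of the same task thread usually shows whether it is blocked at the same place or making slow progress.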
Best,
Biao Geng

Abhi Sagar Khatri via user <user@flink.apache.org> wrote on Tue, Apr 30, 2024 at 19:38:

> Some more context: Our job graph has 5 different Tasks/operators/Flink
> functions, of which we are seeing this issue every time in one particular
> operator.
> We're using unaligned checkpoints. With aligned checkpoints we don't see this
> issue, but the checkpoint duration in that case is very high and causes
> timeouts.
>
> On Tue, Apr 30, 2024 at 3:05 PM Abhi Sagar Khatri <a.kha...@salesforce.com>
> wrote:
>>
>> Hi Flink folks,
>> Our team has been working on a Flink service. After completing the service
>> development, we moved on to the Job Stabilisation exercises at the
>> production load.
>> During high load, we see that if the job restarts (mostly due to the
>> "org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting
>> down"), one of the operators gets stuck in the INITIALISATION state. This
>> happens even when all the required capacity is present and all the TMs are
>> up and running. Other operators that have even higher parallelism than this
>> particular operator initialize fast whilst this particular operator
>> sometimes takes more than 30 minutes.
>> We're operating on Flink 1.16.1.
>>
>> Thank you,
>> Abhi