Hi Biao,

Thank you for your response. We have tried looking into thread dumps of the Task Managers before, but that hasn't helped our case. We see that even when all the task slots of that particular operator are stuck in the INITIALIZING state, many of them have already started processing new data. Is there any other way we can approach this?
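In case it adds useful detail, one thing we are planning to try next is logging how long the state restore takes per subtask from inside initializeState(). A rough sketch of what we mean is below; the class name, types and state handling are placeholders, not our actual job code:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical stand-in for the operator that gets stuck in INITIALIZING.
    public class TimedRestoreFunction extends RichMapFunction<String, String>
            implements CheckpointedFunction {

        private static final Logger LOG =
                LoggerFactory.getLogger(TimedRestoreFunction.class);

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            long start = System.currentTimeMillis();
            // ... restore operator/keyed state from 'context' here ...
            LOG.info("Subtask {} finished state restore (wasRestored={}) in {} ms",
                    getRuntimeContext().getIndexOfThisSubtask(),
                    context.isRestored(),
                    System.currentTimeMillis() - start);
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            // nothing extra to snapshot in this sketch
        }

        @Override
        public String map(String value) {
            return value;
        }
    }

The hope is that comparing these per-subtask timings against the overall time the task spends in INITIALIZING would tell us whether the time goes into the state restore itself or somewhere before the function is even invoked.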
On 2024/05/06 03:54:04 Biao Geng wrote:
> Hi Abhi,
>
> If your case can be reproduced steadily, have you ever tried to get
> the thread dump of the TM which the problematic operator resides in?
> Maybe we can get more clues with the thread dump to see where the
> operator is getting stuck.
>
> Best,
> Biao Geng
>
> Abhi Sagar Khatri via user <us...@flink.apache.org> wrote on Tue, Apr 30, 2024 at 19:38:
> >
> > Some more context: Our job graph has 5 different tasks/operators/Flink functions, of which we are seeing this issue every time in one particular operator.
> > We're using unaligned checkpoints. With aligned checkpoints we don't see this issue, but the checkpoint duration in that case is very high and causes timeouts.
> >
> > On Tue, Apr 30, 2024 at 3:05 PM Abhi Sagar Khatri <a....@salesforce.com> wrote:
> >>
> >> Hi Flink folks,
> >> Our team has been working on a Flink service. After completing the service development, we moved on to job stabilisation exercises at production load.
> >> During high load, we see that if the job restarts (mostly due to "org.apache.flink.util.FlinkExpectedException: The TaskExecutor is shutting down"), one of the operators gets stuck in the INITIALIZING state. This happens even when all the required capacity is present and all the TMs are up and running. Other operators that have even higher parallelism than this particular operator initialize fast, whilst this particular operator sometimes takes more than 30 minutes.
> >> We're operating on Flink 1.16.1.
> >>
> >> Thank you,
> >> Abhi
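P.S. For completeness, the checkpointing setup described in the quoted thread is roughly of the following shape; the interval and timeout values here are illustrative placeholders, not our exact production settings:

    import java.time.Duration;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint periodically with exactly-once semantics (interval is a placeholder).
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            // Unaligned checkpoints, since aligned checkpoints took too long and hit timeouts.
            env.getCheckpointConfig().enableUnalignedCheckpoints();

            // Fail the checkpoint if it does not complete within this window (placeholder value).
            env.getCheckpointConfig().setCheckpointTimeout(Duration.ofMinutes(10).toMillis());
        }
    }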