When you are running in an environment that you do not fully control, you have to look at everything that is happening and at what the environment provides for your software.
Network connectivity and bandwidth typically vary with machine size. That could affect your setup and contribute to what you are observing. Other applications with high network I/O could also affect your setup if they share the same resources.

You may need to provide additional details, such as your ZooKeeper and broker configuration, how the Kafka ecosystem components are distributed across the virtual machines, and the sizing of the VMs. These details are private, but they can help shed some light on what is happening.

Also make sure you review the reference architectures for your environment and compare them with your setup to confirm that they are in alignment.

I hope this gives you some information to get started with your investigation.

On Fri, Jul 9, 2021 at 5:49 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:

> On Fri, Jul 9, 2021 at 7:35 AM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
> >
> > Since version 2.7.0 we observe the subj. issue with the first start of the
> > Kafka process when we rotate the EC2 instances (for the sake of software
> > upgrade).
> > Our supervisor script notices the failure and tries to start it again in a
> > few seconds, which has always been successful so far.
> >
>
> I should have mentioned that it doesn't happen on _every_ first start, but
> it's often enough to be annoyed by the issue.  E.g. today it happened for
> 17 out of 45 brokers (so around 40%) in a rolling restart.
>
>
> Cheers,
> --
> Alex
>