I’ve had the same experience as Liam with this symptom (all followers on a single broker of a given leader getting stuck). It sounds likely that either the replica fetcher thread is getting stuck or dying with an unhandled exception.
In the former case, jstack output can be helpful to understand why the fetcher is stuck. There may or may not be a message in the broker logs on the broker that’s failing to get in sync.
In the latter case, there should be evidence in the broker logs on the broker that’s failing to get in sync (and the thread will be notably absent in jstack output).

Ben

On Wed, Apr 1, 2020 at 5:40 PM Liam Clarke <liam.cla...@adscale.co.nz> wrote:

> Hi Zach,
>
> If you check the cluster's controller's controller.log, do you see broker
> 2 bouncing in and out of ISRs? There'll be logs to that effect. Or is it
> just never getting in-sync in the first place?
>
> Whenever I've had this issue in the past, it's been because the replica
> fetcher has died. Hate to say this, but have tried turning broker 2 on and
> off again? It's usually how I've resolved this issue when a broker won't
> stay in ISR. Also make sure that there's enough CPU/network on the machine
> it's running on - we've usually had this issue where CPU was very high or
> the network saturated.
>
> Cheers,
>
> Liam Clarke-Hutchinson
>
> On Thu, Apr 2, 2020 at 8:51 AM Zach Cox <zcox...@gmail.com> wrote:
>
> > Hi Liam,
> >
> > > Any issues with partitions broker 2 is leader of?
> >
> > Earlier today, broker 2 was not leader of any partitions. At that time, 2
> > appeared to be in ISRs of all partitions where 1 was leader, but 2 was not
> > in any ISRs of partitions where 0 was leader.
> >
> > Currently, broker 2 is leader of 55 partitions, but does not appear to be
> > in ISRs of any other partitions, whether 0 or 1 is leader.
> >
> > > Also, have you checked b2's server.log?
> >
> > We don't see any logs that obviously indicate the problem, although we're
> > also not sure what things we should be looking for. There are a few
> > Zookeeper client timeouts, but haven't correlated that with anything yet.
> >
> > Thanks,
> > Zach
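
For the jstack check mentioned above, here is a rough sketch of how one might scan a thread dump for replica fetcher threads. It rests on a few assumptions that are worth verifying against your own dump: that fetcher threads are named like "ReplicaFetcherThread-<fetcherId>-<sourceBrokerId>", and that each thread header is followed by a "java.lang.Thread.State:" line, as HotSpot's jstack typically prints. The function names (find_fetcher_threads, missing_sources) are made up for illustration and are not part of Kafka or the JDK.

```python
import re

# Assumption: fetcher threads are named "ReplicaFetcherThread-<fetcherId>-<sourceBrokerId>",
# possibly with a prefix. Check this against a real dump from your brokers.
HEADER_RE = re.compile(
    r'^"(?P<name>[^"]*ReplicaFetcherThread-(?P<fetcher>\d+)-(?P<broker>\d+))"'
)
# Assumption: the thread state appears on a "java.lang.Thread.State: X" line
# somewhere below the thread header.
STATE_RE = re.compile(r'^\s*java\.lang\.Thread\.State:\s*(?P<state>\w+)')


def find_fetcher_threads(jstack_text):
    """Return (thread_name, source_broker_id, state) for each fetcher thread found."""
    results = []
    current = None  # (name, broker id) of the fetcher header we are inside, if any
    for line in jstack_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = (header.group("name"), int(header.group("broker")))
            continue
        if line.startswith('"'):
            # Header of some other thread: stop tracking the previous fetcher.
            current = None
            continue
        state = STATE_RE.match(line)
        if state and current:
            results.append((current[0], current[1], state.group("state")))
            current = None
    return results


def missing_sources(jstack_text, expected_broker_ids):
    """Return the expected source broker ids that have no fetcher thread in the dump."""
    seen = {broker for _, broker, _ in find_fetcher_threads(jstack_text)}
    return sorted(set(expected_broker_ids) - seen)
```

To try it, save the output of jstack <broker-pid> on the broker that won't get in sync and pass the file contents to these functions. A fetcher thread that is present but sitting in the same stack across several dumps points to the "stuck" case. A source broker id that shows up in missing_sources points to the "died" case, and the broker logs are then the place to look for the exception.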