Hi

We run several small Kafka clusters (0.10.2) of 4 nodes each on 
cloud-hosted VMs (each node runs on a separate VM). In one particular 
datacenter, some random partitions go offline intermittently, every few 
days, always at the same time (8 am). Identical clusters in other 
datacenters work fine. Most partitions have 2 to 3 replicas. Each time 
partitions go offline they do not auto-recover, and we have to do a 
rolling restart of the cluster to recover them. 

We suspect some kind of daily scheduled activity at the VM/hardware 
level in that particular datacenter is causing the partitions to go 
offline, but we have not found anything suspicious. 

Can someone help me understand under what conditions a partition would 
go offline and be unable to auto-recover even though the trigger seems to 
be transient, and what kind of external factor (OS, hard disk, network, 
etc.) could cause that? Is there any logging we can enable to debug 
this?


Thanks,

Di Shang

--

Australian Development Lab
L9 IBM Centre, 601 Pacific Hwy, St Leonards 2065 NSW Australia
shan...@au1.ibm.com
