Hi,

We have several small Kafka clusters (0.10.2) of 4 nodes each, running on cloud-hosted VMs (each node on a separate VM). In one particular datacenter, some random partitions go offline intermittently, every few days, always at the same time (8 am); identical clusters in other datacenters work fine. Most partitions have 2 to 3 replicas. Each time partitions go offline they do not auto-recover, and we have to do a rolling restart of the cluster to bring them back.
We suspect that some kind of daily scheduled activity at the VM/hardware level in that datacenter is causing the offline partitions, but we could not find anything suspicious.

Can someone help me understand under what conditions a partition would go offline and be unable to auto-recover even though the trigger seems to be transient, and what kind of external factors (OS, hard disk, network, etc.) could cause that? Is there any logging we can enable to debug this?

Thanks,
Di Shang

--
Australian Development Lab
L9 IBM Centre, 601 Pacific Hwy, St Leonards 2065 NSW Australia
shan...@au1.ibm.com