Hi,

In testing I've got a 5-node Kafka cluster with min.insync.replicas set to 4. The cluster is currently running version 0.10.0.1.
There's an application that publishes to a topic. On restart, it attempts to read the full contents of the topic up to the high watermark before publishing any further messages. This particular topic routinely contains between 600,000 and 5 million messages and consists of a single partition, as we rely on precise ordering.

To determine the offset of the high watermark, we do something similar to the following:

    // Do not use offset management
    props.put("enable.auto.commit", false);
    props.put("auto.offset.reset", "latest");
    // ...
    consumer.assign(tpList);
    consumer.poll(100);
    long maxOffset = consumer.position(tp) - 1;

This had been working fine for months, and then all of a sudden we found that maxOffset could be tens or even thousands of offsets away from the real high watermark. As far as I can tell the cluster state seems to be OK: there are no offline partitions or out-of-sync replicas when we see this issue. We are also not using transactional messages in Kafka.

We found that doing an explicit seekToEnd before checking the position seems to help, e.g.:

    // ...
    consumer.assign(tpList);
    consumer.poll(100);
    consumer.seekToEnd(tpList);
    long maxOffset = consumer.position(tp) - 1;

But I can't understand why this is necessary when everything seemed to be working before. I'm now worried the cluster is in a bad state and that we're either not capturing the health status correctly or missing some error messages somewhere.

Keen for any ideas about what this might be, or for things I can try.

Thanks in advance,
Henry
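
P.S. In case it helps, below is a fuller, self-contained sketch of what the consumer does on restart, including the seekToEnd workaround. The bootstrap address, topic name, String deserializers and the rewind/read loop are illustrative placeholders rather than our actual code:

    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class TopicReplay {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            // Do not use offset management
            props.put("enable.auto.commit", "false");
            props.put("auto.offset.reset", "latest");

            // Single-partition topic; name is a placeholder
            TopicPartition tp = new TopicPartition("my-topic", 0);
            List<TopicPartition> tpList = Collections.singletonList(tp);

            KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
            consumer.assign(tpList);
            consumer.poll(100);

            // Workaround: explicitly seek to the end before reading the position
            consumer.seekToEnd(tpList);
            long maxOffset = consumer.position(tp) - 1;

            // Rewind and read everything up to maxOffset before we start publishing again
            consumer.seekToBeginning(tpList);
            long lastSeen = -1;
            while (lastSeen < maxOffset) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    lastSeen = record.offset();
                    // process record ...
                }
            }
            consumer.close();
        }
    }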