Hi,

In testing, I’ve got a 5-node Kafka cluster with min.insync.replicas set to 4.
The cluster is currently running version 0.10.0.1.

There’s an application that publishes to a topic, and on restart it attempts
to read the full contents of the topic up to the high watermark before
publishing further messages to the topic. This particular topic routinely
contains between 600,000 and 5 million messages and consists of a single
partition, as we rely on precise ordering. To determine the offset of the
high watermark, we do something similar to the following:

// Do not use offset management
props.put("enable.auto.commit", false);
props.put("auto.offset.reset", "latest");

// ...

consumer.assign(tpList);

// poll once so the consumer establishes a position; with
// auto.offset.reset=latest this should be the high watermark
consumer.poll(100);

long maxOffset = consumer.position(tp) - 1;
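
For context, a rough sketch of the drain step described above (not our actual
code; process(...) is a hypothetical handler and String keys/values are
assumed):

// Rewind to the start of the partition, then consume until the
// position passes the high watermark recorded above.
consumer.seekToBeginning(tpList);
while (consumer.position(tp) <= maxOffset) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records.records(tp)) {
        process(record); // hypothetical handler
    }
}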

This had been working fine for months, and then all of a sudden we found that
maxOffset could be tens or even thousands of offsets away from the real high
watermark. As far as I can tell, the cluster state seems to be OK: there are
no offline partitions or out-of-sync replicas when we see this issue. We are
also not using transactional messages in Kafka.

We found that doing an explicit seekToEnd before checking the position seems
to help, e.g.:

// ...

consumer.assign(tpList);
consumer.poll(100);

consumer.seekToEnd(tpList);
long maxOffset = consumer.position(tp) - 1;
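
For completeness, a minimal self-contained version of the workaround (the
bootstrap servers, topic name, and String keys/values here are placeholders,
not our real configuration):

import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class HighWatermarkCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", false);            // no offset management
        props.put("auto.offset.reset", "latest");

        TopicPartition tp = new TopicPartition("my-topic", 0);   // placeholder topic
        List<TopicPartition> tpList = Collections.singletonList(tp);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(tpList);
            consumer.poll(100);

            // seekToEnd is evaluated lazily; the position() call below is what
            // actually resolves the end offset from the broker.
            consumer.seekToEnd(tpList);
            long maxOffset = consumer.position(tp) - 1;
            System.out.println("maxOffset = " + maxOffset);
        }
    }
}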

But I can’t understand why this is necessary when everything seemed to be
working fine before. I’m now worried that the cluster is in a bad state and
that we’re either not capturing its health correctly or missing some error
messages somewhere.

Keen for any ideas about what this might be, or anything I can try.

Thanks in advance,
Henry