I forgot to mention this is with Kafka. The topic in question has 20 partitions (across the 6 brokers). This issue seems to pop up every couple of days regardless of message volume (it has happened at both 50k and 10k messages/s on the topic). So far it hasn't hit the same partition more than once, and never more than one partition at a time. Any advice would be greatly appreciated.

Thanks!

On Thu, Apr 16, 2015 at 11:47 AM, Romesh McCullough <romesh.mccullo...@gmail.com> wrote:

> We have a 6 broker cluster running in AWS in 3 availability zones. A few
> times while under slight load (40k messages/second, roughly) we have seen a
> replica try to request a message from the leader by an index that is
> slightly in the future, 3-6 messages usually. When this happens the
> replica throws an error, deletes all of its data for that partition, and
> resyncs from the beginning of the leader. Given that the offset difference
> is so small I suspect a latency/timing issue, but am uncertain what to
> tweak. Thank you in advance for any assistance!
>
> Leader logs:
>
> [2015-04-15 02:07:21,328] ERROR [Replica Manager on Broker 2]: Error when
> processing fetch request for partition [xxx.prod,1] offset 127413332 from
> follower with correlation id 35310725. Possible cause: Request for offset
> 127413332 but we only have log segments in the range 429569 to 127413328.
> (kafka.server.ReplicaManager)
> [2015-04-15 02:07:23,593] INFO Partition [xxx.prod,1] on broker 2:
> Shrinking ISR for partition [xxx.prod,1] from 2,6 to 2
> (kafka.cluster.Partition)
>
> Follower logs:
> ...
> [2015-04-15 02:08:02,085] INFO Scheduling log segment 124662576 for log
> xxx.prod-1 for deletion. (kafka.log.Log)
> [2015-04-15 02:08:02,086] INFO Scheduling log segment 126360465 for log
> xxx.prod-1 for deletion. (kafka.log.Log)
> [2015-04-15 02:08:02,121] WARN [ReplicaFetcherThread-3-2], Replica 6 for
> partition [xxx.prod,1] reset its fetch offset from 429569 to current leader
> 2's start offset 429569 (kafka.server.ReplicaFetcherThread)
> [2015-04-15 02:08:02,131] ERROR [ReplicaFetcherThread-3-2], Current offset
> 127413332 for partition [xxx.prod,1] out of range; reset offset to 429569
> (kafka.server.ReplicaFetcherThread)
>
> Relevant config:
>
> num.network.threads=8
> num.io.threads=8
> socket.send.buffer.bytes=1048576
> socket.receive.buffer.bytes=1048576
> socket.request.max.bytes=104857600
> default.replication.factor=2
> num.replica.fetchers=4
> replica.fetch.max.bytes=1048576
> replica.fetch.wait.max.ms=3000
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.socket.timeout.ms=30000
> replica.socket.receive.buffer.bytes=65536
> replica.lag.time.max.ms=10000
> replica.lag.max.messages=4000
> controller.socket.timeout.ms=30000
> controller.message.queue.size=100000
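
For anyone who wants to quantify how far ahead the follower was, here is a rough sketch that pulls the offsets out of the leader's error line quoted above and computes the gap. The helper name `offset_gap` and the regex are my own assumptions, written against the message wording as it appears in this thread; other Kafka versions may word the error differently.

```python
import re

# Leader log line copied from the report above, joined onto one line.
LEADER_LOG = (
    "[2015-04-15 02:07:21,328] ERROR [Replica Manager on Broker 2]: Error when "
    "processing fetch request for partition [xxx.prod,1] offset 127413332 from "
    "follower with correlation id 35310725. Possible cause: Request for offset "
    "127413332 but we only have log segments in the range 429569 to 127413328. "
    "(kafka.server.ReplicaManager)"
)

# Assumed pattern: matches the "Possible cause" wording seen in this thread only.
PATTERN = re.compile(
    r"Request for offset (\d+) but we only have log segments "
    r"in the range (\d+) to (\d+)"
)


def offset_gap(line):
    """Return (requested, log_start, log_end, gap), or None if the line does not match.

    gap is requested minus log_end, i.e. how many offsets past the leader's
    last reported offset the follower asked for.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    requested, log_start, log_end = (int(g) for g in match.groups())
    return requested, log_start, log_end, requested - log_end


if __name__ == "__main__":
    print(offset_gap(LEADER_LOG))
```

For the line above the gap comes out to 4 (127413332 minus 127413328), which is in line with the "3-6 messages" described in the original report. Running the same function over every occurrence in the leader logs would show whether the gap is always that small.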