I forgot to mention this is with Kafka 0.8.2.0.  The topic in question has
20 partitions (across the 6 brokers).  This issue seems to pop up every
couple of days regardless of message volume (it has happened at both 50k
and 10k messages/s on the topic).  So far it hasn't hit the same partition
more than once, and never more than one partition at a time.  Any advice
would be greatly appreciated.
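
As a quick sanity check on how small the gap is, here is a minimal Python
sketch that pulls the offsets out of the leader error quoted below.  The
function name offset_gap and the regex are my own for illustration, not
anything from Kafka, and the message text is copied from the log with the
line wraps joined.

```python
import re

# Leader-side error text copied from the quoted log (line wraps joined).
LEADER_ERROR = (
    "Request for offset 127413332 but we only have log segments "
    "in the range 429569 to 127413328."
)

# Hypothetical pattern matching the wording of the 0.8.2.0 message above.
PATTERN = re.compile(
    r"Request for offset (\d+) but we only have log segments "
    r"in the range (\d+) to (\d+)"
)


def offset_gap(message):
    """Return how far the requested offset is past the leader's last offset."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("not an out-of-range fetch error")
    requested, _log_start, log_end = (int(g) for g in match.groups())
    return requested - log_end


print(offset_gap(LEADER_ERROR))  # 4
```

For this occurrence the follower asked for an offset 4 messages past the end
of the leader's log, which is in line with the 3-6 I mentioned.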

Thanks!

On Thu, Apr 16, 2015 at 11:47 AM, Romesh McCullough <
romesh.mccullo...@gmail.com> wrote:

> We have a 6 broker cluster running in AWS in 3 availability zones.  A few
> times while under slight load (40k messages/second, roughly) we have seen a
> replica try to request a message from the leader by an index that is
> slightly in the future, 3-6 messages usually.  When this happens the
> replica throws an error, deletes all of its data for that partition, and
> resyncs from the beginning of the leader.  Given that the offset difference
> is so small I suspect a latency/timing issue, but am uncertain what to
> tweak.  Thank you in advance for any assistance!
>
> Leader logs:
>
> [2015-04-15 02:07:21,328] ERROR [Replica Manager on Broker 2]: Error when
> processing fetch request for partition [xxx.prod,1] offset 127413332 from
> follower with correlation id 35310725. Possible cause: Request for offset
> 127413332 but we only have log segments in the range 429569 to 127413328.
> (kafka.server.ReplicaManager)
> [2015-04-15 02:07:23,593] INFO Partition [xxx.prod,1] on broker 2:
> Shrinking ISR for partition [xxx.prod,1] from 2,6 to 2
> (kafka.cluster.Partition)
>
> Follower logs:
> ...
> [2015-04-15 02:08:02,085] INFO Scheduling log segment 124662576 for log
> xxx.prod-1 for deletion. (kafka.log.Log)
> [2015-04-15 02:08:02,086] INFO Scheduling log segment 126360465 for log
> xxx.prod-1 for deletion. (kafka.log.Log)
> [2015-04-15 02:08:02,121] WARN [ReplicaFetcherThread-3-2], Replica 6 for
> partition [xxx.prod,1] reset its fetch offset from 429569 to current leader
> 2's start offset 429569 (kafka.server.ReplicaFetcherThread)
> [2015-04-15 02:08:02,131] ERROR [ReplicaFetcherThread-3-2], Current offset
> 127413332 for partition [xxx.prod,1] out of range; reset offset to 429569
> (kafka.server.ReplicaFetcherThread)
>
> Relevant config:
>
> num.network.threads=8
> num.io.threads=8
> socket.send.buffer.bytes=1048576
> socket.receive.buffer.bytes=1048576
> socket.request.max.bytes=104857600
> default.replication.factor=2
> num.replica.fetchers=4
> replica.fetch.max.bytes=1048576
> replica.fetch.wait.max.ms=3000
> replica.high.watermark.checkpoint.interval.ms=5000
> replica.socket.timeout.ms=30000
> replica.socket.receive.buffer.bytes=65536
> replica.lag.time.max.ms=10000
> replica.lag.max.messages=4000
> controller.socket.timeout.ms=30000
> controller.message.queue.size=100000
>
>
