My hypothesis for how partition [luke3,3], with leader 11, had its
offset reset to zero after the leader broker was rebooted during
partition reassignment:

The replicas for [luke3,3] were in the process of being reassigned from
brokers 10,11,12 -> 11,12,13 (the reassignment JSON is sketched after
this list)
I rebooted broker 11, which was the leader for [luke3,3]
The logs on brokers 12 and 13 indicate replica fetch failures against
leader 11 due to connection timeouts

Broker 10 attempts to become the leader for [luke3,3] but hits an issue
(I see a ZooKeeper exception, but I'm unsure what is happening)

Broker 11 eventually comes back online and attempts to fetch from the
new leader, broker 10
Broker 11 completes its fetch from leader 10 at offset 0
Broker 10 is the leader but is serving a brand-new data log, so the
offset has been reset
The remaining brokers truncate their logs and follow broker 10
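
For reference, the reassignment entry for this partition would have
looked roughly like this (just a sketch; the real file covered all
eight partitions):

  {"version": 1,
   "partitions": [
     {"topic": "luke3", "partition": 3, "replicas": [11, 12, 13]}
   ]}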

A gist of the logs from brokers 13, 11, and 12 that I think backs up
this summary:
https://gist.github.com/anonymous/cb79dc251d87e334cfff



Thanks,
Luke Forehand |  Networked Insights  |  Software Engineer



On 6/23/14, 5:57 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:

>Hi Luke,
>
>What are the exceptions/warnings you saw in the broker and controller
>logs?
>
>Guozhang
>
>
>On Mon, Jun 23, 2014 at 2:03 PM, Luke Forehand <
>luke.foreh...@networkedinsights.com> wrote:
>
>> Hello,
>>
>> I am testing Kafka 0.8.1.1 in preparation for an upgrade from
>> kafka-0.8.1-beta.  I have a 4-node cluster with one broker per node,
>> and a topic with 8 partitions and 3 replicas.  Each partition has
>> about 6 million records.
>>
>> I generated a partition-reassignment JSON that basically shifts every
>> partition by one broker.  While the reassignment was in progress, I
>> bounced one of the servers.  After the server came back up and the
>> broker started, I waited for the server logs to stop complaining,
>> then ran the reassignment verify script, and all partitions were
>> verified as having completed reassignment.
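>>
>> The execute/verify invocations were along these lines (the ZooKeeper
>> address and file name here are placeholders, not my exact ones):
>>
>>   bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
>>     --reassignment-json-file reassign.json --execute
>>
>>   bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
>>     --reassignment-json-file reassign.json --verify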
>>
>> However, one of the partition offsets was reset to 0, and 4 out of 8
>> partitions had only 2 in-sync replicas instead of 3 (the in-sync
>> count came back to 3, but only after I once again bounced the server
>> I had previously bounced during the reassignment).
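>>
>> For anyone reproducing this: per-partition leader and ISR state can
>> be checked with the topics tool's describe option (ZooKeeper address
>> is a placeholder):
>>
>>   bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic luke3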
>>
>> Is this considered a bug?  I ask because we use the SimpleConsumer
>> API, so we keep track of our own offset "pointers".  If it is not a
>> bug, I could reset the pointer to "earliest" and continue reading,
>> but I was wondering whether there is potential for data loss in my
>> scenario.  I have plenty of logs and can reproduce the issue, but
>> before I spam the list I wanted to ask whether there is already a
>> JIRA task for this, or whether anybody else is aware of it.
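>>
>> For concreteness, looking up "earliest" for one partition with the
>> SimpleConsumer API would go roughly like this (host, port, and client
>> id below are placeholders, not our actual settings):
>>
>>   import java.util.HashMap;
>>   import java.util.Map;
>>   import kafka.api.PartitionOffsetRequestInfo;
>>   import kafka.common.TopicAndPartition;
>>   import kafka.javaapi.OffsetResponse;
>>   import kafka.javaapi.consumer.SimpleConsumer;
>>
>>   public class EarliestOffset {
>>     public static void main(String[] args) {
>>       SimpleConsumer consumer = new SimpleConsumer(
>>           "broker10", 9092, 100000, 64 * 1024, "offsetReset");
>>       try {
>>         TopicAndPartition tp = new TopicAndPartition("luke3", 3);
>>         Map<TopicAndPartition, PartitionOffsetRequestInfo> info =
>>             new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
>>         // EarliestTime() = smallest offset the leader still has on disk
>>         info.put(tp, new PartitionOffsetRequestInfo(
>>             kafka.api.OffsetRequest.EarliestTime(), 1));
>>         OffsetResponse response = consumer.getOffsetsBefore(
>>             new kafka.javaapi.OffsetRequest(
>>                 info, kafka.api.OffsetRequest.CurrentVersion(),
>>                 "offsetReset"));
>>         // This becomes the new "pointer" if we choose to reset
>>         long earliest = response.offsets("luke3", 3)[0];
>>         System.out.println("earliest offset = " + earliest);
>>       } finally {
>>         consumer.close();
>>       }
>>     }
>>   }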
>>
>> Thanks,
>> Luke Forehand |  Networked Insights  |  Software Engineer
>>
>>
>
>
>-- 
>-- Guozhang
