Hi Ismael,

Please find our answers to your questions inline below:
1. Please share the code for the test script.
[Comment: We are publishing events using the kafka-rest *POST /topics/<topic-name>* API, and a JMeter script calls the API to publish events continuously for 2 hrs. The "key" value for every event is constant so that we can check which partition the events are published to. A rough sketch of the request is included further down in this mail.]

2. At which point in the sequence below was the code for the brokers updated to 0.10.2?
[Comment: On the Kafka servers, we have the confluent-3.0.0 and confluent-3.2.2 packages deployed separately. First, to pin the protocol and message format version to 0.10.0, we updated the server.properties file of the running confluent-3.0.0 package and restarted the service. Then, for the bump of the protocol and message format version to 0.10.2, we modified server.properties in confluent-3.2.2, stopped the old package's services, and started the Kafka services from the new package. All restarts were done in a rolling fashion, in a random broker.id sequence (4, 3, 2, 1).]

3. When doing a rolling restart, it's generally a good idea to ensure that there are no under-replicated partitions.
[Comment: Yes, after every restart we waited for the required in-sync replicas to come back.]

4. Is controlled shutdown completing successfully?
[Comment: Yes. We are stopping and starting the Kafka services using the kafka-server-stop and kafka-server-start scripts.]

We are seeing some exceptions in the kafka-rest logs, like:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
2017-09-20 10:16:49 ERROR ProduceTask:71 - Producer error for request io.confluent.kafkarest.ProduceTask@228c0e7e
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
2017-09-20 10:17:19 ERROR ProduceTask:71 - Producer error for request io.confluent.kafkarest.ProduceTask@7d68db9d
org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for student-activity-3
2017-09-20 10:17:19 ERROR ProduceTask:71 - Producer error for request io.confluent.kafkarest.ProduceTask@3bd78e12

We hope those exceptions are expected while the Kafka servers are being rolling-restarted and data is being published.

Also, we have tried the upgrade with explicit producer properties set, like
    producer.acks=all
    producer.retries=1
but the issue is still the same.

Thanks,
Yogesh

On Tue, Sep 19, 2017 at 6:48 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Yogesh,
>
> A few questions:
>
> 1. Please share the code for the test script.
> 2. At which point in the sequence below was the code for the brokers
> updated to 0.10.2?
> 3. When doing a rolling restart, it's generally a good idea to ensure that
> there are no under-replicated partitions.
> 4. Is controlled shutdown completing successfully?
>
> Ismael
>
> On Tue, Sep 19, 2017 at 12:33 PM, Yogesh Sangvikar <
> yogesh.sangvi...@gmail.com> wrote:
>
> > Hi Team,
> >
> > Thanks for providing comments.
> >
> > Here are more details on the steps followed for the upgrade.
> >
> > Cluster details: We are using a 4-node Kafka cluster and topics with
> > replication factor 3. For the upgrade test, we are using a topic with 5
> > partitions and replication factor 3.
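(As referenced in the answer to question 1 above, here is a minimal sketch of the kind of request the JMeter script issues against kafka-rest, assuming the JSON embedded format; the host, key and value below are illustrative placeholders, not the actual test payload:)

    curl -X POST http://<rest-proxy-host>:8082/topics/student-activity \
      -H "Content-Type: application/vnd.kafka.json.v1+json" \
      -d '{"records":[{"key":"fixed-key","value":{"event":"sample","seq":1}}]}'

Because every record carries the same key, the producer's default partitioner hashes it to the same partition each time (partition 3 in this test), which is what makes the per-partition offset comparison below possible.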
> >
> > Topic:student-activity  PartitionCount:5  ReplicationFactor:3  Configs:
> >     Topic: student-activity  Partition: 0  Leader: 4  Replicas: 4,2,3  Isr: 4,2,3
> >     Topic: student-activity  Partition: 1  Leader: 1  Replicas: 1,3,4  Isr: 1,4,3
> >     Topic: student-activity  Partition: 2  Leader: 2  Replicas: 2,4,1  Isr: 2,4,1
> >     Topic: student-activity  Partition: 3  Leader: 3  Replicas: 3,1,2  Isr: 1,2,3
> >     Topic: student-activity  Partition: 4  Leader: 4  Replicas: 4,3,1  Isr: 4,1,3
> >
> > We are using a test script to publish events continuously to one of the
> > topic partitions (here, partition 3) and monitoring the script's total
> > published-events count against the partition 3 offset value.
> >
> > [Note: The topic partition offset counts may differ between the CLI utility
> > and the screenshots due to capture delay.]
> >
> > - First, we rolling-restarted all Kafka brokers to explicitly set the
> >   protocol and message format version to 0.10.0:
> >   inter.broker.protocol.version=0.10.0
> >   log.message.format.version=0.10.0
> >
> > - During this restart, events were published as expected, the counters
> >   kept increasing, and in-sync replicas came back immediately after each
> >   restart.
> >
> > [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class kafka.tools.GetOffsetShell
> >   --broker-list ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
> >   --topic student-activity --time -1
> > student-activity:2:1
> > student-activity:4:1
> > student-activity:1:68
> > student-activity:3:785
> > student-activity:0:1
> >
> > [image: Inline image 1]
> >
> > - Next, we rolling-restarted the Kafka brokers for
> >   "inter.broker.protocol.version=0.10.2" in the broker sequence below.
> >   (Note that the test script keeps publishing events to the topic
> >   partition continuously.)
> >
> > - Restarted the server with broker.id = 4:
> >
> > [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class kafka.tools.GetOffsetShell
> >   --broker-list ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
> >   --topic student-activity --time -1
> > student-activity:2:1
> > student-activity:4:1
> > student-activity:1:68
> > student-activity:3:1189
> > student-activity:0:1
> >
> > [image: Inline image 2]
> >
> > - Restarted the server with broker.id = 3:
> >
> > [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class kafka.tools.GetOffsetShell
> >   --broker-list ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
> >   --topic student-activity --time -1
> > student-activity:2:1
> > student-activity:4:1
> > student-activity:1:68
> > *student-activity:3:1430*
> > student-activity:0:1
> >
> > [image: Inline image 3]
> >
> > - Restarted the server with broker.id = 2 (here, observe that the
> >   partition 3 offset count has decreased from the offset after the last
> >   restart):
> >
> > [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class kafka.tools.GetOffsetShell
> >   --broker-list ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
> >   --topic student-activity --time -1
> > student-activity:2:1
> > student-activity:4:1
> > student-activity:1:68
> > *student-activity:3:1357*
> > student-activity:0:1
> >
> > [image: Inline image 4]
> >
> > - Restarted the last server, with broker.id = 1:
> >
> > [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class kafka.tools.GetOffsetShell
> >   --broker-list ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
> >   --topic
> >   student-activity --time -1
> > student-activity:2:1
> > student-activity:4:1
> > student-activity:1:68
> > student-activity:3:1613
> > student-activity:0:1
> >
> > [image: Inline image 5]
> >
> > - Finally, we rolling-restarted all brokers (in the same sequence as above)
> >   for "log.message.format.version=0.10.2".
> >
> > [image: Inline image 6]
> > [image: Inline image 7]
> > [image: Inline image 8]
> > [image: Inline image 9]
> >
> > - The topic offset counts after the final restart:
> >
> > [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class kafka.tools.GetOffsetShell
> >   --broker-list ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
> >   --topic student-activity --time -1
> > student-activity:2:1
> > student-activity:4:1
> > student-activity:1:68
> > student-activity:3:2694
> > student-activity:0:1
> >
> > - And the topic offset counts after stopping the event-publishing script:
> >
> > [***@***.***.***.*** confluent-3.2.2]$ ./bin/kafka-run-class kafka.tools.GetOffsetShell
> >   --broker-list ***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092,***.***.***.***:9092
> >   --topic student-activity --time -1
> > student-activity:2:1
> > student-activity:4:1
> > student-activity:1:68
> > student-activity:3:2769
> > student-activity:0:1
> >
> > - Calculating the missing events count:
> >   Total events published by the script to partition 3 : *3090*
> >   Offset count on partition 3                          : *2769*
> >   Missing events count                                 : 3090 - 2769 = *321*
> >
> > Based on the above observations during the rolling restart for the
> > protocol version:
> >
> > 1. The partition 3 leader changed to in-sync replica 2 (still on the older
> > protocol version), and the upgraded replicas (3 & 4) were missing from the
> > in-sync replica list.
> > 2. And, once we took server 2 down for the upgrade, replicas 3 & 4 suddenly
> > appeared in the in-sync replica list and the partition offset count reset.
> > 3. After servers 2 & 1 were upgraded, 3 in-sync replicas were shown for
> > partition 3, but the missing-events lag was not recovered.
> >
> > Please let us know your comments on our observations and correct us if we
> > are missing any upgrade steps.
> >
> > Thanks,
> > Yogesh
> >
> > On Tue, Sep 19, 2017 at 2:07 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> >> Hi Scott,
> >>
> >> There is nothing preventing a replica running a newer version from being
> >> in sync as long as the instructions are followed (i.e.
> >> inter.broker.protocol.version has to be set correctly and, if there's a
> >> message format change, log.message.format.version). That's why I asked
> >> Yogesh for more details. The upgrade path he mentioned (0.10.0 -> 0.10.2)
> >> is straightforward, there isn't a message format change, so only
> >> inter.broker.protocol.version needs to be set.
> >>
> >> Ismael
> >>
> >> On Mon, Sep 18, 2017 at 5:50 PM, Scott Reynolds <
> >> sreyno...@twilio.com.invalid> wrote:
> >>
> >> > Can we get some clarity on this point:
> >> > >older version leader is not allowing newer version replicas to be in
> >> > >sync, so the data pushed using this older version leader
> >> >
> >> > That is super scary.
> >> >
> >> > What protocol version is the older version leader running?
> >> >
> >> > Would this happen if you are skipping a protocol version bump?
> >> >
> >> > On Mon, Sep 18, 2017 at 9:33 AM Ismael Juma <ism...@juma.me.uk> wrote:
> >> >
> >> > > Hi Yogesh,
> >> > >
> >> > > Can you please clarify what you mean by "observing data loss"?
> >> > >
> >> > > Ismael
> >> > >
> >> > > On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
> >> > > yogesh.sangvi...@gmail.com> wrote:
> >> > >
> >> > > > Hi Team,
> >> > > >
> >> > > > Please help us find a resolution for the Kafka rolling upgrade
> >> > > > issue below.
> >> > > >
> >> > > > Thanks,
> >> > > > Yogesh
> >> > > >
> >> > > > On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh Sangvikar
> >> > > > wrote:
> >> > > >>
> >> > > >> Hi Team,
> >> > > >>
> >> > > >> Currently, we are using a Confluent 3.0.0 Kafka cluster in our
> >> > > >> production environment, and we are planning to upgrade the cluster to
> >> > > >> Confluent 3.2.2. We have topics with millions of records and data is
> >> > > >> continuously published to those topics. We are also using other
> >> > > >> Confluent services like schema-registry, Kafka Connect and kafka-rest
> >> > > >> to process the data.
> >> > > >>
> >> > > >> So, we can't afford a downtime upgrade of the platform.
> >> > > >>
> >> > > >> We have tried a rolling Kafka upgrade in our development environment,
> >> > > >> as suggested in the documentation below:
> >> > > >>
> >> > > >> https://docs.confluent.io/3.2.2/upgrade.html
> >> > > >> https://kafka.apache.org/documentation/#upgrade
> >> > > >>
> >> > > >> But we are observing data loss on topics while doing the rolling
> >> > > >> upgrade / restart of the Kafka servers for
> >> > > >> "inter.broker.protocol.version=0.10.2".
> >> > > >>
> >> > > >> Based on our observation, we suspect the root cause of the data loss
> >> > > >> is as follows (explained for a topic partition having 3 replicas):
> >> > > >>
> >> > > >>    - As the Kafka broker protocol version is updated from 0.10.0 to
> >> > > >>    0.10.2 in a rolling fashion, the in-sync replicas on the older
> >> > > >>    version do not allow the updated (0.10.2) replicas to be in sync
> >> > > >>    until all of them are updated.
> >> > > >>    - Also, we have explicitly disabled the
> >> > > >>    "unclean.leader.election.enable" property, so only in-sync replicas
> >> > > >>    can be elected as leader for a given partition. (The settings
> >> > > >>    involved are sketched right after this list.)
> >> > > >>    - While doing the rolling update, as mentioned above, the
> >> > > >>    older-version leader does not allow the newer-version replicas to
> >> > > >>    be in sync, so data pushed through this older-version leader is not
> >> > > >>    synced to the other replicas. If this (older-version) leader goes
> >> > > >>    down for the upgrade, the other updated replicas are shown in the
> >> > > >>    in-sync column and one of them becomes leader, but they lag behind
> >> > > >>    the old-version leader and only show the offset of the data they
> >> > > >>    have synced so far.
> >> > > >>    - And, once the last replica comes up with the updated version, it
> >> > > >>    will start syncing data from the current leader.
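(For reference, a minimal sketch of the settings involved in the list above; the min.insync.replicas line and the higher retries value are illustrative assumptions, not values quoted anywhere in this thread:)

    # broker side (server.properties)
    # bump inter.broker.protocol.version to 0.10.2 only after all brokers run the 3.2.2 code
    unclean.leader.election.enable=false
    inter.broker.protocol.version=0.10.0
    log.message.format.version=0.10.0
    # assumption: not mentioned in the thread, but commonly paired with acks=all
    # so a write is not acknowledged by a lone replica
    min.insync.replicas=2

    # kafka-rest side (kafka-rest.properties) - producer overrides
    producer.acks=all
    # the test above used retries=1; a higher value helps ride out leader moves
    producer.retries=10

With acks=all plus a min.insync.replicas of 2, an acknowledged write has to exist on at least two replicas, which narrows the window in which a leader change during the rolling restart can drop acknowledged records.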
> >> > > >>
> >> > > >>
> >> > > >> Please let us know your comments on our observation and suggest the
> >> > > >> proper way to do a rolling Kafka upgrade, as we can't afford downtime.
> >> > > >>
> >> > > >> Thanks,
> >> > > >> Yogesh
> >> > > >>
> >> > > >
> >> > >
> >> > --
> >> > Scott Reynolds
> >> > Principal Engineer
> >> > [image: twilio] <http://www.twilio.com/?utm_source=email_signature>
> >> > MOBILE (630) 254-2474
> >> > EMAIL sreyno...@twilio.com
> >> >
> >>
> >
>
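(For the under-replicated-partitions check discussed in question 3 near the top of this thread, a minimal sketch of what can be run between broker restarts; the ZooKeeper address is a placeholder:)

    # should print nothing before the next broker is restarted
    ./bin/kafka-topics --describe --zookeeper <zk-host>:2181 --under-replicated-partitions

    # per-topic view of leader and ISR, as used earlier in the thread
    ./bin/kafka-topics --describe --zookeeper <zk-host>:2181 --topic student-activity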