Can we get some clarity on this point:

> older version leader is not allowing newer version replicas to be in
> sync, so the data pushed using this older version leader
That is super scary. What protocol version is the older-version leader
running? Would this happen if you were skipping a protocol version bump?

On Mon, Sep 18, 2017 at 9:33 AM Ismael Juma <ism...@juma.me.uk> wrote:

> Hi Yogesh,
>
> Can you please clarify what you mean by "observing data loss"?
>
> Ismael
>
> On Mon, Sep 18, 2017 at 5:08 PM, Yogesh Sangvikar <
> yogesh.sangvi...@gmail.com> wrote:
>
> > Hi Team,
> >
> > Please help us find a resolution for the Kafka rolling-upgrade issue
> > described below.
> >
> > Thanks,
> >
> > Yogesh
> >
> > On Monday, September 18, 2017 at 9:03:04 PM UTC+5:30, Yogesh Sangvikar
> > wrote:
> >>
> >> Hi Team,
> >>
> >> We are currently running a Confluent 3.0.0 Kafka cluster in our
> >> production environment, and we are planning to upgrade the cluster to
> >> Confluent 3.2.2.
> >> Our topics hold millions of records, with data continuously published
> >> to them, and we also use other Confluent services (Schema Registry,
> >> Kafka Connect and Kafka REST) to process the data.
> >>
> >> So, we can't afford a downtime upgrade of the platform.
> >>
> >> We have tried a rolling Kafka upgrade in our development environment,
> >> as suggested in the documentation:
> >>
> >> https://docs.confluent.io/3.2.2/upgrade.html
> >>
> >> https://kafka.apache.org/documentation/#upgrade
> >>
> >> But we are observing data loss on topics while doing the rolling
> >> upgrade / restart of the Kafka brokers for
> >> "inter.broker.protocol.version=0.10.2".
> >>
> >> Based on our observations, we suspect the following root cause for the
> >> data loss (explained for a topic partition with 3 replicas):
> >>
> >> - As the broker protocol version is updated from 0.10.0 to 0.10.2 in
> >> rolling fashion, the in-sync replicas running the older version will
> >> not allow the updated (0.10.2) replicas back into the ISR until all of
> >> them are updated.
> >> - We have also explicitly disabled "unclean.leader.election.enable",
> >> so only in-sync replicas can be elected leader for a given partition.
> >> - During the rolling update, as mentioned above, the older-version
> >> leader does not allow the newer-version replicas into the ISR, so
> >> data pushed through this older-version leader is not replicated to the
> >> other replicas. When this older-version leader goes down for its own
> >> upgrade, the other updated replicas appear in the in-sync column and
> >> one of them becomes leader, but they lag behind the old leader and
> >> only expose offsets up to the point they had replicated.
> >> - Once the last replica comes back up with the updated version, it
> >> starts syncing data from the current leader.
> >>
> >> Please share your comments on our observations and suggest the proper
> >> way to do a rolling Kafka upgrade, as we can't afford downtime.
> >>
> >> Thanks,
> >> Yogesh
> >>
> >

--
Scott Reynolds
Principal Engineer
twilio <http://www.twilio.com/>
MOBILE (630) 254-2474
EMAIL sreyno...@twilio.com
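For reference, the two-phase rolling upgrade that the linked documentation describes looks roughly like the sketch below, expressed as server.properties settings. The version values assume the 0.10.0 -> 0.10.2 path discussed in this thread; adjust them to the versions you are actually running.

    # Phase 1: before swapping in the new binaries, pin both properties to
    # the version the cluster is CURRENTLY running, then upgrade and
    # restart the brokers one at a time.
    inter.broker.protocol.version=0.10.0
    log.message.format.version=0.10.0

    # Phase 2: only after every broker is running the new code, raise the
    # protocol version and do a second rolling restart. The message format
    # version can be bumped in a later pass once all clients are upgraded.
    inter.broker.protocol.version=0.10.2
    log.message.format.version=0.10.2

The durability-related settings below limit exposure to leader changes during the rolling restarts; the values are illustrative, not a confirmed fix for the loss described above.

    unclean.leader.election.enable=false   # broker/topic setting, already disabled in this thread
    min.insync.replicas=2                  # broker/topic setting
    # plus acks=all on the producers publishing to the affected topics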