But as long as the broker with the unreplicated messages comes back into the
ISR, those messages should be replicated, no?

Or, would the fact that a new leader has been chosen for a partition,
and new messages have been sent to that new leader, cause the offending
broker to drop the unreplicated messages when it comes back into the ISR?

The moment leadership moves to another broker, the new leader's log becomes
the partition's source of truth, and the other followers truncate their logs
to match it. So, any unreplicated messages that never reached the new leader
are lost. If the old leader rejoins the ISR, it will also truncate its log to
match the new leader's log.
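
If you want acked messages to survive a leader failure, you can ask for
acknowledgement from the full ISR instead of only the leader. Here's a
minimal sketch using the 0.8 Java producer API (the broker list and topic are
just placeholders); note that -1 waits for whatever replicas are currently in
the ISR, so it narrows the window for loss rather than closing it entirely:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class IsrAckedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker list -- substitute your own brokers.
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // -1 = wait for all replicas currently in the ISR before acking,
            // instead of just the partition leader (acks=1).
            props.put("request.required.acks", "-1");

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            producer.send(new KeyedMessage<String, String>("test-topic",
                    "an acked message"));
            producer.close();
        }
    }

Since you produce through varnishkafka/librdkafka, the corresponding setting
there should be kafka.topic.request.required.acks = -1.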

Thanks,
Neha

On Wed, Oct 1, 2014 at 5:48 AM, Andrew Otto <ao...@wikimedia.org> wrote:

> I understand that, but even if the leader quickly (within a few seconds)
> rejoins the cluster?  I had thought that ack=1 meant that messages not yet
> replicated from a broker that has a serious fatal failure (disk corruption,
> etc.) would be lost forever.  But as long as the broker with the unreplicated
> messages comes back into the ISR, those messages should be replicated, no?
>
> Or, would the fact that a new leader has been chosen for a partition,
> and new messages have been sent to that new leader, cause the offending
> broker to drop the unreplicated messages when it comes back into the ISR?
>
>
> On Sep 30, 2014, at 7:17 PM, Jun Rao <jun...@gmail.com> wrote:
>
> > With ack=1, acked messages could be lost when the leader fails.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Sep 29, 2014 at 8:04 AM, Andrew Otto <o...@wikimedia.org> wrote:
> >
> >> This happened again to me this weekend.  I've done some sleuthing, and
> >> definitely can see some crazy paging stats when this lock up happens.
> For
> >> the curious, more info can be found here:
> >> https://bugzilla.wikimedia.org/show_bug.cgi?id=69667.  I had tuned
> >> dirty_expire_centisecs from 30 seconds to 10, but this still happened
> >> again.  I'll continue to troubleshoot and tune.  This is slow going
> because it
> >> is not regularly reproducible.  I have to make a single change and then
> >> wait a week or two for the timeout to occur.
> >>
> >> Here's a related question.  When the timeout happens, we lose some
> >> messages.  Our producer is varnishkafka, which uses the librdkafka
> producer
> >> client.  librdkafka keeps track of produce errors.  We
> >> have kafka.topic.request.required.acks = 1.  According to librdkafka,
> all
> >> messages sent have been ACKed by the leader of the partition to which
> the
> >> messages are sent.  Also, when we lose messages due to this timeout, the
> >> broker that times out is always the controller.  When it attempts to
> >> reconnect to Zookeeper, we see:
> >>
> >>  INFO  kafka.utils.ZkUtils$  - conflict in /controller data:
> >> {"version":1,"brokerid":21,"timestamp":"1411879981756"} stored data:
> >> {"version":1,"brokerid":22,"timestamp":"1407187809296"}
> >>
> >> In the case when a controller drops out of the ISR for a few seconds,
> is it
> >> possible for this confused broker to drop ACKed messages?
> >>
> >>
> >>
> >> On Thu, Jul 3, 2014 at 12:48 AM, Jun Rao <jun...@gmail.com> wrote:
> >>
> >>> Are you on Linux? We have seen this pattern (user/sys time low and real
> >>> time high in GC time) before. In our case, the problem was due to disk
> >>> I/Os. When there are lots of dirty pages (in our case, this is caused
> by
> >>> log4j logging), Linux can draft user threads (in this case GC threads)
> to
> >>> flush the dirty pages. So, all that time in real was spent on disk
> I/Os,
> >>> rather than on actual GC work. The fix is to tune dirty_expire_centisecs and
> >>> dirty_writeback_centisecs
> >>> to flush dirty pages more frequently to avoid such drafting.
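> >>>
> >>> As a rough sketch (this assumes a Linux host and root; the values below
> >>> are only illustrative, tune them for your workload), the change amounts
> >>> to writing smaller values into the vm.dirty_* knobs, e.g.:
> >>>
> >>>     import java.nio.file.Files;
> >>>     import java.nio.file.Paths;
> >>>
> >>>     public class DirtyPageTuning {
> >>>         public static void main(String[] args) throws Exception {
> >>>             // Make dirty pages eligible for writeback after 10s instead
> >>>             // of the 30s default (values are in centiseconds).
> >>>             Files.write(Paths.get("/proc/sys/vm/dirty_expire_centisecs"),
> >>>                     "1000".getBytes());
> >>>             // Wake the background flusher every 1s instead of every 5s.
> >>>             Files.write(Paths.get("/proc/sys/vm/dirty_writeback_centisecs"),
> >>>                     "100".getBytes());
> >>>         }
> >>>     }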
> >>>
> >>> Thanks,
> >>>
> >>> Jun
> >>>
> >>>
> >>> On Wed, Jul 2, 2014 at 1:32 PM, Andrew Otto <o...@wikimedia.org>
> wrote:
> >>>
> >>>> Hi again!
> >>>>
> >>>> I've been having this issue consistently since I first started this
> >>> thread,
> >>>> but it was happening infrequently enough for me to brush it aside and
> >>> just
> >>>> run an election to rebalance brokers.
> >>>>
> >>>> I recently expanded (and reinstalled) our Kafka cluster so that it now
> >>> has
> >>>> 4 brokers with a default replication factor of 3 for each partition.
> I
> >>>> also switched over to the G1GC as recommended here:
> >>>> https://kafka.apache.org/081/ops.html (even though we are still
> >> running
> >>>> Kafka 0.8.0; we hope to upgrade soon).
> >>>>
> >>>> Now, only one of the 4 brokers (analytics1021, the same problem broker
> >> we
> >>>> saw before) gets its ZK connection expired even more frequently.
> >>>> Previously it was less than once a week, now I am seeing this happen
> >>>> multiple times a day.
> >>>>
> >>>> I've posted all the relevant logs from a recent event here:
> >>>> https://gist.github.com/ottomata/e42480446c627ea0af22
> >>>>
> >>>> This includes the GC log on the offending Kafka broker during the time
> >>> this
> >>>> happened.  I am pretty green when it comes to GC tuning, but I do see
> >>> this
> >>>> interesting stat:
> >>>>
> >>>>
> >>>> [Times: user=0.14 sys=0.00, real=11.47 secs]
> >>>>
> >>>> Did Kafka's JVM really just take 11.47 secs to do a GC there?  I'm
> >>>> probably missing something, but I don't see which part of that real
> >>>> time summary makes up the bulk of that GC time.
> >>>>
> >>>> This is strange, right?  This broker is identically configured to all
> >>>> its peers, and should be handling on average the exact same amount and
> >>>> type of traffic.  Anyone have any advice?
> >>>>
> >>>> Thanks!
> >>>> -Andrew Otto
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Mar 21, 2014 at 6:48 PM, Neha Narkhede <
> >> neha.narkh...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> I see, that makes sense. Thanks a lot for clarifying!
> >>>>>
> >>>>> -Neha
> >>>>>
> >>>>>
> >>>>> On Fri, Mar 21, 2014 at 11:01 AM, Bae, Jae Hyeon <metac...@gmail.com
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> Let me clarify the situation. I forgot to mention that my case might
> >>>>>> not be a general one, because Netflix uses Apache Curator as its main
> >>>>>> zookeeper client and ZkClient in Kafka has to be bridged to Apache
> >>>>>> Curator, so the behavior I have seen might not be the general one.
> >>>>>>
> >>>>>> Kafka's ZKSessionExpireListener.handleNewSession() reinstates all
> >>>>>> ephemeral nodes and watchers, but handleNewSession() did not kick in
> >>>>>> in my case. So, I created a Netflix-internal version of ZkClient to
> >>>>>> replace the ephemeral node creation and watcher reinstatement logic.
> >>>>>>
> >>>>>> I have a plan to remove all external dependencies from Kafka soon,
> >>>>>> and then I might be able to say more about Kafka's zk resiliency.
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Mar 21, 2014 at 7:51 AM, Jun Rao <jun...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi, Jae,
> >>>>>>>
> >>>>>>> My understanding is that ephemeral nodes can only be lost on ZK
> >>>> session
> >>>>>>> expiration. Are you saying that there are other cases when
> >>> ephemeral
> >>>>>> nodes
> >>>>>>> can be lost?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jun
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Mar 20, 2014 at 9:52 PM, Bae, Jae Hyeon <
> >>> metac...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> This issue is about zookeeper resiliency.
> >>>>>>>>
> >>>>>>>> What I have done is replace ephemeral node creation with Apache
> >>>>>>>> Curator's PersistentEphemeralNode recipe, so that ephemeral nodes
> >>>>>>>> are reinstated after a zookeeper blip. All watchers should also be
> >>>>>>>> reinstated. Kafka internally only handles the session-expired event,
> >>>>>>>> but there are many bad zookeeper situations where Kafka can go
> >>>>>>>> haywire.
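> >>>>>>>>
> >>>>>>>> Roughly, the recipe usage looks something like this (the connect
> >>>>>>>> string, znode path, and payload are just placeholders, not our
> >>>>>>>> actual setup):
> >>>>>>>>
> >>>>>>>>     import org.apache.curator.framework.CuratorFramework;
> >>>>>>>>     import org.apache.curator.framework.CuratorFrameworkFactory;
> >>>>>>>>     import org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode;
> >>>>>>>>     import org.apache.curator.retry.ExponentialBackoffRetry;
> >>>>>>>>
> >>>>>>>>     public class BrokerRegistration {
> >>>>>>>>         public static void main(String[] args) throws Exception {
> >>>>>>>>             CuratorFramework client = CuratorFrameworkFactory.newClient(
> >>>>>>>>                     "zk1:2181,zk2:2181,zk3:2181/kafka",
> >>>>>>>>                     new ExponentialBackoffRetry(1000, 3));
> >>>>>>>>             client.start();
> >>>>>>>>
> >>>>>>>>             // The recipe re-creates the znode whenever the session is
> >>>>>>>>             // lost, instead of relying on one handleNewSession() callback.
> >>>>>>>>             PersistentEphemeralNode node = new PersistentEphemeralNode(
> >>>>>>>>                     client,
> >>>>>>>>                     PersistentEphemeralNode.Mode.EPHEMERAL,
> >>>>>>>>                     "/brokers/ids/21",
> >>>>>>>>                     "{\"host\":\"broker21\",\"port\":9092}".getBytes());
> >>>>>>>>             node.start();
> >>>>>>>>         }
> >>>>>>>>     }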
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Mar 20, 2014 at 9:23 PM, Neha Narkhede <
> >>>>>> neha.narkh...@gmail.com
> >>>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> 1.  What does 'zookeeper state changed (Expired)' mean?
> >>>>>>>>>
> >>>>>>>>> 2.  Has anyone seen issues like this before?  Where zookeeper
> >>>>>>> connections
> >>>>>>>>> are flaky enough to cause leader elections?
> >>>>>>>>>
> >>>>>>>>> It means zookeeper expired the session. The most common
> >> reason
> >>>> for
> >>>>>> this
> >>>>>>>> is
> >>>>>>>>> client side GC (in your case, client is the Kafka broker)
> >>> though
> >>>> if
> >>>>>>> your
> >>>>>>>>> zookeeper setup is flaky, this might even point to a server
> >>> side
> >>>>> I/O
> >>>>>>>>> bottleneck or GC. Could you inspect the broker and
> >> zookeeper's
> >>> gc
> >>>>>> logs?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Mar 20, 2014 at 3:26 PM, Andrew Otto <
> >>> o...@wikimedia.org
> >>>>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hm, just saw something a little fishy.
> >>>>>>>>>>
> >>>>>>>>>> (In the following logs, analytics1021 (id 21) and
> >>> analytics1022
> >>>>> (id
> >>>>>>> 22)
> >>>>>>>>>> are Brokers and analytics1023,analytics1024,analytics1025
> >> are
> >>>>>>>>> Zookeepers.)
> >>>>>>>>>>
> >>>>>>>>>> At 2014-03-20 21:12:26, analytics1021 lost its connection
> >> to
> >>>>>>> zookeeper.
> >>>>>>>>> It
> >>>>>>>>>> reconnected to analytics1023, but as it was doing so, it
> >>>> logged a
> >>>>>>>>>> 'zookeeper state changed (Expired)' message, which
> >> apparently
> >>>>>> caused
> >>>>>>>> the
> >>>>>>>>>> newly opened socket to close again.  Am I reading that
> >> right?
> >>>>>>>>>>
> >>>>>>>>>> In the time that it took analytics1021 to finally
> >>> re-establish
> >>>> a
> >>>>>>>>> zookeeper
> >>>>>>>>>> connection, analytics1022 had become the leader for all
> >>>>> partitions.
> >>>>>>> I
> >>>>>>>>> can
> >>>>>>>>>> rebalance the leaders, but I'd prefer if this didn't happen
> >>> in
> >>>>> the
> >>>>>>>> first
> >>>>>>>>>> place.
> >>>>>>>>>>
> >>>>>>>>>> 1.  What does 'zookeeper state changed (Expired)' mean?
> >>>>>>>>>> 2.  Has anyone seen issues like this before?  Where
> >> zookeeper
> >>>>>>>> connections
> >>>>>>>>>> are flaky enough to cause leader elections?
> >>>>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>> -Andrew Otto
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> [2014-03-20 21:12:26,427] 685698013
> >>>>>>>>>> [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - Client session timed out,
> >>>> have
> >>>>>> not
> >>>>>>>>> heard
> >>>>>>>>>> from server in 10067ms for sessionid 0x1425d903a3f1f89,
> >>> closing
> >>>>>>> socket
> >>>>>>>>>> connection and attempting reconnect
> >>>>>>>>>> [2014-03-20 21:12:26,528] 685698114 [main-EventThread] INFO
> >>>>>>>>>> org.I0Itec.zkclient.ZkClient - zookeeper state changed
> >>>>>> (Disconnected)
> >>>>>>>>>> [2014-03-20 21:12:27,252] 685698838
> >>>>>>>>>> [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - Opening socket connection
> >>> to
> >>>>>> server
> >>>>>>>>>> analytics1023.eqiad.wmnet/10.64.5.15:2181
> >>>>>>>>>> [2014-03-20 21:12:27,271] 685698857
> >>>>>>>>>> [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - Socket connection
> >>> established
> >>>>> to
> >>>>>>>>>> analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating
> >>> session
> >>>>>>>>>> [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO
> >>>>>>>>>> org.I0Itec.zkclient.ZkClient - zookeeper state changed
> >>>> (Expired)
> >>>>>>>>>> [2014-03-20 21:12:27,273] 685698859
> >>>>>>>>>> [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - Unable to reconnect to
> >>>>> ZooKeeper
> >>>>>>>>> service,
> >>>>>>>>>> session 0x1425d903a3f1f89 has expired, closing socket
> >>>> connection
> >>>>>>>>>> [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO
> >>>>>>>>>> org.apache.zookeeper.ZooKeeper - Initiating client
> >>> connection,
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> connectString=analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad
> >>>>>>>>>> sessionTimeout=6000
> >>>> watcher=org.I0Itec.zkclient.ZkClient@63203b59
> >>>>>>>>>> [2014-03-20 21:12:27,290] 685698876
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> >>>>>>>>>> INFO kafka.server.KafkaZooKeeper - re-registering broker
> >> info
> >>>> in
> >>>>> ZK
> >>>>>>> for
> >>>>>>>>>> broker 21
> >>>>>>>>>> [2014-03-20 21:12:27,290] 685698876 [main-EventThread] INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - EventThread shut down
> >>>>>>>>>> [2014-03-20 21:12:27,290] 685698876 [main-SendThread()]
> >> INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - Opening socket connection
> >>> to
> >>>>>> server
> >>>>>>>>>> analytics1023.eqiad.wmnet/10.64.5.15:2181
> >>>>>>>>>> [2014-03-20 21:12:27,292] 685698878
> >>>>>>>>>> [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - Socket connection
> >>> established
> >>>>> to
> >>>>>>>>>> analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating
> >>> session
> >>>>>>>>>> [2014-03-20 21:12:27,313] 685698899
> >>>>>>>>>> [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> >>>>>>>>>> org.apache.zookeeper.ClientCnxn - Session establishment
> >>>> complete
> >>>>> on
> >>>>>>>>> server
> >>>>>>>>>> analytics1023.eqiad.wmnet/10.64.5.15:2181, sessionid =
> >>>>>>>>>> 0xff429b4201dd2352, negotiated timeout = 6000
> >>>>>>>>>> [2014-03-20 21:12:27,314] 685698900 [main-EventThread] INFO
> >>>>>>>>>> org.I0Itec.zkclient.ZkClient - zookeeper state changed
> >>>>>>> (SyncConnected)
> >>>>>>>>>> [2014-03-20 21:12:27,324] 685698910
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> >>>>>>>>>> INFO kafka.server.KafkaZooKeeper - done re-registering
> >> broker
> >>>>>>>>>> [2014-03-20 21:12:27,324] 685698910
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> >>>>>>>>>> INFO kafka.server.KafkaZooKeeper - Subscribing to
> >>>> /brokers/topics
> >>>>>>> path
> >>>>>>>> to
> >>>>>>>>>> watch for new topics
> >>>>>>>>>> [2014-03-20 21:12:27,370] 685698956
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> >>>>>>>>>> INFO
> >>> kafka.server.ZookeeperLeaderElector$LeaderChangeListener -
> >>>>> New
> >>>>>>>>> leader
> >>>>>>>>>> is 22
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
