Ah!  That makes so much sense, and is likely the cause of our lost messages.

Thanks, I am now experimenting with other ack values.  I’m a little worried
about latency, especially since some of our producers send traffic across
the Atlantic (we don’t use MirrorMaker :/ ).
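
In case it helps anyone else weighing the same trade-off, here is a minimal
sketch of the stronger setting using the plain Java 0.8 producer API (our real
producer is varnishkafka/librdkafka, so this is only illustrative; the broker
list, topic name, and timeout below are placeholders):

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class AckAllProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker list.
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            // -1 waits for the full ISR before acking; 1 only waits for the
            // partition leader and can lose acked messages on leader failover.
            props.put("request.required.acks", "-1");
            // How long the broker may wait to satisfy request.required.acks
            // before returning an error -- this is where the trans-Atlantic
            // round trip would show up.
            props.put("request.timeout.ms", "10000");
            props.put("serializer.class", "kafka.serializer.StringEncoder");

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
            // "webrequest" is a placeholder topic name.
            producer.send(new KeyedMessage<String, String>("webrequest", "test message"));
            producer.close();
        }
    }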




On Oct 1, 2014, at 10:14 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> But, as long as the broker with the unreplicated messages comes back into
> the ISR, those messages should be replicated, no?
> 
> Or, would the fact that a new leader has been chosen for a partition,
> and new messages have been sent to that new leader, cause the offending
> broker to drop the unreplicated messages when it comes back into the ISR?
> 
> The moment the leader moves to another broker, the partition's source of
> truth is the new broker's log and other followers truncate their logs to
> follow the new leader. So, any unreplicated messages that didn't reach the
> new leader are lost. If the old leader rejoins the ISR, it will also
> truncate its log to follow the new leader's log.
> 
> Thanks,
> Neha
> 
> On Wed, Oct 1, 2014 at 5:48 AM, Andrew Otto <ao...@wikimedia.org> wrote:
> 
>> I understand that, but even if the leader quickly (within a few seconds)
>> rejoins the cluster?  I had thought that ack=1 meant that messages not yet
>> replicated from a broker that has a serious fatal failure (disk corruption,
>> etc.) would be lost forever.  But, as long as the broker with the unreplicated
>> messages comes back into the ISR, those messages should be replicated, no?
>> 
>> Or, would the fact that a new leader has been chosen for a partition,
>> and new messages have been sent to that new leader, cause the offending
>> broker to drop the unreplicated messages when it comes back into the ISR?
>> 
>> 
>> On Sep 30, 2014, at 7:17 PM, Jun Rao <jun...@gmail.com> wrote:
>> 
>>> With ack=1, acked messages could be lost when the leader fails.
>>> 
>>> Thanks,
>>> 
>>> Jun
>>> 
>>> On Mon, Sep 29, 2014 at 8:04 AM, Andrew Otto <o...@wikimedia.org> wrote:
>>> 
>>>> This happened again to me this weekend.  I've done some sleuthing, and
>>>> definitely can see some crazy paging stats when this lock up happens.  For
>>>> the curious, more info can be found here:
>>>> https://bugzilla.wikimedia.org/show_bug.cgi?id=69667.  I had tuned
>>>> dirty_expire_centisecs from 30 seconds to 10, but this still happened
>>>> again.  I'll continue to troubleshoot and tune.  This is slow going
>>>> because it is not regularly reproducible.  I have to make a single change
>>>> and then wait a week or two for the timeout to occur.
>>>> 
>>>> Here's a related question.  When the timeout happens, we lose some
>>>> messages.  Our producer is varnishkafka, which uses the librdkafka
>>>> producer client.  librdkafka keeps track of produce errors.  We have
>>>> kafka.topic.request.required.acks = 1.  According to librdkafka, all
>>>> messages sent have been ACKed by the leader of the partition to which the
>>>> messages are sent.  Also, when we lose messages due to this timeout, the
>>>> broker that times out is always the controller.  When it attempts to
>>>> reconnect to Zookeeper, we see:
>>>> 
>>>> INFO  kafka.utils.ZkUtils$  - conflict in /controller data:
>>>> {"version":1,"brokerid":21,"timestamp":"1411879981756"} stored data:
>>>> {"version":1,"brokerid":22,"timestamp":"1407187809296"}
>>>> 
>>>> In the case when a controller drops out of the ISR for a few seconds, is
>>>> it possible for this confused broker to drop ACKed messages?
>>>> 
>>>> 
>>>> 
>>>> On Thu, Jul 3, 2014 at 12:48 AM, Jun Rao <jun...@gmail.com> wrote:
>>>> 
>>>>> Are you on Linux? We have seen this pattern (user/sys time low and real
>>>>> time high in GC time) before. In our case, the problem was due to disk
>>>>> I/Os. When there are lots of dirty pages (in our case, this is caused by
>>>>> log4j logging), Linux can draft user threads (in this case GC threads) to
>>>>> flush the dirty pages. So, all that time in real was spent on disk I/Os,
>>>>> rather than real GCs. The fix is to tune dirty_expire_centisecs and
>>>>> dirty_writeback_centisecs to flush dirty pages more frequently to avoid
>>>>> such drafting.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Jun
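
For anyone else hitting this, the knobs Jun mentions live under /proc/sys/vm
and are in centiseconds (the default dirty_expire_centisecs of 3000 is the 30
seconds I started from). A tiny sketch of the equivalent of sysctl -w, with
example values only -- in practice you'd just set these in /etc/sysctl.conf:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class DirtyPageTuningSketch {
        public static void main(String[] args) throws IOException {
            // Equivalent of `sysctl -w vm.dirty_expire_centisecs=1000`:
            // consider dirty pages expired after 10 seconds (example value).
            Files.write(Paths.get("/proc/sys/vm/dirty_expire_centisecs"),
                        "1000".getBytes(StandardCharsets.UTF_8));
            // Equivalent of `sysctl -w vm.dirty_writeback_centisecs=200`:
            // wake the writeback threads every 2 seconds (example value).
            Files.write(Paths.get("/proc/sys/vm/dirty_writeback_centisecs"),
                        "200".getBytes(StandardCharsets.UTF_8));
            // Needs root; on a real host persist the settings in /etc/sysctl.conf.
        }
    }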
>>>>> 
>>>>> 
>>>>> On Wed, Jul 2, 2014 at 1:32 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>>>> 
>>>>>> Hi again!
>>>>>> 
>>>>>> I've been having this issue consistently since I first started this
>>>>>> thread, but it was happening infrequently enough for me to brush it
>>>>>> aside and just run an election to rebalance brokers.
>>>>>> 
>>>>>> I recently expanded (and reinstalled) our Kafka cluster so that it now
>>>>>> has 4 brokers with a default replication factor of 3 for each partition.
>>>>>> I also switched over to the G1GC as recommended here:
>>>>>> https://kafka.apache.org/081/ops.html (even though we are still running
>>>>>> Kafka 0.8.0; we hope to upgrade soon).
>>>>>> 
>>>>>> Now, only one of the 4 brokers (analytics1021, the same problem broker
>>>>>> we saw before) gets its ZK connection expired even more frequently.
>>>>>> Previously it was less than once a week; now I am seeing this happen
>>>>>> multiple times a day.
>>>>>> 
>>>>>> I've posted all the relevant logs from a recent event here:
>>>>>> https://gist.github.com/ottomata/e42480446c627ea0af22
>>>>>> 
>>>>>> This includes the GC log on the offending Kafka broker during the time
>>>>>> this happened.  I am pretty green when it comes to GC tuning, but I do
>>>>>> see this interesting stat:
>>>>>> 
>>>>>> [Times: user=0.14 sys=0.00, real=11.47 secs]
>>>>>> 
>>>>>> Did Kafka's JVM really just take 11.47 secs to do a GC there?  I'm
>>>>>> probably missing something, but I don't see which part of that real
>>>>>> time summary makes up the bulk of that GC time.
>>>>>> 
>>>>>> This is strange, right?  This broker is identically configured to all
>>>>>> its peers, and should be handling on average the exact same amount and
>>>>>> type of traffic.  Anyone have any advice?
>>>>>> 
>>>>>> Thanks!
>>>>>> -Andrew Otto
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Mar 21, 2014 at 6:48 PM, Neha Narkhede <neha.narkh...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> I see, that makes sense. Thanks a lot for clarifying!
>>>>>>> 
>>>>>>> -Neha
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Mar 21, 2014 at 11:01 AM, Bae, Jae Hyeon <metac...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Let me clarify the situation. I forgot to mention that my case might
>>>>>>>> not be a general one, because Netflix is using Apache Curator as the
>>>>>>>> main zookeeper client and ZkClient in Kafka has to be bridged to
>>>>>>>> Apache Curator, so the behavior I have seen might not be general.
>>>>>>>> 
>>>>>>>> Kafka's ZKSessionExpireListener.handleNewSession() reinstates all
>>>>>>>> ephemeral nodes and watchers, but handleNewSession() was not kicked
>>>>>>>> off in my case. So, I created a Netflix-internal version of ZkClient
>>>>>>>> to replace the ephemeral node creation and watcher reinstating.
>>>>>>>> 
>>>>>>>> I have a plan to remove all external dependencies from Kafka soon,
>>>>>>>> and then I might be able to say more about Kafka's zk resiliency.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Mar 21, 2014 at 7:51 AM, Jun Rao <jun...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi, Jae,
>>>>>>>>> 
>>>>>>>>> My understanding is that ephemeral nodes can only be lost on ZK
>>>>>>>>> session expiration. Are you saying that there are other cases when
>>>>>>>>> ephemeral nodes can be lost?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jun
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Mar 20, 2014 at 9:52 PM, Bae, Jae Hyeon <metac...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> This issue is about zookeeper resiliency.
>>>>>>>>>> 
>>>>>>>>>> What I have done is, ephemeral node creation is replaced by Apache
>>>>>>>>>> Curator's PersistentEphemeralNode recipe, to reinstate ephemeral
>>>>>>>>>> nodes after a zookeeper blip. All watchers should also be
>>>>>>>>>> reinstated. Kafka internally only handles the session-expired
>>>>>>>>>> event, but there are many bad zookeeper situations where Kafka can
>>>>>>>>>> go haywire.
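
For reference, a minimal sketch of the Curator recipe Jae describes (this is
not how stock Kafka 0.8 registers brokers; the connect string, znode path, and
broker JSON below are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.nodes.PersistentEphemeralNode;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class BrokerRegistrationSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string and chroot.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181/kafka",
                new ExponentialBackoffRetry(1000, 3));
            client.start();

            // The recipe re-creates the ephemeral node after a session loss,
            // instead of relying on the session-expired callback alone.
            PersistentEphemeralNode node = new PersistentEphemeralNode(
                client,
                PersistentEphemeralNode.Mode.EPHEMERAL,
                "/brokers/ids/21",  // placeholder broker registration path
                "{\"host\":\"broker21\",\"port\":9092}".getBytes());
            node.start();

            // ... run the broker ...

            node.close();
            client.close();
        }
    }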
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, Mar 20, 2014 at 9:23 PM, Neha Narkhede <neha.narkh...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 1.  What does 'zookeeper state changed (Expired)' mean?
>>>>>>>>>>> 
>>>>>>>>>>> 2.  Has anyone seen issues like this before?  Where zookeeper
>>>>>>>>>>> connections are flaky enough to cause leader elections?
>>>>>>>>>>> 
>>>>>>>>>>> It means zookeeper expired the session. The most common reason for
>>>>>>>>>>> this is client-side GC (in your case, the client is the Kafka
>>>>>>>>>>> broker), though if your zookeeper setup is flaky, this might even
>>>>>>>>>>> point to a server-side I/O bottleneck or GC. Could you inspect the
>>>>>>>>>>> broker's and zookeeper's GC logs?
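
One knob worth noting alongside the GC logs: the "negotiated timeout = 6000" in
the broker log further down is the ZooKeeper session timeout, which (if I'm
reading the 0.8 broker configs right) comes from zookeeper.session.timeout.ms
and zookeeper.connection.timeout.ms in server.properties. A minimal ZkClient
sketch just to show where that number comes from -- the connect string and
values are placeholders, and a longer session timeout only buys headroom for
GC or I/O pauses at the cost of slower failure detection:

    import org.I0Itec.zkclient.ZkClient;

    public class ZkSessionTimeoutSketch {
        public static void main(String[] args) {
            int sessionTimeoutMs = 30000;   // example: tolerate longer pauses
            int connectionTimeoutMs = 6000; // example value only
            ZkClient zkClient = new ZkClient(
                "zk1:2181,zk2:2181,zk3:2181/kafka/eqiad",  // placeholder
                sessionTimeoutMs,
                connectionTimeoutMs);

            // A session that outlives a stop-the-world pause is not expired,
            // so the broker keeps its ephemeral /brokers/ids registration and
            // no leader election is triggered.
            System.out.println("children of /brokers/ids: "
                + zkClient.getChildren("/brokers/ids"));

            zkClient.close();
        }
    }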
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Mar 20, 2014 at 3:26 PM, Andrew Otto <o...@wikimedia.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hm, just saw something a little fishy.
>>>>>>>>>>>> 
>>>>>>>>>>>> (In the following logs, analytics1021 (id 21) and analytics1022 (id 22)
>>>>>>>>>>>> are Brokers and analytics1023,analytics1024,analytics1025 are
>>>>>>>>>>>> Zookeepers.)
>>>>>>>>>>>> 
>>>>>>>>>>>> At 2014-03-20 21:12:26, analytics1021 lost its connection to zookeeper.
>>>>>>>>>>>> It reconnected to analytics1023, but as it was doing so, it logged a
>>>>>>>>>>>> 'zookeeper state changed (Expired)' message, which apparently caused the
>>>>>>>>>>>> newly opened socket to close again.  Am I reading that right?
>>>>>>>>>>>> 
>>>>>>>>>>>> In the time that it took analytics1021 to finally re-establish a
>>>>>>>>>>>> zookeeper connection, analytics1022 had become the leader for all
>>>>>>>>>>>> partitions.  I can rebalance the leaders, but I'd prefer if this didn't
>>>>>>>>>>>> happen in the first place.
>>>>>>>>>>>> 
>>>>>>>>>>>> 1.  What does 'zookeeper state changed (Expired)' mean?
>>>>>>>>>>>> 2.  Has anyone seen issues like this before?  Where zookeeper
>>>>>>>>>>>> connections are flaky enough to cause leader elections?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> -Andrew Otto
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> [2014-03-20 21:12:26,427] 685698013 [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 10067ms for sessionid 0x1425d903a3f1f89, closing socket connection and attempting reconnect
>>>>>>>>>>>> [2014-03-20 21:12:26,528] 685698114 [main-EventThread] INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (Disconnected)
>>>>>>>>>>>> [2014-03-20 21:12:27,252] 685698838 [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics1023.eqiad.wmnet/10.64.5.15:2181
>>>>>>>>>>>> [2014-03-20 21:12:27,271] 685698857 [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating session
>>>>>>>>>>>> [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (Expired)
>>>>>>>>>>>> [2014-03-20 21:12:27,273] 685698859 [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x1425d903a3f1f89 has expired, closing socket connection
>>>>>>>>>>>> [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@63203b59
>>>>>>>>>>>> [2014-03-20 21:12:27,290] 685698876 [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO kafka.server.KafkaZooKeeper - re-registering broker info in ZK for broker 21
>>>>>>>>>>>> [2014-03-20 21:12:27,290] 685698876 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down
>>>>>>>>>>>> [2014-03-20 21:12:27,290] 685698876 [main-SendThread()] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server analytics1023.eqiad.wmnet/10.64.5.15:2181
>>>>>>>>>>>> [2014-03-20 21:12:27,292] 685698878 [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating session
>>>>>>>>>>>> [2014-03-20 21:12:27,313] 685698899 [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server analytics1023.eqiad.wmnet/10.64.5.15:2181, sessionid = 0xff429b4201dd2352, negotiated timeout = 6000
>>>>>>>>>>>> [2014-03-20 21:12:27,314] 685698900 [main-EventThread] INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (SyncConnected)
>>>>>>>>>>>> [2014-03-20 21:12:27,324] 685698910 [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO kafka.server.KafkaZooKeeper - done re-registering broker
>>>>>>>>>>>> [2014-03-20 21:12:27,324] 685698910 [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO kafka.server.KafkaZooKeeper - Subscribing to /brokers/topics path to watch for new topics
>>>>>>>>>>>> [2014-03-20 21:12:27,370] 685698956 [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO kafka.server.ZookeeperLeaderElector$LeaderChangeListener - New leader is 22
>>>>>>>>>>>> 
