Are you on Linux? We have seen this pattern (user/sys time low and real
time high in GC time) before. In our case, the problem was due to disk
I/Os. When there are lots of dirty pages (in our case, this is caused by
log4j logging), Linux can draft user threads (in this case GC threads) to
flush the dirty pages. So, all those time in real was spent on disk I/Os,
rather than real GCs. The fix is to tune dirty_expire_centisecs and
dirty_writeback_centisecs
to flush dirty pages more frequently to avoid such drafting.

Thanks,

Jun


On Wed, Jul 2, 2014 at 1:32 PM, Andrew Otto <o...@wikimedia.org> wrote:

> Hi again!
>
> I've been having this issue consistently since I first started this thread,
> but it was happening infrequently enough for me to brush it aside and just
> run an election to rebalance brokers.
>
> I recently expanded (and reinstalled) our Kafka cluster so that it now has
> 4 brokers with a default replication factor of 3 for each partition.  I
> also switched over to the G1GC as recommended here:
> https://kafka.apache.org/081/ops.html (even though we are still running
> Kafka 0.8.0; we hope to upgrade soon).
>
> Now, only one of the 4 brokers (analytics1021, the same problem broker we
> saw before) gets its ZK connection expired even more frequently.
>  Previously it was less than once a week, now I am seeing this happen
> multiple times a day.
>
> I've posted all the relevant logs from a recent event here:
> https://gist.github.com/ottomata/e42480446c627ea0af22
>
> This includes the GC log on the offending Kafka broker during the time this
> happened.  I am pretty green when it comes to GC tuning, but I do see this
> interesting stat:
>
>
>  [Times: user=0.14 sys=0.00, real=11.47 secs]
>
> Did Kafka's JVM really just take 11.47 secs to do a GC there?  I'm
> probably missing something, but I don't see which part of that real
> time summary makes up the bulk of that GC time
>
> This is strange, riight?  This broker is identically configured to all
> its peers, and should be handling on average the exact same amount and
> type of traffic.  Anyone have any advice?
>
> Thanks!
> -Andrew Otto
>
>
>
>
> On Fri, Mar 21, 2014 at 6:48 PM, Neha Narkhede <neha.narkh...@gmail.com>
> wrote:
>
> > I see, that makes sense. Thanks a lot for clarifying!
> >
> > -Neha
> >
> >
> > On Fri, Mar 21, 2014 at 11:01 AM, Bae, Jae Hyeon <metac...@gmail.com>
> > wrote:
> >
> > > Let me clarify the situation. I forgot to mention that my case might
> not
> > be
> > > general one because Netflix is using Apache Curator as the main
> zookeeper
> > > client and ZkClient in Kafka should be bridged to Apache Curator, so
> the
> > > behavior I have seen might not be general one.
> > >
> > > Kafka's ZKSessionExpireListener.handleNewSession() is reinstating all
> > > ephemeral nodes and watchers but handleNewSession() was not kicked in
> my
> > > case. So, I created Netflix internal version of ZkClient to replace
> > > ephemeral node creation and watcher reinstating.
> > >
> > > I have a plan to remove all external dependency from Kafka soon and I
> > might
> > > be able to mention more about Kafka's zk resiliency.
> > >
> > >
> > > On Fri, Mar 21, 2014 at 7:51 AM, Jun Rao <jun...@gmail.com> wrote:
> > >
> > > > Hi, Jae,
> > > >
> > > > My understanding is that ephemeral nodes can only be lost on ZK
> session
> > > > expiration. Are you saying that there are other cases when ephemeral
> > > nodes
> > > > can be lost?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Thu, Mar 20, 2014 at 9:52 PM, Bae, Jae Hyeon <metac...@gmail.com>
> > > > wrote:
> > > >
> > > > > This issue is zookeeper resiliency.
> > > > >
> > > > > What I have done is, ephemeral node creation is replaced by Apache
> > > > > Curator's PersistentEphemeralNode recipe, to reinstate ephemeral
> > nodes
> > > > > after zookeeper blip. Also, all watchers also should be reinstated.
> > > Kafka
> > > > > internally only handles session expired event but there are many
> bad
> > > > > situations of zookeepers where Kafka can go haywire.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Mar 20, 2014 at 9:23 PM, Neha Narkhede <
> > > neha.narkh...@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > 1.  What does 'zookeeper state changed (Expired)' mean?
> > > > > >
> > > > > > 2.  Has anyone seen issues like this before?  Where zookeeper
> > > > connections
> > > > > > are flaky enough to cause leader elections?
> > > > > >
> > > > > > It means zookeeper expired the session. The most common reason
> for
> > > this
> > > > > is
> > > > > > client side GC (in your case, client is the Kafka broker) though
> if
> > > > your
> > > > > > zookeeper setup is flaky, this might even point to a server side
> > I/O
> > > > > > bottleneck or GC. Could you inspect the broker and zookeeper's gc
> > > logs?
> > > > > >
> > > > > >
> > > > > > On Thu, Mar 20, 2014 at 3:26 PM, Andrew Otto <o...@wikimedia.org
> >
> > > > wrote:
> > > > > >
> > > > > > > Hm, just saw something a little fishy.
> > > > > > >
> > > > > > > (In the following logs, analytics1021 (id 21) and analytics1022
> > (id
> > > > 22)
> > > > > > > are Brokers and analytics1023,analytics1024,analytics1025 are
> > > > > > Zookeepers.)
> > > > > > >
> > > > > > > At 2014-03-20 21:12:26, analytics1021 lost its connection to
> > > > zookeeper.
> > > > > > It
> > > > > > > reconnected to analytics1023, but as it was doing so, it
> logged a
> > > > > > > 'zookeeper state changed (Expired)' message, which apparently
> > > caused
> > > > > the
> > > > > > > newly opened socket to close again.  Am I reading that right?
> > > > > > >
> > > > > > > In the time that it took analytics1021 to finally re-establish
> a
> > > > > > zookeeper
> > > > > > > connection, analytics1022 had become the leader for all
> > partitions.
> > > >  I
> > > > > > can
> > > > > > > rebalance the leaders, but I'd prefer if this didn't happen in
> > the
> > > > > first
> > > > > > > place.
> > > > > > >
> > > > > > > 1.  What does 'zookeeper state changed (Expired)' mean?
> > > > > > > 2.  Has anyone seen issues like this before?  Where zookeeper
> > > > > connections
> > > > > > > are flaky enough to cause leader elections?
> > > > > > >
> > > > > > > Thanks!
> > > > > > > -Andrew Otto
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > [2014-03-20 21:12:26,427] 685698013
> > > > > > > [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - Client session timed out,
> have
> > > not
> > > > > > heard
> > > > > > > from server in 10067ms for sessionid 0x1425d903a3f1f89, closing
> > > > socket
> > > > > > > connection and attempting reconnect
> > > > > > > [2014-03-20 21:12:26,528] 685698114 [main-EventThread] INFO
> > > > > > > org.I0Itec.zkclient.ZkClient - zookeeper state changed
> > > (Disconnected)
> > > > > > > [2014-03-20 21:12:27,252] 685698838
> > > > > > > [main-SendThread(analytics1025.eqiad.wmnet:2181)] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - Opening socket connection to
> > > server
> > > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181
> > > > > > > [2014-03-20 21:12:27,271] 685698857
> > > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - Socket connection established
> > to
> > > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating session
> > > > > > > [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO
> > > > > > > org.I0Itec.zkclient.ZkClient - zookeeper state changed
> (Expired)
> > > > > > > [2014-03-20 21:12:27,273] 685698859
> > > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - Unable to reconnect to
> > ZooKeeper
> > > > > > service,
> > > > > > > session 0x1425d903a3f1f89 has expired, closing socket
> connection
> > > > > > > [2014-03-20 21:12:27,273] 685698859 [main-EventThread] INFO
> > > > > > > org.apache.zookeeper.ZooKeeper - Initiating client connection,
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> connectString=analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad
> > > > > > > sessionTimeout=6000
> watcher=org.I0Itec.zkclient.ZkClient@63203b59
> > > > > > > [2014-03-20 21:12:27,290] 685698876
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> > > > > > > INFO kafka.server.KafkaZooKeeper - re-registering broker info
> in
> > ZK
> > > > for
> > > > > > > broker 21
> > > > > > > [2014-03-20 21:12:27,290] 685698876 [main-EventThread] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - EventThread shut down
> > > > > > > [2014-03-20 21:12:27,290] 685698876 [main-SendThread()] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - Opening socket connection to
> > > server
> > > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181
> > > > > > > [2014-03-20 21:12:27,292] 685698878
> > > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - Socket connection established
> > to
> > > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181, initiating session
> > > > > > > [2014-03-20 21:12:27,313] 685698899
> > > > > > > [main-SendThread(analytics1023.eqiad.wmnet:2181)] INFO
> > > > > > > org.apache.zookeeper.ClientCnxn - Session establishment
> complete
> > on
> > > > > > server
> > > > > > > analytics1023.eqiad.wmnet/10.64.5.15:2181, sessionid =
> > > > > > > 0xff429b4201dd2352, negotiated timeout = 6000
> > > > > > > [2014-03-20 21:12:27,314] 685698900 [main-EventThread] INFO
> > > > > > > org.I0Itec.zkclient.ZkClient - zookeeper state changed
> > > > (SyncConnected)
> > > > > > > [2014-03-20 21:12:27,324] 685698910
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> > > > > > > INFO kafka.server.KafkaZooKeeper - done re-registering broker
> > > > > > > [2014-03-20 21:12:27,324] 685698910
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> > > > > > > INFO kafka.server.KafkaZooKeeper - Subscribing to
> /brokers/topics
> > > > path
> > > > > to
> > > > > > > watch for new topics
> > > > > > > [2014-03-20 21:12:27,370] 685698956
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> [ZkClient-EventThread-24-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad]
> > > > > > > INFO kafka.server.ZookeeperLeaderElector$LeaderChangeListener -
> > New
> > > > > > leader
> > > > > > > is 22
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to