This sounds an awful lot like GC activity. Have you checked to see if you are having full GCs that would pause your application and cause a timeout?
-Clark Sent from my iPhone > On Oct 7, 2015, at 8:39 AM, John Firth (JIRA) <j...@apache.org> wrote: > > > [ > https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > ] > > John Firth updated KAFKA-2572: > ------------------------------ > Description: > On several occasions, we've seen our process enter a cycle of: zk session > expiry; new session creation; rebalancing activity; pause during which > nothing is heard from the zk server. Sometimes, the reconnections are > successful, elements are pulled from Kafka, but then disconnection and > reconnection occurs shortly thereafter, causing OOMs when new elements are > pulled in (although OOMs were not seen in the two cases attached as > examples). Restarting the process that uses the zk client resolved the > problems in both cases. > > This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show > all logs entries minus entries particular to our application. For 09/08, the > time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; > for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to > 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining > only error and warning entries, and entries containing any of: "begin > rebalancing", "end rebalancing", "timed", and "zookeeper state". For the > 09/11 digest logs, entries from the kafka.network.Processor logger are also > excised for clarity. Unfortunately, debug logging was not enabled during > these events. > > The 09/11 case shows repeated cycles of session expiry, followed by > rebalancing activity, followed by a pause during which nothing is heard from > the zk server, followed by a session timeout. A stable session seems to have > been established at 2015-09-11T04:13:47.140-04:00, but messages of the form > "I wrote this conflicted ephemeral node > [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}] > at > /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718 > a while back in a different session, hence I will backoff for this node to > be deleted by Zookeeper and retry" were logged out repeatedly until we > restarted the process after 2015-09-11T07:44:47.124-04:00, which marks the > final entry in the log. > > The 09/08 case is a little more straightforward than the 09/11 case, in that > a stable session was not established prior to our restarting the process. > It's perhaps also noteworthy that in the 09/08 case, two timeouts for the > same session are seen during a single rebalance, at > 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. The > rebalance in question begins at 2015-09-08T12:52:06.667-04:00. > The connection to ZK expires and is restablished multiple times before the > process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last > entry in the logs for this day. > > was: > On two occasions, we've seen our process enter a cycle of: zk session expiry; > new session creation; rebalancing activity; pause during which nothing is > heard from the zk server. Sometimes, the reconnections are successful, > elements are pulled from Kafka, but then disconnection and reconnection > occurs shortly thereafter, causing OOMs when new elements are pulled in. > Restarting the process that uses the zk client resolved the problems in both > cases. > > This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show > all logs entries minus entries particular to our application. For 09/08, the > time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; > for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to > 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining > only error and warning entries, and entries containing any of: "begin > rebalancing", "end rebalancing", "timed", and "zookeeper state". For the > 09/11 digest logs, entries from the kafka.network.Processor logger are also > excised for clarity. Unfortunately, debug logging was not enabled during > these events. > > The 09/11 case shows repeated cycles of session expiry, followed by > rebalancing activity, followed by a pause during which nothing is heard from > the zk server, followed by a session timeout. A stable session seems to have > been established at 2015-09-11T04:13:47.140-04:00, but messages of the form > "I wrote this conflicted ephemeral node > [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}] > at > /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718 > a while back in a different session, hence I will backoff for this node to > be deleted by Zookeeper and retry" were logged out repeatedly until we > restarted the process after 2015-09-11T07:44:47.124-04:00, which marks the > final entry in the log. > > The 09/08 case is a little more straightforward than the 09/11 case, in that > a stable session was not established prior to our restarting the process. > It's perhaps also noteworthy that in the 09/08 case, two timeouts for the > same session are seen during a single rebalance, at > 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. The > rebalance in question begins at 2015-09-08T12:52:06.667-04:00. > The connection to ZK expires and is restablished multiple times before the > process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last > entry in the logs for this day. > > >> zk connection instability >> ------------------------- >> >> Key: KAFKA-2572 >> URL: https://issues.apache.org/jira/browse/KAFKA-2572 >> Project: Kafka >> Issue Type: Bug >> Components: zkclient >> Affects Versions: 0.8.2.1 >> Environment: zk version 3.4.6, >> CentOS 6, 2.6.32-504.1.3.el6.x86_64 >> Reporter: John Firth >> Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, >> 091115-full.log.zip >> >> >> On several occasions, we've seen our process enter a cycle of: zk session >> expiry; new session creation; rebalancing activity; pause during which >> nothing is heard from the zk server. Sometimes, the reconnections are >> successful, elements are pulled from Kafka, but then disconnection and >> reconnection occurs shortly thereafter, causing OOMs when new elements are >> pulled in (although OOMs were not seen in the two cases attached as >> examples). Restarting the process that uses the zk client resolved the >> problems in both cases. >> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show >> all logs entries minus entries particular to our application. For 09/08, the >> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; >> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to >> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining >> only error and warning entries, and entries containing any of: "begin >> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the >> 09/11 digest logs, entries from the kafka.network.Processor logger are also >> excised for clarity. Unfortunately, debug logging was not enabled during >> these events. >> The 09/11 case shows repeated cycles of session expiry, followed by >> rebalancing activity, followed by a pause during which nothing is heard from >> the zk server, followed by a session timeout. A stable session seems to have >> been established at 2015-09-11T04:13:47.140-04:00, but messages of the form >> "I wrote this conflicted ephemeral node >> [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}] >> at >> /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718 >> a while back in a different session, hence I will backoff for this node to >> be deleted by Zookeeper and retry" were logged out repeatedly until we >> restarted the process after 2015-09-11T07:44:47.124-04:00, which marks the >> final entry in the log. >> The 09/08 case is a little more straightforward than the 09/11 case, in that >> a stable session was not established prior to our restarting the process. >> It's perhaps also noteworthy that in the 09/08 case, two timeouts for the >> same session are seen during a single rebalance, at >> 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. The >> rebalance in question begins at 2015-09-08T12:52:06.667-04:00. >> The connection to ZK expires and is restablished multiple times before the >> process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last >> entry in the logs for this day. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332)