This sounds an awful lot like GC activity. Have you checked to see if you are 
having full GCs that would pause your application and cause a timeout?

-Clark

Sent from my iPhone

> On Oct 7, 2015, at 8:39 AM, John Firth (JIRA) <j...@apache.org> wrote:
> 
> 
>     [ 
> https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
> 
> John Firth updated KAFKA-2572:
> ------------------------------
>    Description: 
> On several occasions, we've seen our process enter a cycle of: zk session 
> expiry; new session creation; rebalancing activity; pause during which 
> nothing is heard from the zk server. Sometimes, the reconnections are 
> successful, elements are pulled from Kafka, but then disconnection and 
> reconnection occurs shortly thereafter, causing OOMs when new elements are 
> pulled in (although OOMs were not seen in the two cases attached as 
> examples). Restarting the process that uses the zk client resolved the 
> problems in both cases.
> 
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity. Unfortunately, debug logging was not enabled during 
> these events.
> 
> The 09/11 case shows repeated cycles of session expiry, followed by 
> rebalancing activity, followed by a pause during which nothing is heard from 
> the zk server, followed by a session timeout. A stable session seems to have 
> been established at 2015-09-11T04:13:47.140-04:00, but messages of the form 
> "I wrote this conflicted ephemeral node 
> [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
>  at 
> /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
>  a while back in a different session, hence I will backoff for this node to 
> be deleted by Zookeeper and retry" were logged out repeatedly until we 
> restarted the process after 2015-09-11T07:44:47.124-04:00, which marks the 
> final entry in the log.
> 
> The 09/08 case is a little more straightforward than the 09/11 case, in that 
> a stable session was not established prior to our restarting the process.
> It's perhaps also noteworthy that in the 09/08 case, two timeouts for the 
> same session are seen during a single rebalance, at 
> 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. The 
> rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
> The connection to ZK expires and is restablished multiple times before the 
> process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
> entry in the logs for this day.
> 
>  was:
> On two occasions, we've seen our process enter a cycle of: zk session expiry; 
> new session creation; rebalancing activity; pause during which nothing is 
> heard from the zk server. Sometimes, the reconnections are successful, 
> elements are pulled from Kafka, but then disconnection and reconnection 
> occurs shortly thereafter, causing OOMs when new elements are pulled in. 
> Restarting the process that uses the zk client resolved the problems in both 
> cases.
> 
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
> all logs entries minus entries particular to our application. For 09/08, the 
> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
> only error and warning entries, and entries containing any of: "begin 
> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
> excised for clarity. Unfortunately, debug logging was not enabled during 
> these events.
> 
> The 09/11 case shows repeated cycles of session expiry, followed by 
> rebalancing activity, followed by a pause during which nothing is heard from 
> the zk server, followed by a session timeout. A stable session seems to have 
> been established at 2015-09-11T04:13:47.140-04:00, but messages of the form 
> "I wrote this conflicted ephemeral node 
> [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
>  at 
> /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
>  a while back in a different session, hence I will backoff for this node to 
> be deleted by Zookeeper and retry" were logged out repeatedly until we 
> restarted the process after 2015-09-11T07:44:47.124-04:00, which marks the 
> final entry in the log.
> 
> The 09/08 case is a little more straightforward than the 09/11 case, in that 
> a stable session was not established prior to our restarting the process.
> It's perhaps also noteworthy that in the 09/08 case, two timeouts for the 
> same session are seen during a single rebalance, at 
> 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. The 
> rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
> The connection to ZK expires and is restablished multiple times before the 
> process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
> entry in the logs for this day.
> 
> 
>> zk connection instability
>> -------------------------
>> 
>>                Key: KAFKA-2572
>>                URL: https://issues.apache.org/jira/browse/KAFKA-2572
>>            Project: Kafka
>>         Issue Type: Bug
>>         Components: zkclient
>>   Affects Versions: 0.8.2.1
>>        Environment: zk version 3.4.6,
>> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>>           Reporter: John Firth
>>        Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 
>> 091115-full.log.zip
>> 
>> 
>> On several occasions, we've seen our process enter a cycle of: zk session 
>> expiry; new session creation; rebalancing activity; pause during which 
>> nothing is heard from the zk server. Sometimes, the reconnections are 
>> successful, elements are pulled from Kafka, but then disconnection and 
>> reconnection occurs shortly thereafter, causing OOMs when new elements are 
>> pulled in (although OOMs were not seen in the two cases attached as 
>> examples). Restarting the process that uses the zk client resolved the 
>> problems in both cases.
>> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show 
>> all logs entries minus entries particular to our application. For 09/08, the 
>> time span is 2015-09-08T12:52:06.069-04:00 to 2015-09-08T13:14:48.250-04:00; 
>> for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00 to 
>> 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining 
>> only error and warning entries, and entries containing any of: "begin 
>> rebalancing", "end rebalancing", "timed", and "zookeeper state". For the 
>> 09/11 digest logs, entries from the kafka.network.Processor logger are also 
>> excised for clarity. Unfortunately, debug logging was not enabled during 
>> these events.
>> The 09/11 case shows repeated cycles of session expiry, followed by 
>> rebalancing activity, followed by a pause during which nothing is heard from 
>> the zk server, followed by a session timeout. A stable session seems to have 
>> been established at 2015-09-11T04:13:47.140-04:00, but messages of the form 
>> "I wrote this conflicted ephemeral node 
>> [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
>>  at 
>> /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
>>  a while back in a different session, hence I will backoff for this node to 
>> be deleted by Zookeeper and retry" were logged out repeatedly until we 
>> restarted the process after 2015-09-11T07:44:47.124-04:00, which marks the 
>> final entry in the log.
>> The 09/08 case is a little more straightforward than the 09/11 case, in that 
>> a stable session was not established prior to our restarting the process.
>> It's perhaps also noteworthy that in the 09/08 case, two timeouts for the 
>> same session are seen during a single rebalance, at 
>> 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00. The 
>> rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
>> The connection to ZK expires and is restablished multiple times before the 
>> process is killed after 2015-09-08T13:13:41.655-04:00, which marks the last 
>> entry in the logs for this day.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)

Reply via email to