Thanks Neha. I did not take a thread dump before restarting; I will get one the next time it happens. We are running with a 16 GB JVM heap. Do you have a recommendation on JVM GC options?
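For reference, this is roughly what I was planning to try on the mirror maker JVM, plus how I plan to capture the thread dump the next time it hangs. The G1 values and the GC log path below are just my guesses for a 16 GB heap, not settings we have validated, so please correct me if they look off:

    # Candidate GC options (assumed values for a 16 GB heap on JDK 7 with G1):
    -Xms16g -Xmx16g
    -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
    # GC logging, so we can confirm whether long pauses line up with the session timeouts:
    -Xloggc:/var/log/kafka/mirrormaker-gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps

    # Thread dump of the mirror maker process the next time it stops consuming:
    jstack -l <mirrormaker-pid> > /tmp/mirrormaker-threaddump.txt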
Thanks,
Raja.

On Tue, Sep 3, 2013 at 12:26 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> 2013-09-01 05:59:27,792 [main-EventThread] INFO (org.I0Itec.zkclient.ZkClient) - zookeeper state changed (Disconnected)
> 2013-09-01 05:59:27,692 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not heard from server in 4002ms for sessionid 0x140c603da5b0032, closing socket connection and attempting reconnect
>
> This indicates that your mirror maker and/or your zookeeper cluster is GCing for long periods of time. I have observed that if "client session timed out" happens too many times, the client tends to lose zookeeper watches. This is a potential bug in zookeeper. If this happens, your mirror maker instance might not rebalance correctly and will start losing data.
>
> You mentioned consumption/production stopped on your mirror maker; could you please take a thread dump and point us to it? Meanwhile, you might want to fix the GC pauses.
>
> Thanks,
> Neha
>
> On Tue, Sep 3, 2013 at 8:59 AM, Rajasekar Elango <rela...@salesforce.com> wrote:
>
> > We found that the mirror maker stopped consuming and producing over the weekend (09/01). We are just seeing "Client session timed out" messages in the mirror maker log. I restarted it today (09/03) to resume processing. Here are the log lines, in reverse order:
> >
> > 2013-09-03 14:20:40,918 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.utils.VerifiableProperties) - Verifying properties
> > 2013-09-03 14:20:40,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], begin rebalancing consumer mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506 try #1
> > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Committing all offsets after clearing the fetcher queues
> > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Cleared the data chunks in all the consumer message iterators
> > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Cleared all relevant queues for this fetcher
> > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ConsumerFetcherManager) - [ConsumerFetcherManager-1378218012760] All connections stopped
> > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ConsumerFetcherManager) - [ConsumerFetcherManager-1378218012760] Stopping all fetchers
> > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ConsumerFetcherManager) - [ConsumerFetcherManager-1378218012760] Stopping leader finder thread
> > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Rebalancing attempt failed. Clearing the cache before the next rebalancing operation is triggered
> > 2013-09-03 14:20:38,876 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], end rebalancing consumer mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506 try #0
> > 2013-09-01 05:59:29,069 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Socket connection established to mandm-zookeeper-asg.data.sfdc.net/10.228.48.38:2181, initiating session
> > 2013-09-01 05:59:29,069 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Opening socket connection to server mandm-zookeeper-asg.data.sfdc.net/10.228.48.38:2181
> > 2013-09-01 05:59:27,792 [main-EventThread] INFO (org.I0Itec.zkclient.ZkClient) - zookeeper state changed (Disconnected)
> > 2013-09-01 05:59:27,692 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not heard from server in 4002ms for sessionid 0x140c603da5b0032, closing socket connection and attempting reconnect
> >
> > As you can see, no log lines appeared after 2013-09-01 05:59:29. I checked the lag using ConsumerOffsetChecker and observed that the log size and lag keep growing, but the mirror maker's offset stays the same. We have two mirror maker processes running, and both of them had the same issue during the same time frame. Any hint on what the problem could be? How do we go about troubleshooting this?
> >
> > Thanks in advance.
> >
> > --
> > Thanks,
> > Raja.
>

--
Thanks,
Raja.