16 GB is a very large heap. GC tuning becomes trickier as the size of the heap increases. Are you sure you need that much memory to operate the mirror maker? For us, the GC settings documented here have worked well: https://cwiki.apache.org/confluence/display/KAFKA/Operations#Operations-Java
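To see how often the "Client session timed out" message shows up, a minimal sketch like the one below could scan the mirror maker log. It assumes the log layout seen in the excerpt quoted in this thread; the function name find_session_timeouts is hypothetical and not part of Kafka or ZooKeeper.

```python
import re
from datetime import datetime

# Pattern for the ZooKeeper client timeout message, modelled on the log
# excerpt in this thread (assumption: the same log4j layout is in use).
TIMEOUT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Client session timed out, have not heard from server in "
    r"(?P<ms>\d+)ms for sessionid (?P<sid>0x[0-9a-fA-F]+)"
)


def find_session_timeouts(lines):
    """Return a (timestamp, silence_ms, session_id) tuple per timeout line."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # %f accepts the three-digit millisecond field after the comma.
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, int(match.group("ms")), match.group("sid")))
    return events
```

Passing an open log file to find_session_timeouts gives one tuple per timeout. Many entries close together would be consistent with long GC pauses, though the GC logs are the place to confirm that.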

Thanks,
Neha

On Tue, Sep 3, 2013 at 10:40 AM, Rajasekar Elango <rela...@salesforce.com> wrote:

> Thanks Neha,
>
> I did not take a thread dump before restarting, will get it when it happens
> again. We are using 16 Gigs of jvm heap. Do you have a recommendation on
> jvm GC options?
>
> Thanks,
> Raja.
>
> On Tue, Sep 3, 2013 at 12:26 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > 2013-09-01 05:59:27,792 [main-EventThread] INFO (org.I0Itec.zkclient.ZkClient) - zookeeper state changed (Disconnected)
> > 2013-09-01 05:59:27,692 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not heard from server in 4002ms for sessionid 0x140c603da5b0032, closing socket connection and attempting reconnect
> >
> > This indicates that your mirror maker and/or your zookeeper cluster is
> > GCing for long periods of time. I have observed that if "client session
> > timed out" happens too many times, the client tends to lose zookeeper
> > watches. This is a potential bug in zookeeper. If this happens, your mirror
> > maker instance might not rebalance correctly and will start losing data.
> >
> > You mentioned consumption/production stopped on your mirror maker, could
> > you please take a thread dump and point us to it? Meanwhile, you might want
> > to fix the GC pauses.
> >
> > Thanks,
> > Neha
> >
> > On Tue, Sep 3, 2013 at 8:59 AM, Rajasekar Elango <rela...@salesforce.com> wrote:
> >
> > > We found that mirrormaker stopped consuming and producing over the weekend
> > > (09/01). Just seeing "Client session timed out" messages in mirrormaker
> > > log. I restarted it today 09/03 to resume processing. Here are the log
> > > lines in reverse order.
> > >
> > > 2013-09-03 14:20:40,918 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.utils.VerifiableProperties) - Verifying properties
> > > 2013-09-03 14:20:40,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], begin rebalancing consumer mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506 try #1
> > > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Committing all offsets after clearing the fetcher queues
> > > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Cleared the data chunks in all the consumer message iterators
> > > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Cleared all relevant queues for this fetcher
> > > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ConsumerFetcherManager) - [ConsumerFetcherManager-1378218012760] All connections stopped
> > > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ConsumerFetcherManager) - [ConsumerFetcherManager-1378218012760] Stopping all fetchers
> > > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ConsumerFetcherManager) - [ConsumerFetcherManager-1378218012760] Stopping leader finder thread
> > > 2013-09-03 14:20:38,877 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], Rebalancing attempt failed. Clearing the cache before the next rebalancing operation is triggered
> > > 2013-09-03 14:20:38,876 [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor] INFO (kafka.consumer.ZookeeperConsumerConnector) - [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506], end rebalancing consumer mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506 try #0
> > > 2013-09-01 05:59:29,069 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Socket connection established to mandm-zookeeper-asg.data.sfdc.net/10.228.48.38:2181, initiating session
> > > 2013-09-01 05:59:29,069 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Opening socket connection to server mandm-zookeeper-asg.data.sfdc.net/10.228.48.38:2181
> > > 2013-09-01 05:59:27,792 [main-EventThread] INFO (org.I0Itec.zkclient.ZkClient) - zookeeper state changed (Disconnected)
> > > 2013-09-01 05:59:27,692 [main-SendThread(mandm-zookeeper-asg.data.sfdc.net:2181)] INFO (org.apache.zookeeper.ClientCnxn) - Client session timed out, have not heard from server in 4002ms for sessionid 0x140c603da5b0032, closing socket connection and attempting reconnect
> > >
> > > As you can see, no log lines appeared after 2013-09-01 05:59:29. I checked
> > > lag using consumerOffsetChecker and observed that log size and lag are
> > > growing, but the offset of mirrormaker remains the same. We have two
> > > mirrormaker processes running and both of them had the same issue during
> > > the same time frame. Any hint on what the problem could be? How do we go
> > > about troubleshooting this?
> > >
> > > Thanks in advance.
> > >
> > > --
> > > Thanks,
> > > Raja.
>
> --
> Thanks,
> Raja.