Thanks Neha,

I did not take a thread dump before restarting, will get it when it happens
again. We are using 16 Gigs of jvm heap. Do you have a recommendation on
jvm GC options.?

Thanks,
Raja.


On Tue, Sep 3, 2013 at 12:26 PM, Neha Narkhede <neha.narkh...@gmail.com>wrote:

> 2013-09-01 05:59:27,792 [main-EventThread] INFO
>  (org.I0Itec.zkclient.ZkClient)  - zookeeper state changed (Disconnected)
> 2013-09-01 05:59:27,692 [main-SendThread(
> mandm-zookeeper-asg.data.sfdc.net:2181)] INFO
>  (org.apache.zookeeper.
> ClientCnxn)  - Client session timed out, have not
> heard from server in 4002ms for sessionid 0x140c603da5b0032, closing socket
> connection and attempting reconnect
>
> This indicates that your mirror maker and/or your zookeeper cluster is
> GCing for long periods of time. I have observed that if "client session
> timed out" happens too many times, the client tends to lose zookeeper
> watches. This is a potential bug in zookeeper. If this happens, your mirror
> maker instance might not rebalance correctly and will start losing data.
>
> You mentioned consumption/production stopped on your mirror maker, could
> you please take a thread dump and point us to it? Meanwhile, you might want
> to fix the GC pauses.
>
> Thanks,
> Neha
>
>
> On Tue, Sep 3, 2013 at 8:59 AM, Rajasekar Elango <rela...@salesforce.com
> >wrote:
>
> > We found that mirrormaker stopped consuming and producing over the week
> end
> > (09/01). Just seeing "Client session timed out" messages in mirrormaker
> > log. I restarted to it today 09/03 to resume processing. Here is the logs
> > line in reverse order.
> >
> >
> > 2013-09-03 14:20:40,918
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.utils.VerifiableProperties)  - Verifying properties
> > 2013-09-03 14:20:40,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ZookeeperConsumerConnector)  -
> > [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506],
> > begin rebalancing consumer
> > mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506 try
> #1
> > 2013-09-03 14:20:38,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ZookeeperConsumerConnector)  -
> > [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506],
> > Committing all offsets after clearing the fetcher queues
> > 2013-09-03 14:20:38,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ZookeeperConsumerConnector)  -
> > [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506],
> > Cleared the data chunks in all the consumer message iterators
> > 2013-09-03 14:20:38,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ZookeeperConsumerConnector)  -
> > [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506],
> > Cleared all relevant queues for this fetcher
> > 2013-09-03 14:20:38,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ConsumerFetcherManager)  -
> > [ConsumerFetcherManager-1378218012760] All connections stopped
> > 2013-09-03 14:20:38,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ConsumerFetcherManager)  -
> > [ConsumerFetcherManager-1378218012760] Stopping all fetchers
> > 2013-09-03 14:20:38,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ConsumerFetcherManager)  -
> > [ConsumerFetcherManager-1378218012760] Stopping leader finder thread
> > 2013-09-03 14:20:38,877
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ZookeeperConsumerConnector)  -
> > [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506],
> > Rebalancing attempt failed. Clearing the cache before the next
> rebalancing
> > operation is triggered
> > 2013-09-03 14:20:38,876
> >
> >
> [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506_watcher_executor]
> > INFO  (kafka.consumer.ZookeeperConsumerConnector)  -
> > [mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506],
> end
> > rebalancing consumer
> > mirrormakerProd_ops-mmrs1-1-asg.ops.sfdc.net-1378218012575-6779d506 try
> #0
> > 2013-09-01 05:59:29,069 [main-SendThread(
> > mandm-zookeeper-asg.data.sfdc.net:2181)] INFO
> >  (org.apache.zookeeper.ClientCnxn)  - Socket connection established to
> > mandm-zookeeper-asg.data.sfdc.net/10.228.48.38:2181, initiating session
> > 2013-09-01 05:59:29,069 [main-SendThread(
> > mandm-zookeeper-asg.data.sfdc.net:2181)] INFO
> >  (org.apache.zookeeper.ClientCnxn)  - Opening socket connection to server
> > mandm-zookeeper-asg.data.sfdc.net/10.228.48.38:2181
> > 2013-09-01 05:59:27,792 [main-EventThread] INFO
> >  (org.I0Itec.zkclient.ZkClient)  - zookeeper state changed (Disconnected)
> > 2013-09-01 05:59:27,692 [main-SendThread(
> > mandm-zookeeper-asg.data.sfdc.net:2181)] INFO
> >  (org.apache.zookeeper.ClientCnxn)  - Client session timed out, have not
> > heard from server in 4002ms for sessionid 0x140c603da5b0032, closing
> socket
> > connection and attempting reconnect
> >
> >
> > As you can see, no log lines appeared after 2013-09-01 05:59:29. I
> checked
> > lag using consumerOffsetChecker and observed that log size and lag is
> > growing, but offset of mirrormaker remains same. We have two mirrormaker
> > process running and both of them had same issue during same time frame..
> > Any hint on what could be problem..? How do we go about trouble shooting
> > this..?
> >
> > Thanks in advance..
> >
> > --
> > Thanks,
> > Raja.
> >
>



-- 
Thanks,
Raja.

Reply via email to