Any suggestions on what might be going on here? We are very much blinded here and our application is getting affected due to this.
-Mohit

On Tue, Dec 9, 2014 at 8:41 PM, Mohit Kathuria <mkathu...@sprinklr.com> wrote:

> Neha,
>
> The same issue reoccurred with just 2 consumer processes. The exception was related to a conflict in writing the ephemeral node. Below is the exception. The topic name is "lst_plugin_com.spr.listening.plugin.impl.plugins.SemantriaEnrichmentPlugin" with 30 partitions. The 2 processes were running on 2 servers with ips 10.0.8.222 and 10.0.8.225.
>
> 2014-12-09 13:22:11 k.u.ZkUtils$ [INFO] I wrote this conflicted ephemeral node [{"version":1,"subscription":{"lst_plugin_com.spr.listening.plugin.impl.plugins.SemantriaEnrichmentPlugin":5},"pattern":"static","timestamp":"1417964160024"}] at /consumers/lst_plugin_com.spr.listening.plugin.impl.plugins.SemantriaEnrichmentPlugin/ids/lst_plugin_com.spr.listening.plugin.impl.plugins.SemantriaEnrichmentPlugin_ip-10-0-8-222-1417963753598-b19de58d a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry
>
> Attached are the complete error logs. The exception occurred after the rebalance failed even after 40 retries. The rebalance failed because the process already owning some of the partitions did not give up ownership due to the conflicting ephemeral nodes. As you suggested, we ran the wchp command on the 3 zookeeper nodes at this time and figured out that the watcher was registered for only one of the processes. I am copying the kafka consumer watcher registered on one of the zookeeper servers. (Attached are the wchp outputs of all 3 zk servers.)
>
> $ echo wchp | nc localhost 2181
>
> /kafka/consumers/lst_plugin_com.spr.listening.plugin.impl.plugins.SemantriaEnrichmentPlugin/ids
>     0x34a175e1d5d0130
>
> "0x34a175e1d5d0130" is the session id that owns the ephemeral node. I went back to the zookeeper shell and checked the consumers registered for this topic and consumer group (same as the topic name). I am attaching the output in zkCommands.txt. It clearly shows that
>
> 10.0.8.222 has ephemeralOwner = 0x34a175e1d5d0130
> 10.0.8.225 has ephemeralOwner = 0x34a175e1d5d0127
>
> I think the issue here is that both consumers have written their ephemeral nodes, but a watcher is registered for only one of the 2 ephemeral nodes. The root cause seems to be an inconsistent state while writing the ephemeral nodes in ZK.
>
> Let me know if you need more details.
>
> -Thanks,
> Mohit
>
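The ephemeralOwner comparison described above can also be done programmatically. Below is a minimal sketch using the plain ZooKeeper Java client; the connect string (including the /kafka chroot visible in the wchp output), the session timeout and the class name are illustrative assumptions, not taken from the thread. It lists the znodes under /consumers/<group>/ids and prints the session id owning each one.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConsumerIdOwners {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string; the /kafka chroot matches the wchp output above.
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("10.0.8.222:2181/kafka", 30000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await(10, TimeUnit.SECONDS);

            String idsPath = "/consumers/"
                    + "lst_plugin_com.spr.listening.plugin.impl.plugins.SemantriaEnrichmentPlugin"
                    + "/ids";
            try {
                for (String id : zk.getChildren(idsPath, false)) {
                    Stat stat = new Stat();
                    // getData fills the Stat, including the owning session id.
                    zk.getData(idsPath + "/" + id, false, stat);
                    System.out.printf("%s ephemeralOwner=0x%x%n", id, stat.getEphemeralOwner());
                }
            } finally {
                zk.close();
            }
        }
    }

Comparing these owner session ids with the sessions shown by wchp tells you which consumer's watch is missing, which is the check done manually above.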
> On Mon, Nov 10, 2014 at 8:46 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>> A rebalance should trigger on all consumers when you add a new consumer to the group. If you don't see the zookeeper watch fire, the consumer may have somehow lost the watch. We have seen this behavior on older zk versions; I wonder if that bug got reintroduced. To verify if this is the case, you can run the wchp zookeeper command on the zk leader and check if each consumer has a watch registered.
>>
>> Do you have a way to try this on zk 3.3.4? I would recommend you try the wchp suggestion as well.
>>
>> On Fri, Nov 7, 2014 at 6:07 AM, Mohit Kathuria <mkathu...@sprinklr.com> wrote:
>>
>> > Hi all,
>> >
>> > Can someone help here? We are getting a constant rebalance failure each time a consumer is added beyond a certain number. We did quite a lot of debugging on this and are still not able to figure out the pattern.
>> >
>> > -Thanks,
>> > Mohit
>> >
>> > On Mon, Nov 3, 2014 at 10:53 PM, Mohit Kathuria <mkathu...@sprinklr.com> wrote:
>> >
>> > > Neha,
>> > >
>> > > This looks like an issue with the consumer rebalance not being able to complete successfully. We were able to reproduce the issue on a topic with 30 partitions, 3 consumer processes (p1, p2 and p3), and the properties rebalance.max.retries = 40 and rebalance.backoff.ms = 10000 (10s).
>> > >
>> > > Before process p3 was started, partition ownership was as expected:
>> > >
>> > > partitions 0-14 owned by p1
>> > > partitions 15-29 owned by p2
>> > >
>> > > As process p3 started, a rebalance was triggered. Process p3 was successfully able to acquire ownership of partitions 20-29, as expected per the rebalance algorithm. However, process p2, while trying to acquire ownership of partitions 10-19, saw the rebalance fail after 40 retries.
>> > >
>> > > Attaching the logs from process p2 and process p1. They show that p2 was attempting to rebalance and was trying to acquire ownership of partitions 10-14, which were owned by process p1. However, at the same time process p1 did not get any event for giving up ownership of partitions 10-14. We were expecting a rebalance to have been triggered in p1 - but it wasn't, and hence it did not give up ownership. Is our assumption correct or incorrect? And if the rebalance does get triggered in p1, how do we figure that out apart from the logs, as the logs on p1 did not show anything?
>> > >
>> > > 2014-11-03 06:57:36 k.c.ZookeeperConsumerConnector [INFO] [topic_consumerIdString], waiting for the partition ownership to be deleted: 11
>> > >
>> > > During and after the failed rebalance on process p2, partition ownership was as below:
>> > >
>> > > 0-14 -> owner p1
>> > > 15-19 -> none
>> > > 20-29 -> owner p3
>> > >
>> > > This left the consumers in an inconsistent state, as 5 partitions were never consumed from and the partition ownership was not balanced either.
>> > >
>> > > However, unlike last time, there was no conflict in creating the ephemeral node. Just to note, the ephemeral node conflict which we were seeing earlier also appeared after a rebalance failed. My hunch is that fixing the rebalance failure will fix that issue as well.
>> > >
>> > > -Thanks,
>> > > Mohit
>> > >
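The "expected" ownership in the mail above follows from the range-style split that the high-level consumer's rebalance computes. Below is a minimal sketch of that arithmetic, simplified to one stream per process; the real code works on sorted consumer-thread ids, so this is illustrative rather than the actual Kafka implementation, and the class and method names are made up for the example.

    // Range-style split: sorted partitions are divided into contiguous blocks,
    // one block per consumer; the first (numPartitions % numConsumers) consumers
    // get one extra partition.
    public class RangeAssignmentSketch {

        static void printAssignment(int numPartitions, int numConsumers) {
            int perConsumer = numPartitions / numConsumers;
            int extras = numPartitions % numConsumers;
            for (int i = 0; i < numConsumers; i++) {
                int start = perConsumer * i + Math.min(i, extras);
                int count = perConsumer + (i < extras ? 1 : 0);
                System.out.printf("consumer p%d -> partitions %d-%d%n",
                        i + 1, start, start + count - 1);
            }
        }

        public static void main(String[] args) {
            printAssignment(30, 2); // p1 -> 0-14, p2 -> 15-29
            printAssignment(30, 3); // p1 -> 0-9, p2 -> 10-19, p3 -> 20-29
        }
    }

With 30 partitions this gives 0-14/15-29 for two processes and 0-9/10-19/20-29 for three, which is why p2 has to take 10-14 from p1 and p3 has to take 20-29 from p2 during the rebalance described above.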
>> > > On Mon, Oct 20, 2014 at 7:48 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>> > >
>> > >> Mohit,
>> > >>
>> > >> I wonder if it is related to https://issues.apache.org/jira/browse/KAFKA-1585. When zookeeper expires a session, it doesn't delete the ephemeral nodes immediately. So if you end up trying to recreate ephemeral nodes quickly, the new node could either be in the valid latest session or from the previously expired session. If you hit this problem, then waiting would resolve it. But if not, then this may be a legitimate bug in ZK 3.4.6.
>> > >>
>> > >> Can you try shutting down all your consumers, waiting until the session timeout and restarting them?
>> > >>
>> > >> Thanks,
>> > >> Neha
>> > >>
>> > >> On Mon, Oct 20, 2014 at 6:15 AM, Mohit Kathuria <mkathu...@sprinklr.com> wrote:
>> > >>
>> > >> > Dear Experts,
>> > >> >
>> > >> > We recently updated to kafka v0.8.1.1 with zookeeper v3.4.5. I have a topic with 30 partitions and 2 replicas. We are using the high level consumer api. Each consumer process, which is a storm topology, has 5 streams, each of which connects to 1 or more partitions. We are not using storm's inbuilt kafka spout. Everything runs fine till the 5th consumer process (25 streams) is added for this topic.
>> > >> >
>> > >> > As soon as the sixth consumer process is added, the newly added process does not get ownership of the partitions that it requests, as the already existing owners have not yet given up their ownership.
>> > >> >
>> > >> > We changed certain properties on the consumer:
>> > >> >
>> > >> > 1. Max rebalance attempts - 20 (rebalance.backoff.ms * rebalance.max.retries >> zk connection timeout)
>> > >> > 2. Backoff ms between rebalances - 10000 (10 seconds)
>> > >> > 3. ZK connection timeout - 100,000 (100 seconds)
>> > >> >
>> > >> > When I look in the zookeeper shell while the rebalance is happening, the consumer is registered fine in zookeeper; it is just that the rebalance does not happen. After the 20th rebalance retry, we get:
>> > >> >
>> > >> > 2014-10-11 11:10:08 k.c.ZookeeperConsumerConnector [INFO] [rule-engine-feed_ip-10-0-2-170-1413025767369-4679959b], Committing all offsets after clearing the fetcher queues
>> > >> > 2014-10-11 11:10:10 c.s.m.k.i.c.KafkaFeedStreamer [WARN] Ignoring exception while trying to start streamer threads: rule-engine-feed_ip-10-0-2-170-1413025767369-4679959b can't rebalance after 20 retries
>> > >> > kafka.common.ConsumerRebalanceFailedException: rule-engine-feed_ip-10-0-2-170-1413025767369-4679959b can't rebalance after 20 retries
>> > >> >   at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432) ~[stormjar.jar:na]
>> > >> >   at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722) ~[stormjar.jar:na]
>> > >> >   at kafka.consumer.ZookeeperConsumerConnector.consume(ZookeeperConsumerConnector.scala:212) ~[stormjar.jar:na]
>> > >> >   at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:80) ~[stormjar.jar:na]
>> > >> >   at com.spr.messaging.kafka.impl.consumer.KafkaFeedStreamer.createAndStartThreads(KafkaFeedStreamer.java:79) ~[stormjar.jar:na]
>> > >> >   at com.spr.messaging.kafka.impl.consumer.KafkaFeedStreamer.startKafkaStreamThreadsIfNecessary(KafkaFeedStreamer.java:64) ~[stormjar.jar:na]
>> > >> >   at com.spr.messaging.kafka.impl.consumer.KafkaFeedConsumerFactoryImpl.startStreamerIfNotRunning(KafkaFeedConsumerFactoryImpl.java:71) [stormjar.jar:na]
>> > >> >   at com.spr.messaging.kafka.impl.consumer.KafkaFeedPullConsumerImpl.startStreamerIfNotRunning(KafkaFeedPullConsumerImpl.java:48) [stormjar.jar:na]
>> > >> >   at com.spr.messaging.kafka.impl.KafkaFeedServiceImpl.getKafkaFeedPullConsumer(KafkaFeedServiceImpl.java:63) [stormjar.jar:na]
>> > >> >   at com.spr.storm.topology.spout.AbstractSprKafkaSpout.nextTuple(AbstractSprKafkaSpout.java:121) [stormjar.jar:na]
>> > >> >   at backtype.storm.daemon.executor$eval3848$fn__3849$fn__3864$fn__3893.invoke(executor.clj:562) [na:0.9.1-incubating]
>> > >> >   at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433) [na:0.9.1-incubating]
>> > >> >   at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
>> > >> >   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
>> > >> > 2014-10-11 11:10:10 k.c.ZookeeperConsumerConnector [INFO] [rule-engine-feed_ip-10-0-2-170-1413025767369-4679959b], begin registering consumer rule-engine-feed_ip-10-0-2-170-1413025767369-4679959b in ZK
>> > >> > 2014-10-11 11:10:10 k.u.ZkUtils$ [INFO] conflict in /consumers/rule-engine-feed/ids/rule-engine-feed_ip-10-0-2-170-1413025767369-4679959b data: {"version":1,"subscription":{"rule-engine-feed":5},"pattern":"static","timestamp":"1413025810635"} stored data: {"version":1,"subscription":{"rule-engine-feed":5},"pattern":"static","timestamp":"1413025767483"}
>> > >> > 2014-10-11 11:10:10 k.u.ZkUtils$ [INFO] I wrote this conflicted ephemeral node [{"version":1,"subscription":{"rule-engine-feed":5},"pattern":"static","timestamp":"1413025810635"}] at /consumers/rule-engine-feed/ids/rule-engine-feed_ip-10-0-2-170-1413025767369-4679959b a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry
>> > >> >
>> > >> > Due to this error, none of the consumers consumes from the partitions in contention, which creates a sort of skewed lag on the kafka side. When the 6th consumer was added, the existing owner process of the partitions in question did not get rebalanced.
>> > >> >
>> > >> > Any help would be highly appreciated.
>> > >> >
>> > >> > -Thanks,
>> > >> > Mohit
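For context on where the properties mentioned earlier in this last mail are applied, here is a minimal sketch of creating the 0.8.x high-level consumer with those settings. The zookeeper connect string is a placeholder, the group id reuses "rule-engine-feed" from the logs above, and the values simply mirror the ones quoted (20 retries, 10 s backoff, 100 s connection timeout); this is not the poster's actual code.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class HighLevelConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181/kafka"); // placeholder
            props.put("group.id", "rule-engine-feed");
            // Settings discussed above: keep backoff * retries above the ZK timeouts.
            props.put("rebalance.max.retries", "20");
            props.put("rebalance.backoff.ms", "10000");
            props.put("zookeeper.connection.timeout.ms", "100000");

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            // 5 streams per process, as described above; each stream is normally
            // drained by its own thread.
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                    connector.createMessageStreams(Collections.singletonMap("rule-engine-feed", 5));
            System.out.println("streams: " + streams.get("rule-engine-feed").size());

            connector.shutdown();
        }
    }

createMessageStreams is the call at the top of the application frames in the stack trace above, so a rebalance that cannot complete during that call surfaces as the ConsumerRebalanceFailedException shown.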