Hi, I've run into some similar problems, posted in another thread
<http://mail-archives.apache.org/mod_mbox/kafka-users/201508.mbox/%3CCAJY8EoeqCpxFirQzLYiKk5Gz4bqQrVv3Z_WGx8T5qPnk3VMAbA%40mail.gmail.com%3E>
(I just pasted some broker logs; please check the latest message, which the
mail archiver didn't catch up with). I also checked KAFKA-2108, which is now
closed. Were any conclusions reached there?
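
By the way, since the thread below starts with a controlled shutdown that
never completed: before bouncing a broker, the shutdown retry settings are
worth double-checking. A minimal sketch, assuming 0.8.x and the usual
config/server.properties location (values here are illustrative, not a
recommendation):

    # config/server.properties
    controlled.shutdown.enable=true
    # give leadership migration more chances before giving up
    controlled.shutdown.max.retries=5
    controlled.shutdown.retry.backoff.ms=10000

With low retry settings, a broker that can't move leadership off itself (for
example because one partition has no LeaderAndIsr info, like the one below)
gives up after a few attempts and proceeds with a shutdown that is effectively
unclean for those partitions.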
If someone needs more detailed logs, I think I have all of them.

2015-04-09 1:52 GMT+08:00 Thunder Stumpges <tstump...@ntent.com>:

> Created issue https://issues.apache.org/jira/browse/KAFKA-2108
>
> Cheers,
> Thunder
>
> -----Original Message-----
> From: Jason Rosenberg [mailto:j...@squareup.com]
> Sent: Tuesday, April 07, 2015 12:05 PM
> To: users@kafka.apache.org
> Subject: Re: Problem with node after restart no partitions?
>
> Thunder, thanks for the detailed info. I can confirm that our incident
> had the same (or similar) sequence of messages, when the first upgraded
> broker restarted (after having undergone an unclean shutdown).
>
> I think it makes sense at this point to file a jira issue to track it.
> (Could mostly just copy-paste the details in this thread.) I can then
> add my notes to the ticket (or maybe file a separate one).
>
> My main question at this point is whether the issue is related
> specifically to an unclean shutdown during an upgrade to 0.8.2.1 (or if
> it's just unclean shutdowns in general with 0.8.1.1+ not recovering
> properly). Any thoughts, jiangjie?
>
> Jason
>
> On Mon, Apr 6, 2015 at 7:30 PM, Thunder Stumpges <tstump...@ntent.com>
> wrote:
>
> > Hi all,
> >
> > I have collected our logs and pared them down. Jiangjie, I sent you
> > more complete logs directly to your mailto from the list (minus the
> > .INVALID). Please let me know if you did not get them. I cannot post
> > them here in public.
> >
> > For others, and to continue the conversation: during the initial
> > shutdown we failed a clean shutdown due to one corrupt topic. The
> > message in server.log was:
> >
> > [2015-03-31 10:21:46,250] INFO [Kafka Server 6], Remaining partitions to
> > move: [__samza_checkpoint_ver_1_for_usersessions_1,0]
> > (kafka.server.KafkaServer)
> > [2015-03-31 10:21:46,250] INFO [Kafka Server 6], Error code from
> > controller: 0 (kafka.server.KafkaServer)
> >
> > And the related messages in state-change.log:
> >
> > [2015-03-31 10:21:42,622] TRACE Controller 6 epoch 23 started leader
> > election for partition [__samza_checkpoint_ver_1_for_usersessions_1,0]
> > (state.change.logger)
> > [2015-03-31 10:21:42,623] ERROR Controller 6 epoch 23 encountered
> > error while electing leader for partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] due to: LeaderAndIsr
> > information doesn't exist for partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] in OnlinePartition state.
> > (state.change.logger)
> > [2015-03-31 10:21:42,623] TRACE Controller 6 epoch 23 received
> > response correlationId 2360 for a request sent to broker
> > id:8,host:xxxxxxx,port:9092 (state.change.logger)
> > [2015-03-31 10:21:42,623] ERROR Controller 6 epoch 23 initiated state
> > change for partition [__samza_checkpoint_ver_1_for_usersessions_1,0]
> > from OnlinePartition to OnlinePartition failed (state.change.logger)
> > kafka.common.StateChangeFailedException: encountered error while
> > electing leader for partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] due to: LeaderAndIsr
> > information doesn't exist for partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] in OnlinePartition state.
> >     at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:360)
> >     at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:187)
> >     at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:125)
> >     at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:124)
> >     at scala.collection.immutable.Set$Set1.foreach(Set.scala:86)
> >     at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:124)
> >     at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply(KafkaController.scala:257)
> >     at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1$$anonfun$apply$mcV$sp$3.apply(KafkaController.scala:253)
> >     at scala.Option.foreach(Option.scala:197)
> >     at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply$mcV$sp(KafkaController.scala:253)
> >     at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply(KafkaController.scala:253)
> >     at kafka.controller.KafkaController$$anonfun$shutdownBroker$3$$anonfun$apply$1.apply(KafkaController.scala:253)
> >     at kafka.utils.Utils$.inLock(Utils.scala:538)
> >     at kafka.controller.KafkaController$$anonfun$shutdownBroker$3.apply(KafkaController.scala:252)
> >     at kafka.controller.KafkaController$$anonfun$shutdownBroker$3.apply(KafkaController.scala:249)
> >     at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:130)
> >     at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:275)
> >     at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:275)
> >     at kafka.controller.KafkaController.shutdownBroker(KafkaController.scala:249)
> >     at kafka.server.KafkaApis.handleControlledShutdownRequest(KafkaApis.scala:264)
> >     at kafka.server.KafkaApis.handle(KafkaApis.scala:192)
> >     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
> >     at java.lang.Thread.run(Thread.java:744)
> > Caused by: kafka.common.StateChangeFailedException: LeaderAndIsr
> > information doesn't exist for partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] in OnlinePartition state
> >     at kafka.controller.PartitionStateMachine.getLeaderIsrAndEpochOrThrowException(PartitionStateMachine.scala:391)
> >     at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:324)
> >     ... 22 more
> >
> >
> > Then when starting up, we got many angry messages like these:
> >
> > [2015-03-31 10:24:10,960] WARN [Replica Manager on Broker 6]: Fetch
> > request with correlation id 159883938 from client ReplicaFetcherThread-0-6
> > on partition [__samza_checkpoint_ver_1_for_usersessions_1,0] failed due to
> > Partition [__samza_checkpoint_ver_1_for_usersessions_1,0] doesn't exist on
> > 6 (kafka.server.ReplicaManager)
> > [2015-03-31 10:24:10,989] WARN [Replica Manager on Broker 6]: While
> > recording the follower position, the partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] hasn't been created,
> > skip updating leader HW (kafka.server.ReplicaManager)
> > [2015-03-31 10:24:10,989] WARN [Replica Manager on Broker 6]: While
> > recording the follower position, the partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] hasn't been created,
> > skip updating leader HW (kafka.server.ReplicaManager)
> >
> > Then they change to these:
> >
> > [2015-03-31 10:24:12,008] WARN [Replica Manager on Broker 6]: Fetch
> > request with correlation id 159884643 from client ReplicaFetcherThread-0-6
> > on partition [__samza_checkpoint_ver_1_for_usersessions_1,0] failed due to
> > Partition [__samza_checkpoint_ver_1_for_usersessions_1,0] doesn't exist on
> > 6 (kafka.server.ReplicaManager)
> > [2015-03-31 10:24:12,009] WARN [Replica Manager on Broker 6]: While
> > recording the follower position, the partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] hasn't been created,
> > skip updating leader HW (kafka.server.ReplicaManager)
> > [2015-03-31 10:24:12,009] WARN [Replica Manager on Broker 6]: Fetch
> > request with correlation id 158861852 from client ReplicaFetcherThread-0-6
> > on partition [__samza_checkpoint_ver_1_for_usersessions_1,0] failed due to
> > Partition [__samza_checkpoint_ver_1_for_usersessions_1,0] doesn't exist on
> > 6 (kafka.server.ReplicaManager)
> > [2015-03-31 10:24:12,009] WARN [Replica Manager on Broker 6]: While
> > recording the follower position, the partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0] hasn't been created,
> > skip updating leader HW (kafka.server.ReplicaManager)
> >
> > And a bunch like this:
> >
> > [2015-03-31 10:24:12,019] ERROR [KafkaApi-6] error when handling request
> > Name: FetchRequest; Version: 0; CorrelationId: 158861855; ClientId:
> > ReplicaFetcherThread-0-6; ReplicaId: 8; MaxWait: 500 ms; MinBytes: 1 bytes;
> > RequestInfo: [__samza_checkpoint_ver_1_for_usersessions_1,0] ->
> > PartitionFetchInfo(4461,1048576) (kafka.server.KafkaApis)
> > kafka.common.NotAssignedReplicaException: Leader 6 failed to record
> > follower 8's position -1 since the replica is not recognized to be one
> > of the assigned replicas for partition
> > [__samza_checkpoint_ver_1_for_usersessions_1,0]
> >     at kafka.server.ReplicaManager.updateReplicaLEOAndPartitionHW(ReplicaManager.scala:574)
> >     at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:388)
> >     at kafka.server.KafkaApis$$anonfun$recordFollowerLogEndOffsets$2.apply(KafkaApis.scala:386)
> >     at scala.collection.immutable.MapLike$$anon$2$$anonfun$foreach$3.apply(MapLike.scala:109)
> >     at scala.collection.immutable.MapLike$$anon$2$$anonfun$foreach$3.apply(MapLike.scala:109)
> >     at scala.collection.immutable.Map$Map1.foreach(Map.scala:119)
> >     at scala.collection.immutable.MapLike$$anon$2.foreach(MapLike.scala:109)
> >     at kafka.server.KafkaApis.recordFollowerLogEndOffsets(KafkaApis.scala:386)
> >     at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:351)
> >     at kafka.server.KafkaApis.handle(KafkaApis.scala:60)
> >     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
> >     at java.lang.Thread.run(Thread.java:744)
> >
> >
> > And it starts truncating the logs (not to 0; guessing this is part of
> > becoming a follower?) ...
> >
> > [2015-03-31 10:24:15,897] INFO [ReplicaFetcherManager on broker 6]
> > Removed fetcher for partitions
> > [my-store-changelog,1],[stage_vsw.avrodto.addelivery.internalstats.trafficshaperlog,3],
> > + a bunch of partitions... (kafka.server.ReplicaFetcherManager)
> > [2015-03-31 10:24:16,013] INFO Truncating log my-store-changelog-1 to
> > offset 3736. (kafka.log.Log)
> > [2015-03-31 10:24:16,014] INFO Truncating log xyz.topic1-3 to offset
> > 2930. (kafka.log.Log)
> > [2015-03-31 10:24:16,014] INFO Truncating log xyz.topic2-2 to offset
> > 10176. (kafka.log.Log)
> >
> > This continues for many / all partitions, then... it starts deleting logs!?
> >
> > [2015-03-31 10:24:38,449] INFO [ReplicaFetcherManager on broker 6]
> > Removed fetcher for partitions [my-store-changelog,1],[xyz.topic1,1],...
> > (kafka.server.ReplicaFetcherManager)
> > [2015-03-31 10:24:40,700] INFO Scheduling log segment 6587 for log
> > xyz.topic1-2 for deletion. (kafka.log.Log)
> > [2015-03-31 10:24:40,701] INFO Scheduling log segment 6594 for log
> > xyz.topic1-2 for deletion. (kafka.log.Log)
> > [2015-03-31 10:24:40,701] INFO Scheduling log segment 6595 for log
> > xyz.topic1-2 for deletion. (kafka.log.Log)
> > [2015-03-31 10:24:40,702] INFO Scheduling log segment 203953 for log
> > xyz.topic2-4 for deletion. (kafka.log.Log)
> > [2015-03-31 10:24:40,702] INFO Scheduling log segment 210571 for log
> > xyz.topic2-4 for deletion. (kafka.log.Log)
> > [2015-03-31 10:24:40,702] INFO Scheduling log segment 211471 for log
> > xyz.topic2-4 for deletion. (kafka.log.Log)
> >
> > Then it starts actually deleting them... this goes on for a good 20
> > minutes...
> >
> > [2015-03-31 10:25:40,704] INFO Deleting segment 6587 from log
> > xyz.topic1-2. (kafka.log.Log)
> > [2015-03-31 10:25:40,716] INFO Deleting index
> > /data4/kafka-data/xyz.topic1-2/00000000000000006587.index.deleted
> > (kafka.log.OffsetIndex)
> > [2015-03-31 10:25:40,716] INFO Deleting segment 6594 from log
> > xyz.topic1-2. (kafka.log.Log)
> > [2015-03-31 10:25:40,717] INFO Deleting index
> > /data4/kafka-data/xyz.topic1-2/00000000000000006594.index.deleted
> > (kafka.log.OffsetIndex)
> > [2015-03-31 10:25:40,717] INFO Deleting segment 6595 from log
> > xyz.topic1-2. (kafka.log.Log)
> > [2015-03-31 10:25:40,717] INFO Deleting index
> > /data4/kafka-data/xyz.topic1-2/00000000000000006595.index.deleted
> > (kafka.log.OffsetIndex)
> > [2015-03-31 10:25:40,717] INFO Deleting segment 203953 from log
> > xyz.topic2-4. (kafka.log.Log)
> > [2015-03-31 10:25:40,722] INFO Deleting segment 210571 from log
> > xyz.topic2-4. (kafka.log.Log)
> > [2015-03-31 10:25:40,729] INFO Deleting index
> > /data4/kafka-data/xyz.topic2-4/00000000000000203953.index.deleted
> > (kafka.log.OffsetIndex)
> > [2015-03-31 10:25:40,729] INFO Deleting segment 211471 from log
> > xyz.topic2-4. (kafka.log.Log)
> >
> > I don't know that we knew what was happening exactly at this time,
> > only that it was not syncing up with the others. I think the
> > sys-engineer stopped it after about 20 minutes to see what was
> > wrong... I think by this point the damage was done.
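
(Interjecting with a note from our own debugging of the truncate/delete
behavior quoted above: those decisions are driven by the high-watermark
checkpoint each broker keeps per log directory. A quick way to see what the
broker thinks it had checkpointed, assuming the /data4/kafka-data layout from
the logs above; as far as I know the file format is a version line, an entry
count, then one "topic partition offset" per line:

    # inspect the checkpointed high watermarks for one log dir
    head -20 /data4/kafka-data/replication-offset-checkpoint
    # look for one specific partition, e.g. the samza checkpoint topic
    grep __samza_checkpoint_ver_1_for_usersessions_1 \
        /data4/kafka-data/replication-offset-checkpoint

If a partition is missing from this file on restart, the broker logs the "No
checkpointed highwatermark" warning that shows up later in this thread and
starts from a high watermark of 0, which can lead to exactly this kind of
truncation.)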
> > And actually in these logs, I don't see any of the "No checkpointed
> > highwatermark is found for partition" messages in this sequence, so
> > I'm not sure what's up.
> >
> > Any ideas? It makes us more than a little nervous to restart nodes if
> > they will just "decide" to delete segments. Under what conditions
> > would this happen?
> >
> > Thanks!
> > Thunder
> >
> >
> > -----Original Message-----
> > From: Thunder Stumpges [mailto:tstump...@ntent.com]
> > Sent: Friday, April 03, 2015 12:10 PM
> > To: users@kafka.apache.org
> > Subject: RE: Problem with node after restart no partitions?
> >
> > Likewise, I was not "at the wheel" when this was happening, and there
> > very well could have been a similar situation of not waiting for a
> > controlled shutdown to complete successfully.
> >
> > Fortunately we did not end up in exactly your situation where the
> > entire cluster went down, but I can say I know we never had more than
> > one of the three nodes shut down during this situation, and twice in
> > two days we lost all data on the node and it had to re-sync ALL of its
> > data over again.
> >
> > I am in the process of trying to pull the controller and server logs
> > for the first portion of our issue right now. Will follow up when they
> > are available.
> >
> > Thanks,
> > Thunder
> >
> >
> > -----Original Message-----
> > From: Jason Rosenberg [mailto:j...@squareup.com]
> > Sent: Friday, April 03, 2015 10:50 AM
> > To: users@kafka.apache.org
> > Subject: Re: Problem with node after restart no partitions?
> >
> > I will provide what I can (we don't have separate logs for controller,
> > etc.; it's all integrated in a single log with log4j for us, since we
> > embed Kafka in a java container). Are there specific log classes you'd
> > be interested in seeing? (I can look at the default log4j configs to
> > see what's set up normally for the 'controller' log.)
> >
> > We have most of the logs saved away (but not all of them).
> >
> > Sorry, I didn't mean to hijack this thread (so will start a new thread
> > soonly).
> >
> > Jason
> >
> > On Fri, Apr 3, 2015 at 1:37 PM, Jiangjie Qin <j...@linkedin.com.invalid>
> > wrote:
> >
> > > This sounds like a very serious issue. Could you provide the
> > > controller log and the log for the first broker on which you tried
> > > controlled shutdown and upgrade?
> > >
> > > On 4/3/15, 8:57 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
> > >
> > > > I'm preparing a longer post here, but we recently ran into a
> > > > similar scenario. Not sure yet if it's the same thing you saw (but
> > > > it feels similar). We were also doing a rolling upgrade from
> > > > 0.8.1.1 to 0.8.2.1, and during the controlled shutdown of the first
> > > > node (of a 4 node cluster), the controlled shutdown was taking
> > > > longer than normal (it timed out several times and was retrying
> > > > controlled shutdown), and unfortunately, our deployment system
> > > > decided to kill it hard (so it was in the middle of its 4th
> > > > controlled shutdown retry, etc.).
> > > >
> > > > Anyway, when the node came back, it naturally decided to 'restore'
> > > > most of its partitions, which took some time (but only about 5
> > > > minutes). What's weird is it didn't decide to resync data from
> > > > other replicas; instead it just restored partitions locally.
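
(Another inline note: when a broker comes back and only "restores locally"
like this, the quickest way I know to see what it is still missing is the
topics tool's under-replication filter. A sketch, assuming a 0.8.x install in
/opt/kafka and a reachable zookeeper at zk1:2181; adjust paths and hosts for
your setup:

    /opt/kafka/bin/kafka-topics.sh --zookeeper zk1:2181 \
        --describe --under-replicated-partitions

Any partition listed there with the restarted broker in its replica list but
not in the ISR is one that broker has not caught up on yet.)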
> > > > During this time, the rest of the cluster failed to elect any new
> > > > leaders, and so for 5 minutes those partitions were unavailable
> > > > (and we saw a flood of failed FetcherManager exceptions from the
> > > > other nodes in the cluster). Most of the partitions were empty
> > > > (e.g. there's no way the other replicas were behind and not in the
> > > > ISR normally). During this 5 minutes, producers were unable to
> > > > send messages due to NotLeaderForPartition exceptions. Apparently
> > > > the controller was still sending them to the unavailable broker.
> > > >
> > > > Finally, when the first node finally came up, the other nodes were
> > > > somewhat happy again (but a few partitions remained
> > > > under-replicated indefinitely).
> > > > Because of this, we decided to pause the rolling restart and try
> > > > to wait for the under-replicated partitions to get in sync.
> > > > Unfortunately, about an hour later, the whole cluster went foobar
> > > > (e.g. partitions became unavailable, brokers logged a flood of
> > > > Fetcher errors, producers couldn't find a valid leader, metadata
> > > > requests timed out, etc.). In a panic, we reverted that first node
> > > > back to 0.8.1.1. This did not help, unfortunately, so, deciding
> > > > we'd already probably lost data at this point (and producers could
> > > > not send data due to NotLeaderForPartition exceptions), we decided
> > > > to just forcibly do the upgrade to 0.8.2.1. This was all a bad
> > > > situation, of course.
> > > >
> > > > So, now we have the cluster stable at 0.8.2.1, but like you, we are
> > > > very, very nervous about doing any kind of restart to any of our
> > > > nodes. We lost data, primarily in the form of producers failing to
> > > > send during the periods of unavailability.
> > > >
> > > > It looks like the root cause, in our case, was a flood of topics
> > > > created (long since unused and empty). This appears to have caused
> > > > the longer-than-normal controlled shutdown, which in turn led to
> > > > the follow-on problems.
> > > > However, in the past, we've seen a controlled shutdown failure
> > > > result in an unclean shutdown, but usually the cluster recovers
> > > > (e.g. it elects new leaders, and when the new node comes back, it
> > > > recovers its partitions that were uncleanly shut down). That did
> > > > not happen this time (the rest of the cluster got into an apparent
> > > > infinite loop where it tried repeatedly (e.g. 500K times a minute)
> > > > to fetch partitions that were unavailable).
> > > >
> > > > I'm preparing a longer post with more detail (will take a bit of time).
> > > >
> > > > Jason
> > > >
> > > > On Thu, Apr 2, 2015 at 10:19 PM, Gwen Shapira <gshap...@cloudera.com>
> > > > wrote:
> > > >
> > > >> wow, that's scary for sure.
> > > >>
> > > >> Just to be clear - all you did is restart *one* broker in the cluster?
> > > >> everything else was ok before the restart? and that was
> > > >> controlled shutdown?
> > > >>
> > > >> Gwen
> > > >>
> > > >> On Wed, Apr 1, 2015 at 11:54 AM, Thunder Stumpges
> > > >> <tstump...@ntent.com> wrote:
> > > >>
> > > >> > Well it appears we lost all the data on the one node again.
> > > >> > It appears to be all or part of KAFKA-1647
> > > >> > <https://issues.apache.org/jira/browse/KAFKA-1647>, as we saw
> > > >> > this in our logs (for all topics):
> > > >> >
> > > >> > [2015-04-01 10:46:58,901] WARN Partition [logactivity-redirect,3]
> > > >> > on broker 6: No checkpointed highwatermark is found for partition
> > > >> > [logactivity-redirect,3] (kafka.cluster.Partition)
> > > >> > [2015-04-01 10:46:58,902] WARN Partition [pageimpression,1] on
> > > >> > broker 6: No checkpointed highwatermark is found for partition
> > > >> > [pageimpression,1] (kafka.cluster.Partition)
> > > >> > [2015-04-01 10:46:58,904] WARN Partition [campaignplatformtarget,6]
> > > >> > on broker 6: No checkpointed highwatermark is found for partition
> > > >> > [campaignplatformtarget,6] (kafka.cluster.Partition)
> > > >> > [2015-04-01 10:46:58,905] WARN Partition [trackingtags-c2,1] on
> > > >> > broker 6: No checkpointed highwatermark is found for partition
> > > >> > [trackingtags-c2,1] (kafka.cluster.Partition)
> > > >> >
> > > >> > Followed by:
> > > >> >
> > > >> > [2015-04-01 10:46:58,911] INFO Truncating log trafficshaperlog-3
> > > >> > to offset 0. (kafka.log.Log)
> > > >> > [2015-04-01 10:46:58,928] INFO Truncating log videorecrequest-0
> > > >> > to offset 0. (kafka.log.Log)
> > > >> > [2015-04-01 10:46:58,928] INFO Truncating log filteredredirect-2
> > > >> > to offset 0. (kafka.log.Log)
> > > >> > [2015-04-01 10:46:58,985] INFO Truncating log precheckrequest-3
> > > >> > to offset 0. (kafka.log.Log)
> > > >> > [2015-04-01 10:46:58,990] INFO Truncating log filteredclicklog-8
> > > >> > to offset 0. (kafka.log.Log)
> > > >> >
> > > >> > Followed by:
> > > >> >
> > > >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90276704
> > > >> > for log pageview-0 for deletion. (kafka.log.Log)
> > > >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90593329
> > > >> > for log pageview-0 for deletion. (kafka.log.Log)
> > > >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90906048
> > > >> > for log pageview-0 for deletion. (kafka.log.Log)
> > > >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 91226773
> > > >> > for log pageview-0 for deletion. (kafka.log.Log)
> > > >> >
> > > >> > The strange thing, however, is that I don't believe we ever had
> > > >> > the scenario mentioned in the bug (all brokers for a topic down).
> > > >> > As we have 0 unclean leader elections, our applications never
> > > >> > complained the cluster was down, and we never actually stopped
> > > >> > more than one node (the node in question).
> > > >> >
> > > >> > Anyway, I'm really unsure if this could have somehow been
> > > >> > related to the attempted upgrade to 0.8.2.1 (the occurrence this
> > > >> > morning was NOT; it was a simple stop and restart of the broker).
> > > >> >
> > > >> > We are really nervous to do any kafka service restarts now that
> > > >> > this has happened twice on this machine.
> > > >> > Any suggestions? Should we go back and make another attempt to
> > > >> > upgrade to 0.8.2.1?
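
(One more inline note, since "0 unclean leader elections" comes up here: the
controller exposes that count over JMX, so it can be watched live during a
restart. A sketch using the JmxTool class that ships with 0.8.x, assuming JMX
is enabled on port 9999 on the broker host; the object name below is the one
I believe 0.8.2 uses, so double-check it against your own MBean browser:

    bin/kafka-run-class.sh kafka.tools.JmxTool \
        --jmx-url service:jmx:rmi:///jndi/rmi://broker6:9999/jmxrmi \
        --object-name 'kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec'

Note this only counts elections the controller actually performed; it says
nothing about a broker locally truncating to 0 the way the logs above show.)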
> > > >> >
> > > >> > Thanks,
> > > >> > Thunder
> > > >> >
> > > >> >
> > > >> > From: Thunder Stumpges
> > > >> > Sent: Wednesday, April 01, 2015 10:30 AM
> > > >> > To: kafka-us...@incubator.apache.org
> > > >> > Subject: Problem with node after restart no partitions?
> > > >> >
> > > >> > Hi group,
> > > >> >
> > > >> > We've been trying to track down a problem this morning for a
> > > >> > little while, and thought I'd ask here while we keep looking.
> > > >> >
> > > >> > We have 3 nodes (rep-3) running 0.8.1.1. We attempted a rolling
> > > >> > upgrade yesterday to 0.8.2.1, and on the first node, after
> > > >> > restarting, a single topic (a samza intermediate topic) started
> > > >> > throwing replica fetcher errors over and over
> > > >> > ("NotLeaderForPartition"). There may or may not have been other
> > > >> > things attempted at this time (not by me, so I cannot say for
> > > >> > sure). Anyway, we ended up rolling back to 0.8.1.1, and ALL data
> > > >> > had been DELETED from that node. It spent most of yesterday
> > > >> > re-syncing, came into sync last night, and a rebalance made
> > > >> > everything run smoothly (except for these damn replica fetcher
> > > >> > errors for that one partition).
> > > >> >
> > > >> > Today my colleague attempted the "unsupported" topic delete
> > > >> > command for the "bad" partition, and bounced that one
> > > >> > troublesome node.
> > > >> >
> > > >> > Upon coming up, I can see in server.log that it is reading in
> > > >> > all of the segments, and then it starts spitting out a samza
> > > >> > topic fetch error, and through JMX the
> > > >> > "ReplicaManager".LeaderCount is 0. It is not attempting to
> > > >> > fetch or load any topics.
> > > >> >
> > > >> > The other two brokers are showing under-replicated (obviously).
> > > >> > What is going wrong? How can we get that samza topic really and
> > > >> > truly gone? (If that is the cause of the broker not coming up.)
> > > >> >
> > > >> > Thanks,
> > > >> > Thunder
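
(Closing note on the "how can we get that samza topic really and truly gone"
question at the bottom: on 0.8.1.x, delete-topic support is incomplete, and a
half-deleted topic can leave state behind in zookeeper even after the data
directories are cleaned. A sketch of what to inspect, assuming the standard
zookeeper CLI, a chroot-less zookeeper at zk1:2181, and the standard Kafka
znode layout:

    # does the cluster still know about the topic?
    zkCli.sh -server zk1:2181 ls /brokers/topics
    # is a delete still pending? (this znode is used by 0.8.2+)
    zkCli.sh -server zk1:2181 ls /admin/delete_topics
    # leader/ISR state for one partition of the topic
    zkCli.sh -server zk1:2181 get \
        /brokers/topics/__samza_checkpoint_ver_1_for_usersessions_1/partitions/0/state

If the topic znode is still there but has no leader/ISR state under it, that
would match the "LeaderAndIsr information doesn't exist" controller error at
the top of this thread.)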