Thanks for sharing, svante. We're also running 0.8.2.

Our cluster appears to be completely unusable at this point. We tried
restarting the "down" broker with a clean log directory, but it isn't making
any progress. It doesn't seem to be able to get topic data, which this
ZooKeeper message appears to confirm:

[ProcessThread(sid:5 cport:-1)::PrepRequestProcessor@645] - Got user-level
KeeperException when processing sessionid:0x54b0e251a5cd0ec type:setData
cxid:0x2b7ab zxid:0x100b9ad88 txntype:-1 reqpath:n/a Error
Path:/brokers/topics/mytopic/partitions/143/state Error:KeeperErrorCode =
BadVersion for /brokers/topics/mytopic/partitions/143/state
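
For anyone who wants to look at the same state, the znode named in that error
can be inspected directly with the ZooKeeper CLI. This is only a sketch;
zk1:2181 is a placeholder for one of your ZooKeeper hosts, and the path is
the one from the error above:

  # Connect to ZooKeeper and dump the leader/ISR state for the partition
  bin/zkCli.sh -server zk1:2181
  get /brokers/topics/mytopic/partitions/143/state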

It's probably worthwhile to note that we've disabled unclean leader
election.
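
For reference, that corresponds to this broker setting (a minimal sketch of
the relevant line in our server.properties; the key itself is the standard
one in 0.8.2):

  # server.properties on each broker: never elect a replica that is not in
  # the ISR as leader, even if that leaves the partition without a leader
  unclean.leader.election.enable=false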



On Thu, Feb 5, 2015 at 2:01 PM, svante karlsson <s...@csi.se> wrote:

> I believe I've had the same problem on 0.8.2 rc2. We had an idle test
> cluster with unknown health status, and I applied rc3 without checking
> whether everything was OK beforehand. Since that cluster had been doing
> nothing for a couple of days and the retention time was 48 hours, it's
> reasonable to assume that no actual data was left on the cluster. The same
> type of log messages was emitted in large volumes and never stopped. I
> then rebooted each ZooKeeper node in series. No change. Then I bumped each
> broker - no change. Finally, I took down all brokers at the same time.
>
> The logging stopped, but then one broker did not have any partitions in
> sync, including the internal consumer offsets topic that was living
> (with replicas=1) on that broker. I then bumped this broker once more, and
> the whole cluster came back in sync.
>
> I suspect that something related to zero-size topics caused this, since
> the cluster worked fine the week before during testing and again
> afterwards during more testing with rc3.
>
> 2015-02-05 19:22 GMT+01:00 Kyle Banker <kyleban...@gmail.com>:
>
> > Digging in a bit more, it appears that the "down" broker had likely
> > partially failed. Thus, it was still attempting to fetch offsets that no
> > longer exist. Does this make sense as an explanation of the
> > above-mentioned behavior?
> >
> > On Thu, Feb 5, 2015 at 10:58 AM, Kyle Banker <kyleban...@gmail.com> wrote:
> >
> > > Dug into this a bit more, and it turns out that we lost one of our 9
> > > brokers at the exact moment when this started happening. At the time
> > > that we lost the broker, we had no under-replicated partitions. Since
> > > the broker disappeared, we've had a fairly constant number of
> > > under-replicated partitions. This makes some sense, of course.
> > >
> > > Still, the log message doesn't.
> > >
> > > On Thu, Feb 5, 2015 at 10:39 AM, Kyle Banker <kyleban...@gmail.com> wrote:
> > >
> > >> I have a 9-node Kafka cluster, and all of the brokers just started
> > >> spouting the following error:
> > >>
> > >> ERROR [Replica Manager on Broker 1]: Error when processing fetch
> > >> request for partition [mytopic,57] offset 0 from follower with
> > >> correlation id 58166. Possible cause: Request for offset 0 but we only
> > >> have log segments in the range 39 to 39. (kafka.server.ReplicaManager)
> > >>
> > >> The "mytopic" topic has a replication factor of 3, and metrics are
> > >> showing a large number of under-replicated partitions.
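> > >>
> > >> A quick way to see exactly which partitions those are (sketch only;
> > >> zk1:2181 stands in for one of our ZooKeeper hosts):
> > >>
> > >>   # List every partition whose ISR is smaller than its replica set
> > >>   bin/kafka-topics.sh --zookeeper zk1:2181 --describe \
> > >>     --under-replicated-partitions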
> > >>
> > >> My assumption is that a log segment aged out but that the replicas
> > >> weren't aware of it.
> > >>
> > >> In any case, this problem isn't fixing itself, and the volume of log
> > >> messages of this type is enormous.
> > >>
> > >> What might have caused this? How does one resolve it?
> > >>
> > >
> > >
> >
>
