In our case unclean leader election was enabled. Since the cluster should have been empty, I can't really say whether or not we lost any data, but as I wrote earlier, I could not get the log messages to stop until I took down all brokers at the same time.
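For reference, the broker setting in question is set in server.properties; if I remember right it defaults to true on 0.8.2, so "enabled" in our case only means we never turned it off. Roughly:

    # server.properties (per broker)
    # true  = a replica that is not in the ISR may be elected leader (possible data loss)
    # false = the partition stays offline until an in-sync replica comes back
    unclean.leader.election.enable=true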
2015-02-05 22:16 GMT+01:00 Kyle Banker <kyleban...@gmail.com>:

> Thanks for sharing, svante. We're also running 0.8.2.
>
> Our cluster appears to be completely unusable at this point. We tried restarting the "down" broker with a clean log directory, and it's doing nothing. It doesn't seem to be able to get topic data, which this Zookeeper message appears to confirm:
>
> [ProcessThread(sid:5 cport:-1)::PrepRequestProcessor@645] - Got user-level KeeperException when processing sessionid:0x54b0e251a5cd0ec type:setData cxid:0x2b7ab zxid:0x100b9ad88 txntype:-1 reqpath:n/a Error Path:/brokers/topics/mytopic/partitions/143/state Error:KeeperErrorCode = BadVersion for /brokers/topics/mytopic/partitions/143/state
>
> It's probably worthwhile to note that we've disabled unclean leader election.
>
> On Thu, Feb 5, 2015 at 2:01 PM, svante karlsson <s...@csi.se> wrote:
>
> > I believe I've had the same problem on the 0.8.2 rc2. We had an idle test cluster with unknown health status and I applied rc3 without checking if everything was ok before. Since that cluster had been doing nothing for a couple of days and the retention time was 48 hours, it's reasonable to assume that no actual data was left on the cluster. The same type of log messages was emitted in large amounts and never stopped. I then rebooted each zookeeper in series. No change. Then I bumped each broker - no change. Finally I took down all brokers at the same time.
> >
> > The logging stopped, but then one broker did not have any partitions in sync, including the internal consumer offset topic that was living (with replicas=1) on that broker. I then bumped this broker once more and then my whole cluster became in sync.
> >
> > I suspect that something related to 0-size topics caused this, since the cluster worked fine the week before during testing and also afterwards during more testing with rc3.
> >
> > 2015-02-05 19:22 GMT+01:00 Kyle Banker <kyleban...@gmail.com>:
> >
> > > Digging in a bit more, it appears that the "down" broker had likely partially failed. Thus, it was still attempting to fetch offsets that no longer exist. Does this make sense as an explanation of the above-mentioned behavior?
> > >
> > > On Thu, Feb 5, 2015 at 10:58 AM, Kyle Banker <kyleban...@gmail.com> wrote:
> > >
> > > > Dug into this a bit more, and it turns out that we lost one of our 9 brokers at the exact moment when this started happening. At the time that we lost the broker, we had no under-replicated partitions. Since the broker disappeared, we've had a fairly constant number of under-replicated partitions. This makes some sense, of course.
> > > >
> > > > Still, the log message doesn't.
> > > >
> > > > On Thu, Feb 5, 2015 at 10:39 AM, Kyle Banker <kyleban...@gmail.com> wrote:
> > > >
> > > >> I have a 9-node Kafka cluster, and all of the brokers just started spouting the following error:
> > > >>
> > > >> ERROR [Replica Manager on Broker 1]: Error when processing fetch request for partition [mytopic,57] offset 0 from follower with correlation id 58166. Possible cause: Request for offset 0 but we only have log segments in the range 39 to 39. (kafka.server.ReplicaManager)
> > > >>
> > > >> The "mytopic" topic has a replication factor of 3, and metrics are showing a large number of under-replicated partitions.
> > > >>
> > > >> My assumption is that a log aged out but that the replicas weren't aware of it.
> > > >>
> > > >> In any case, this problem isn't fixing itself, and the volume of log messages of this type is enormous.
> > > >>
> > > >> What might have caused this? How does one resolve it?
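A note on the BadVersion message quoted above: that znode holds the leader/ISR state for the partition, and as far as I understand BadVersion only means a conditional update was attempted with a stale version, i.e. the controller and a broker raced on the same znode. You can look at its current contents directly (sketch; adjust the ZooKeeper host and the topic/partition to yours):

    # from the ZooKeeper installation directory
    bin/zkCli.sh -server localhost:2181 get /brokers/topics/mytopic/partitions/143/state
    # prints JSON with the current leader, leader_epoch and isr for that partition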
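If it helps, two checks that should show whether the replicas really are behind (commands run from the Kafka bin directory; the log directory path below is an assumption, adjust it to your log.dirs):

    # list only the partitions that currently have fewer in-sync replicas than their replication factor
    bin/kafka-topics.sh --zookeeper localhost:2181 --describe --under-replicated-partitions

    # dump the offsets actually present in a segment on disk, to compare with the
    # "log segments in the range 39 to 39" part of the broker error
    bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /var/kafka-logs/mytopic-57/00000000000000000039.log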