IIRC, the bug that introduced the nulls was related to compressed data. Is this topic compressed? Did you try running a consumer over the topic's data or, alternatively, running the DumpLogSegments tool against it?
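If it helps narrow things down, here is a rough sketch for locating the null stretch by byte offset in a copy of the segment file. It is plain Python, not a Kafka tool, and it knows nothing about the message format. The helper name `find_zero_runs` and the 1024-byte threshold are my own assumptions for illustration; adjust the threshold to taste.

```python
import re
import sys


def find_zero_runs(data: bytes, min_len: int = 1024):
    """Return (start_offset, length) for every run of at least
    min_len consecutive zero bytes in data.

    This only looks at raw bytes; it does not parse Kafka messages.
    """
    # Build a bytes regex such as rb"\x00{1024,}" for the chosen threshold.
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


if __name__ == "__main__":
    # Usage: python find_zero_runs.py <path-to-a-copy-of-the-segment-file>
    # Reads the whole file into memory, so work on a copy and watch the size.
    with open(sys.argv[1], "rb") as f:
        contents = f.read()
    for start, length in find_zero_runs(contents):
        print(f"zero run at byte {start}, length {length}")
```

Comparing the reported start and length against the offsets of the messages bracketing the nulls should show whether the gap begins and ends on a message boundary.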
On Thu, Nov 6, 2014 at 12:56 PM, Neil Harkins <nhark...@gmail.com> wrote:

> Hi all. I saw something weird yesterday on our "leaf" instances
> which run kafka 0.7.2 (and mirror to kafka 0.8 via our custom code).
> I fully realize everyone's instinctual response is "upgrade, already.",
> but I'd like to have an internals discussion to better understand
> what happened, as I suspect it's still relevant in 0.8.
>
> Basically, in one of our topics there was an 8k stretch of nulls.
> Correlating timestamps from the messages bracketing the nulls
> to the kafka log, I see that the server restarted during that time,
> and here are the recovery lines related to the topic with the nulls:
>
> [2014-11-04 00:48:07,602] INFO zookeeper state changed (SyncConnected)
> (org.I0Itec.zkclient.ZkClient)
> [2014-11-04 01:00:35,806] INFO Shutting down Kafka server
> (kafka.server.KafkaServer)
> [2014-11-04 01:00:35,813] INFO shutdown scheduler kafka-logcleaner-
> (kafka.utils.KafkaScheduler)
> [2014-11-04 01:01:38,411] INFO Starting Kafka server...
> (kafka.server.KafkaServer)
> ...
> [2014-11-04 01:01:49,146] INFO Loading log 'foo.bar-0'
> (kafka.log.LogManager)
> [2014-11-04 01:01:49,147] INFO Loading the last segment
> /var/kafka-leaf-spool/foo.bar-0/00000000002684355423.kafka in mutable
> mode, recovery true (kafka.log.Log)
> [2014-11-04 01:01:55,877] INFO recover high water mark:414004449
> (kafka.message.FileMessageSet)
> [2014-11-04 01:01:55,877] INFO Recovery succeeded in 6 seconds. 0
> bytes truncated. (kafka.message.FileMessageSet)
>
> The only hypothesis I can come up with is that the shutdown
> ("graceful"?) did not wait for all messages to flush to disk
> (we're configured with: log.flush.interval=10000,
> log.default.flush.interval.ms=500, and
> log.default.flush.scheduler.interval.ms=500),
> but the max offset was recorded, so that when it came back up,
> it filled the gap with nulls to reach the valid max offset in case
> any consumers were at the end.
>
> But for consumers with a position prior to all the nulls,
> are they guaranteed to get back "on the rails" so-to-speak?
> Nulls appear as v0(i.e. "magic") messages of 0 length,
> but the messages replaced could be variable length.
>
> Thanks in advance for any input,
> -neil