The topic is not compressed. The consumer used our fork of the python lib, which I had to modify to skip over the nulls.
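For reference, the workaround amounts to something like this (a minimal sketch, not our actual fork; it assumes the 0.7 magic-0 on-disk layout of a 4-byte big-endian size, then a 1-byte magic, a 4-byte CRC32, and the payload, and the name iter_messages is just illustrative):

import struct

def iter_messages(buf):
    """Yield payloads from a Kafka 0.7 message set, skipping null gaps."""
    pos = 0
    while pos + 4 <= len(buf):
        (size,) = struct.unpack_from(">i", buf, pos)
        if size <= 0:
            # A zero-filled gap parses as size 0: step 4 bytes and re-sync.
            # This assumes the gap length is a multiple of 4; a robust
            # version would scan byte-by-byte for a plausible header.
            pos += 4
            continue
        if pos + 4 + size > len(buf):
            break  # truncated final message
        # Body is magic (1 byte) + CRC32 (4 bytes) + payload; CRC check
        # and magic-1 (attributes byte) handling omitted for brevity.
        yield buf[pos + 9:pos + 4 + size]
        pos += 4 + size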
-neil

On Thu, Nov 6, 2014 at 2:16 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> IIRC, the bug that introduced the nulls was related to compressed data. Is
> this topic compressed? Did you try to run a consumer through the topic's
> data or alternatively the DumpLogSegments tool?
>
> On Thu, Nov 6, 2014 at 12:56 PM, Neil Harkins <nhark...@gmail.com> wrote:
>>
>> Hi all. I saw something weird yesterday on our "leaf" instances,
>> which run kafka 0.7.2 (and mirror to kafka 0.8 via our custom code).
>> I fully realize everyone's instinctual response is "upgrade, already.",
>> but I'd like to have an internals discussion to better understand
>> what happened, as I suspect it's still relevant in 0.8.
>>
>> Basically, in one of our topics there was an 8k stretch of nulls.
>> Correlating timestamps from the messages bracketing the nulls
>> to the kafka log, I see that the server restarted during that time,
>> and here are the recovery lines related to the topic with the nulls:
>>
>> [2014-11-04 00:48:07,602] INFO zookeeper state changed (SyncConnected)
>> (org.I0Itec.zkclient.ZkClient)
>> [2014-11-04 01:00:35,806] INFO Shutting down Kafka server
>> (kafka.server.KafkaServer)
>> [2014-11-04 01:00:35,813] INFO shutdown scheduler kafka-logcleaner-
>> (kafka.utils.KafkaScheduler)
>> [2014-11-04 01:01:38,411] INFO Starting Kafka server...
>> (kafka.server.KafkaServer)
>> ...
>> [2014-11-04 01:01:49,146] INFO Loading log 'foo.bar-0'
>> (kafka.log.LogManager)
>> [2014-11-04 01:01:49,147] INFO Loading the last segment
>> /var/kafka-leaf-spool/foo.bar-0/00000000002684355423.kafka in mutable
>> mode, recovery true (kafka.log.Log)
>> [2014-11-04 01:01:55,877] INFO recover high water mark:414004449
>> (kafka.message.FileMessageSet)
>> [2014-11-04 01:01:55,877] INFO Recovery succeeded in 6 seconds. 0
>> bytes truncated. (kafka.message.FileMessageSet)
>>
>> The only hypothesis I can come up with is that the shutdown
>> ("graceful"?) did not wait for all messages to flush to disk
>> (we're configured with: log.flush.interval=10000,
>> log.default.flush.interval.ms=500, and
>> log.default.flush.scheduler.interval.ms=500),
>> but the max offset was recorded, so that when it came back up,
>> it filled the gap with nulls to reach the valid max offset in case
>> any consumers were at the end.
>>
>> But for consumers with a position prior to all the nulls,
>> are they guaranteed to get back "on the rails", so to speak?
>> Nulls appear as v0 (i.e. "magic") messages of 0 length,
>> but the messages replaced could be variable length.
>>
>> Thanks in advance for any input,
>> -neil
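P.S. If anyone wants to inspect a suspect segment directly, the 0.8-era invocation of the tool Neha mentioned is roughly the following (flag names are from the 0.8 tool; 0.7's version may differ, so check the class's usage output first):

bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /var/kafka-leaf-spool/foo.bar-0/00000000002684355423.kafka \
  --print-data-log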