Thanks for finding this out and sharing this info.

Jun


On Fri, Jul 25, 2014 at 12:09 PM, Kashyap Paidimarri <kashy...@gmail.com>
wrote:

> Solved. (We're using Kafka 0.8.1, and this was caused by the bug in
> dynamic topic config changes.)
>
> Found the problem.
> We had changed retention.ms for this topic to 100,000 (100 seconds)
> earlier this month (using the kafka-topics.sh admin tool). Then, after
> Kafka had purged data, we proceeded to set the retention back to
> 1209600000 ms (14 days).
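>
> For reference, the changes were made with commands along these lines (the
> ZooKeeper connect string and topic name below are placeholders, not our
> actual values):
>
>   bin/kafka-topics.sh --zookeeper zk1:2181 --alter --topic our-topic \
>     --config retention.ms=100000
>   bin/kafka-topics.sh --zookeeper zk1:2181 --alter --topic our-topic \
>     --config retention.ms=1209600000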
>
> However, I believe the second change wasn't picked up by the brokers.
>
> (See TopicConfigManager.scala.) On further debugging, we found that the
> znode added to /config/changes by AdminUtils.scala (via kafka-topics.sh)
> was getting deleted before all the brokers had a chance to read its
> content and update the topic configurations. This deletion was the result
> of a bug in TopicConfigManager.scala which would delete the znode if that
> broker did not have the topic that changed.
>
> The existence of this race condition means that a broker might or might
> not see a dynamic topic configuration change, until of course the broker
> is bounced, at which point it reads the configuration afresh from
> /config/topics.
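>
> (If you want to see this for yourself, the relevant znodes can be
> inspected from a ZooKeeper shell, e.g. the zookeeper-shell.sh that ships
> with Kafka; the connect string and topic name below are placeholders:
>
>   bin/zookeeper-shell.sh zk1:2181
>   ls /config/changes
>   get /config/topics/our-topic
>
> /config/changes holds the short-lived change-notification znodes that the
> brokers watch, while /config/topics/<topic> holds the full set of
> overrides that a broker re-reads when it is bounced.)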
>
> So this was indeed because of the bug reported in KAFKA-1398. Basically,
> dynamic topic config changes are horribly broken in Kafka 0.8.1 and we
> need to move to 0.8.1.1.
>
> Sequence of events:
> 1. We set retention.ms = 100000.
> 2. Brokers (1, 2, 3) picked up that change.
> 3. We set retention.ms to 1209600000.
> 4. One of the other 3 brokers would have picked up the change and gone
> ahead and deleted the znode, because it didn't have the topic.
> 5. By the time the brokers that actually had the topic reacted, the znode
> was no longer there, so they logged an error and moved on.
>
> In summary:
> DO NOT USE kafka-topics.sh --alter --config <key>=<value> if you're on
> Kafka 0.8.1.
>
> If you've used it, do verify that all brokers are demonstrating the
> behaviour you expect.
> A symptom of this bug is that if a topic is on all brokers, then the znode
> added to /config/changes will never be removed. That also means that
> config changes to a topic that is on all brokers are safe.
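>
> A quick way to check what ZooKeeper currently holds for a topic (host and
> topic name are placeholders; note that this reflects /config/topics, not
> what any given broker has actually applied):
>
>   bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic our-topic
>
> If the Configs field shows the override you expect but a broker is still
> deleting segments early, that broker most likely missed the change and
> needs a bounce so that it re-reads /config/topics.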
>
>
> On Fri, Jul 25, 2014 at 11:07 AM, Kashyap Paidimarri <kashy...@gmail.com>
> wrote:
>
> > Attached a transcript that explains what I'm seeing.
> >
> >
> > On Fri, Jul 25, 2014 at 10:52 AM, Kashyap Paidimarri <kashy...@gmail.com>
> > wrote:
> >
> >> No, we haven't configured that. We have a few hundred topics but this
> >> seems to be the only one affected (I did a quick check, not thorough).
> >>
> >> Here are the relevant config params we have set in server.properties:
> >>
> >> log.dir=/var/lib/fk-3p-kafka/logs
> >> log.flush.interval.messages=10000
> >> log.flush.interval.ms=1000
> >> log.retention.hours=168
> >> log.segment.bytes=536870912
> >> log.cleanup.interval.mins=1
> >> log.retention.hours=336
> >>
> >>
> >>
> >> On Fri, Jul 25, 2014 at 10:11 AM, Jun Rao <jun...@gmail.com> wrote:
> >>
> >>> Have you configured log.retention.bytes?
> >>>
> >>> Thanks,
> >>>
> >>> Jun
> >>>
> >>>
> >>> On Thu, Jul 24, 2014 at 10:04 AM, Kashyap Paidimarri <kashy...@gmail.com>
> >>> wrote:
> >>>
> >>> > We just noticed that one of our topics has been horribly misbehaving.
> >>> >
> >>> > *retention.ms* for the topic is set to 1209600000 ms
> >>> >
> >>> > However, segments are getting scheduled for deletion as soon as a new
> >>> > one is rolled over. And naturally consumers are running into a
> >>> > kafka.common.OffsetOutOfRangeException whenever this happens.
> >>> >
> >>> > Is this a known bug? It is incredibly serious. We seem to have lost
> >>> > about 40 million messages on a single topic and are yet to figure out
> >>> > which topics are affected.
> >>> >
> >>> > I thought of restarting Kafka but figured I'd leave it untouched
> >>> > while I figure out what I can capture for finding the root cause.
> >>> >
> >>> > Meanwhile, in order to keep from losing any more data, I have a
> >>> > periodic job that is doing a *'cp -al'* of the partitions into a
> >>> > separate folder. That way Kafka goes ahead and deletes the segment,
> >>> > but the data is not lost from the filesystem.
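> >>> >
> >>> > (Roughly, the job does something like the following; the log dir is the
> >>> > one from our server.properties, and the topic-partition and backup paths
> >>> > are placeholders:
> >>> >
> >>> >   cp -al /var/lib/fk-3p-kafka/logs/mytopic-0 /data/kafka-backup/mytopic-0.$(date +%s)
> >>> >
> >>> > Since -l creates hard links instead of copying data, the segment files
> >>> > survive Kafka unlinking them without taking extra disk space up front.)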
> >>> >
> >>> > If this is an unseen bug, what should I save from the running
> >>> > instance?
> >>> >
> >>> > By the way, this has affected all partitions and replicas of the
> >>> > topic, and not just on a specific host.
> >>> >
> >>>
> >>
> >>
> >>
> >
> >
> >
> >
>
>
>
>
