Yeah, it just popped up in my list. Thanks, I'll take a look. Vincent Dautremont, if you're still reading this, did you try upgrading to 0.11.0.1? Did it fix the issue?
On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <ben.davi...@7digital.com> wrote:

Hi Dmitriy,

Did you check out the thread "Incorrect consumer offsets after broker restart 0.11.0.0" from Phil Luckhurst? It sounds similar.

Thanks,

Ben

On Wed, Oct 11, 2017 at 4:44 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hey, I want to resurrect this thread.

We decided to do an idle test, where no data is produced to the topic at all. When we kill #101 or #102, nothing happens. But when we kill #200, consumers start to re-consume old events from a random position.

Anybody have ideas what to check? I really expected that Kafka would fail symmetrically with respect to any broker.

On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hi tao,

we had unclean leader election enabled at the beginning. But then we disabled it and also reduced the 'max.poll.records' value. It helped a little bit.

After today's testing there is a strong correlation between the lag spike and which broker we crash. For the lowest-id broker (100):

1. the lag is always at least 1-2 orders of magnitude higher
2. we start getting:

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

3. sometimes re-consumption from a random position

When we crash the other brokers (101, 102), it's just a lag spike on the order of ~10K that settles down quite quickly, with no consumer exceptions.

Totally lost as to what to try next.
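For reference, a minimal sketch of the kind of consumer tuning that exception message points at: smaller poll batches and a longer poll interval, so that one batch of processing always finishes before max.poll.interval.ms expires. The broker list, group id and all numbers below are placeholders, not values from this thread; the property names are from the 0.11 Java client.

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class TunedConsumerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker list and group id (not taken from this thread).
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092,broker-101:9092,broker-102:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            // Smaller batches plus a longer poll interval: one poll()'s worth of
            // processing should always finish before max.poll.interval.ms expires.
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // illustrative
            props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // illustrative, 10 min
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");    // illustrative
            KafkaConsumer<String, String> consumer =
                    new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
            // ... subscribe, poll, process, commit as usual ...
            consumer.close();
        }
    }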
On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xiaotao...@gmail.com> wrote:

Do you have unclean leader election turned on? If killing 100 is the only way to reproduce the problem, it is possible with unclean leader election turned on that leadership was transferred to an out-of-ISR follower which may not have the latest high watermark.

On Sat, Oct 7, 2017 at 3:51 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

About to verify that hypothesis on Monday, but it looks like that in the latest tests. Need to double check.

On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <schiz...@gmail.com> wrote:

So no matter in what sequence you shut down brokers, it is only one that causes the major problem? That would indeed be a bit weird. Have you checked the offsets of your consumer right after they jump back: does it start from the beginning of the topic, or does it go back to some random position? Have you checked that all offsets are actually being committed by the consumers?

On Fri, Oct 6, 2017 at 8:59 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Yeah, probably we can dig around.

One more observation: the most lag/re-consumption trouble happens when we kill the broker with the lowest id (e.g. 100 from [100, 101, 102]). When crashing other brokers there is nothing special happening; the lag grows a little bit, but nothing crazy (e.g. thousands, not millions).

Does that sound suspicious?

On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <schiz...@gmail.com> wrote:

Ted: when choosing earliest/latest you are saying: if it happens that there is no "valid" offset committed for a consumer (for whatever reason: bug/misconfiguration/no luck), it is ok to start from the beginning or end of the topic. So if you are not ok with that, you should choose none.

Dmitriy: Ok. Then it is spring-kafka that maintains this offset-per-partition state for you. It might also have the problem of leaving stale offsets lying around. After quickly looking through https://github.com/spring-projects/spring-kafka/blob/1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/main/java/org/springframework/kafka/listener/KafkaMessageListenerContainer.java it looks possible, since the offsets map is not cleared upon partition revocation, but that is just a hypothesis. I have no experience with spring-kafka. However, since you say your consumers were always active, I find this theory worth investigating.

On Fri, Oct 6, 2017 at 18:20 GMT+02:00, Vincent Dautremont <vincent.dautrem...@olamobile.com.invalid> wrote:

Is there a way to read messages on a topic partition from a specific node that we choose (and not from the topic partition leader)? I would like to check myself that each of the __consumer_offsets partition replicas has the same consumer group offset written in it.
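As far as I know, the 0.11 Java consumer always fetches from the partition leader, so per-replica inspection is not directly exposed. What is easy to check right after an offset jump is what the coordinator has stored for the group versus the log end offset, either with kafka-consumer-groups.sh --describe or programmatically. A sketch with placeholder broker, group, topic and partition:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CommittedOffsetCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            try (KafkaConsumer<String, String> consumer =
                         new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer())) {
                TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition
                // committed() asks the group coordinator (the broker that leads the
                // group's __consumer_offsets partition) what offset is stored.
                OffsetAndMetadata committed = consumer.committed(tp);
                // Log end offset of the same partition, for comparison.
                long logEnd = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
                System.out.println("committed=" + (committed == null ? "none" : committed.offset())
                        + ", logEnd=" + logEnd);
            }
        }
    }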
On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Stas:

we rely on spring-kafka; it commits offsets "manually" for us after the event handler has completed. So it's kind of automatic once there is a constant stream of events (no idle time, which is true for us). Though it's not what the pure kafka-client calls "automatic" (flushing commits at fixed intervals).

On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <schiz...@gmail.com> wrote:

You don't have autocommit enabled, which means you commit offsets yourself - correct? If you store them per partition somewhere and fail to clean that up upon rebalance, then the next time the consumer gets this partition assigned during a rebalance it can commit an old stale offset - can this be the case?

On Fri, Oct 6, 2017 at 5:59 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Reprocessing the same events again is fine for us (idempotent), while losing data is more critical.

What are the reasons for such behaviour? Consumers are never idle and always committing, so probably something wrong with the broker setup then?

On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Stas:
bq. using anything but none is not really an option

If you have time, can you explain a bit more?

Thanks

On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <schiz...@gmail.com> wrote:

If you set auto.offset.reset to none, next time it happens you will be in a much better position to find out what happened. Also, in general, with the current semantics of the offset reset policy, IMO using anything but none is not really an option unless it is ok for the consumer to lose some data (latest) or reprocess it a second time (earliest).
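A sketch of what that looks like with the plain Java client: with auto.offset.reset=none the consumer throws instead of silently jumping, so where to restart becomes an explicit decision in application code. Broker, group and topic names below are placeholders.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class NoResetConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            // "none": fail loudly when the group has no valid committed offset,
            // instead of silently resetting to earliest or latest.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
            KafkaConsumer<String, String> consumer =
                    new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            try {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                // ... process and commit ...
            } catch (NoOffsetForPartitionException e) {
                // Committed offset missing or invalid: alert, then decide explicitly
                // where to restart (e.g. seek to a known-good offset).
                System.err.println("No valid committed offset: " + e.getMessage());
            } finally {
                consumer.close();
            }
        }
    }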
On Fri, Oct 6, 2017 at 5:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Should Kafka log a warning if log.retention.hours is lower than the number of hours specified by offsets.retention.minutes?

On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <manikumar.re...@gmail.com> wrote:

Normally, log.retention.hours (168 hrs) should be higher than offsets.retention.minutes (336 hrs)?
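For reference, both settings are broker-side (server.properties). The point of the comparison is that committed group offsets should be retained at least as long as consumers might need them; with the values quoted in this thread, the offsets outlive the topic data:

    # server.properties, values as quoted in this thread
    log.retention.hours=168            # topic data kept for 7 days
    offsets.retention.minutes=20160    # committed group offsets kept for 14 days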
On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hi Ted,

Broker: v0.11.0.0

Consumer:
kafka-clients v0.11.0.0
auto.offset.reset = earliest

On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yuzhih...@gmail.com> wrote:

What's the value of auto.offset.reset?

Which release are you using?

Cheers

On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:

Hi all,

we have several times faced a situation where a consumer group started to re-consume old events from the beginning. Here is the scenario:

1. 3-broker Kafka cluster on top of a 3-node ZooKeeper ensemble
2. RF=3 for all topics
3. log.retention.hours=168 and offsets.retention.minutes=20160
4. running sustained load (pushing events)
5. doing disaster testing by randomly shutting down 1 of the 3 broker nodes (then provisioning a new broker back)

Several times after bouncing a broker we faced a situation where the consumer group started to re-consume old events.

Consumer group:

1. enable.auto.commit = false
2. tried graceful group shutdown, kill -9 and terminating AWS nodes
3. never experienced re-consumption for those cases

What can cause that re-consumption of old events? Is it related to bouncing one of the brokers? What should we search for in the logs? Any broker settings to try?

Thanks in advance.
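Since the whole setup hinges on "commit only after the handler has completed", here is roughly what that pattern looks like with the plain 0.11 Java client: a sketch of the general technique, not the actual spring-kafka internals, with placeholder broker, group and topic names. The rebalance listener only logs assignments and revocations, which helps correlate offset jumps with the broker bounces.

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");        // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            KafkaConsumer<String, String> consumer =
                    new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
            consumer.subscribe(Collections.singletonList("my-topic"), // placeholder topic
                    new ConsumerRebalanceListener() {
                        @Override
                        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                            // Any per-partition state kept outside the consumer should be
                            // dropped here so a stale offset cannot be committed later.
                            System.out.println("revoked: " + partitions);
                        }
                        @Override
                        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                            System.out.println("assigned: " + partitions);
                        }
                    });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // ... handle the event (idempotent in this setup) ...
                }
                // Commit only after the whole batch has been processed,
                // the same ordering the spring-kafka container provides.
                consumer.commitSync();
            }
        }
    }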