We faced a similar issue with Kafka 0.10.0.1. Going through the Kafka
code, I figured out that when the coordinator goes down, one of the
other in-sync replicas takes over as coordinator and scans the whole
log of the __consumer_offsets partition for the consumer group in
order to rebuild its offset cache. In our case that partition had
grown to ~600 GB, and the scan took ~40 minutes, during which the
consumers were without a coordinator. So how long consumers stay in
this state depends on how big the partition's log is.


The following broker config change fixed it for us:


log.cleaner.enable=true


(This enables log compaction for the __consumer_offsets partitions; in
our case they now get compacted roughly every 10 minutes instead of
growing without bound.)
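
For anyone checking their own setup, these are the broker settings (in
server.properties) that I understand to be relevant to how quickly
__consumer_offsets gets compacted. log.cleaner.enable is the only value
I actually changed; the others are, as far as I know, the 0.10.x
defaults and are listed just for context:

# run the log cleaner so compacted topics such as __consumer_offsets
# are actually compacted
log.cleaner.enable=true

# number of background cleaner threads (default 1)
log.cleaner.threads=1

# segment size of the offsets topic; smaller segments roll over and
# become eligible for compaction sooner (default 100 MB)
offsets.topic.segment.bytes=104857600

# how long committed offsets are retained (default 1440 = 24 hours)
offsets.retention.minutes=1440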



On Sun, May 14, 2017 at 1:01 AM, Matthias J. Sax <matth...@confluent.io>
wrote:

> Hi,
>
> I just dug a little bit. The messages are logged at INFO level and thus
> should not be a problem if they go away by themselves after some time.
> Compare:
> https://groups.google.com/forum/#!topic/confluent-platform/A14dkPlDlv4
>
> Do you still see missing data?
>
>
> -Matthias
>
>
> On 5/11/17 2:39 AM, Mahendra Kariya wrote:
> > Hi Matthias,
> >
> > We faced the issue again. The logs are below.
> >
> > 16:13:16.527 [StreamThread-7] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > 16:13:16.573 [StreamThread-2] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> >
> >
> >
> > On Tue, May 9, 2017 at 3:40 AM, Matthias J. Sax <matth...@confluent.io>
> > wrote:
> >
> >> Great! Glad 0.10.2.1 fixes it for you!
> >>
> >> -Matthias
> >>
> >> On 5/7/17 8:57 PM, Mahendra Kariya wrote:
> >>> Upgrading to 0.10.2.1 seems to have fixed the issue.
> >>>
> >>> Until now, we were looking at random 1-hour windows of data to
> >>> analyse the issue. Over the weekend, we have written a simple test
> >>> that will continuously check for inconsistencies in real time and
> >>> report if there is any issue.
> >>>
> >>> No issues have been reported for the last 24 hours. Will update this
> >>> thread if we find any issue.
> >>>
> >>> Thanks for all the support!
> >>>
> >>>
> >>>
> >>> On Fri, May 5, 2017 at 3:55 AM, Matthias J. Sax <matth...@confluent.io>
> >>> wrote:
> >>>
> >>>> About
> >>>>
> >>>>> 07:44:08.493 [StreamThread-10] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 for group group-2.
> >>>>
> >>>> Please upgrade to Streams 0.10.2.1 -- we fixed a couple of bugs and
> >>>> I would assume this issue is fixed, too. If not, please report back.
> >>>>
> >>>>> Another question that I have is: is there a way for us to detect
> >>>>> how many messages have come out of order? And if possible, what is
> >>>>> the delay?
> >>>>
> >>>> There is no metric or API for this. What you could do, though, is
> >>>> use #transform() with a transformer that only forwards each record
> >>>> and, as a side task, extracts the timestamp via `context#timestamp()`
> >>>> and does some bookkeeping to compute whether the record is
> >>>> out-of-order and what the delay was.
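
[Interjecting for anyone who finds this thread later: below is a rough,
untested sketch of the kind of bookkeeping transformer Matthias
describes above. The class and field names are made up for illustration;
only Transformer, ProcessorContext, KeyValue and context.timestamp() are
actual Kafka Streams (0.10.2.x) API.]

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Forwards every record unchanged while counting out-of-order arrivals
// and tracking the largest delay seen by this processor instance.
public class OutOfOrderTracker<K, V>
        implements Transformer<K, V, KeyValue<K, V>> {

    private ProcessorContext context;
    private long maxTimestampSeen = Long.MIN_VALUE;
    private long outOfOrderCount = 0L;
    private long maxDelayMs = 0L;

    @Override
    public void init(final ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<K, V> transform(final K key, final V value) {
        final long ts = context.timestamp();   // timestamp of current record
        if (ts < maxTimestampSeen) {
            outOfOrderCount++;                 // record arrived late
            maxDelayMs = Math.max(maxDelayMs, maxTimestampSeen - ts);
        } else {
            maxTimestampSeen = ts;
        }
        return KeyValue.pair(key, value);      // forward unchanged
    }

    @Override
    public KeyValue<K, V> punctuate(final long timestamp) {
        return null;                           // nothing to emit periodically
    }

    @Override
    public void close() {
        System.out.println("out-of-order=" + outOfOrderCount
                + ", maxDelayMs=" + maxDelayMs);
    }
}

[It would be wired in with something like
stream.transform(() -> new OutOfOrderTracker<>()); transform() takes a
TransformerSupplier and creates one transformer per task, so the counts
are per task.]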
> >>>>
> >>>>
> >>>>>>>  - same for .mapValues()
> >>>>>>>
> >>>>>>
> >>>>>> I am not sure how to check this.
> >>>>
> >>>> The same way as you do for filter()?
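
[Same caveat as above: a hedged sketch, not from the thread. One way to
check whether .mapValues() receives everything is to add a counter or a
log line inside the existing mapper; `filtered` and `myExistingMapping`
below are placeholders for your real stream and mapping logic.]

import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.streams.kstream.KStream;

// ... inside the code that builds the topology:
final AtomicLong seenByMapValues = new AtomicLong();
final KStream<String, String> mapped = filtered.mapValues(value -> {
    seenByMapValues.incrementAndGet();   // count what reaches this stage
    return myExistingMapping(value);     // placeholder: real mapping logic
});
// Compare seenByMapValues against the count after the filter stage.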
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>> On 5/4/17 10:29 AM, Mahendra Kariya wrote:
> >>>>> Hi Matthias,
> >>>>>
> >>>>> Please find the answers below.
> >>>>>
> >>>>> I would recommend to double check the following:
> >>>>>>
> >>>>>>  - can you confirm that the filter does not remove all data for
> >>>>>>    those time periods?
> >>>>>>
> >>>>>
> >>>>> The filter does not remove all data. There is a lot of data coming
> >>>>> in even after the filter stage.
> >>>>>
> >>>>>
> >>>>>>  - I would also check input for your AggregatorFunction() -- does it
> >>>>>> receive everything?
> >>>>>>
> >>>>>
> >>>>> Yes. Aggregate function seems to be receiving everything.
> >>>>>
> >>>>>
> >>>>>>  - same for .mapValues()
> >>>>>>
> >>>>>
> >>>>> I am not sure how to check this.
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>
