Hi Mahendra,

Sorry for the late reply.

Just to clarify, my previous reply was only meant for your question about:

"
There is also another issue where a particular broker is marked as dead for
a group id and Streams process never recovers from this exception.
"

And I thought your attached logs were associated with the above-described
"exception":

"
16:13:16.527 [StreamThread-7] INFO o.a.k.c.c.i.AbstractCoordinator -
Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
group grp_id
16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator -
Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
grp_id.
16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator -
Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
group grp_id
16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator -
Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
grp_id.
16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator -
Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
group grp_id
16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator -
Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
grp_id.
16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator -
Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
group grp_id
16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator -
Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
grp_id.
16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator -
Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
group grp_id
16:13:16.573 [StreamThread-2] INFO o.a.k.c.c.i.AbstractCoordinator -
Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
grp_id.

"

I thought you meant that the broker (broker-05:6667) indeed became
unavailable transiently, and that Streams hence could not recover from the
above state. But from your last email, it seems the broker did not actually
crash; rather, some of the Streams instances' embedded consumers continuously
mark it as dead and keep trying to re-discover the coordinator, right? For
that issue I'd suspect a network problem: maybe the network is already
saturated and the heartbeat requests / responses were not exchanged in time
between the consumer and the broker, or sockets are being dropped because of
a socket limit. In such cases not all consumers may be affected, but since
the log messages come from the "AbstractCoordinator" class, which is part of
the consumer client, I'd be surprised if this were caused by Streams itself
rather than by the consumer, given the same consumer config settings.
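
If it does turn out to be heartbeats timing out on a saturated network, one
thing you could try is giving the embedded consumers a bit more slack. Below
is only a minimal sketch with placeholder values (the class name is made up;
these are the standard consumer configs, which Streams passes through to the
consumers it creates), so please tune the numbers for your own environment:

    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    // Made-up helper class, just to show where the configs would go.
    public class StreamsTimeoutTuning {

        public static Properties buildProps() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "grp_id");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-05:6667");

            // Allow more time between successful heartbeats before the
            // coordinator considers the member dead (placeholder values).
            props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
            props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");

            // Give each request (heartbeats included) more time to complete
            // on a slow or saturated network.
            props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");
            return props;
        }
    }

Note that this only buys more slack for slow heartbeats; if sockets are
actually being dropped because of a limit, you'd need to look at the broker
and OS socket settings instead.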

Guozhang

On Tue, May 16, 2017 at 8:58 PM, Mahendra Kariya <mahendra.kar...@go-jek.com
> wrote:

> I am confused. If what you have mentioned is the case, then
>
>    - Why would restarting the stream processes resolve the issue?
>    - Why do we get these infinite stream of exceptions only on some boxes
>    in the cluster and not all?
>    - We have tens of other consumers running just fine. We see this issue
>    only in the streams one.
>
>
>
>
> On Tue, May 16, 2017 at 3:36 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Sorry I mis-read your email and confused it with another thread.
> >
> > As for your observed issue, it seems "broker-05:6667", which is the group
> > coordinator for this Streams app with app id (i.e. group id) "grp_id", is
> > in an unstable state. Since the Streams app cannot commit offsets anymore
> > because the group coordinator is not available, it cannot proceed and
> > instead repeatedly re-discovers the coordinator.
> >
> > This is generally not an issue with Streams itself but with consumer group
> > membership management. In practice you need to make sure that the offsets
> > topic is replicated (I think by default it has 3 replicas) so that whenever
> > the leader of a given offsets topic partition, and hence the group
> > coordinator, fails, another broker can take over and any consumer group
> > mapped to that offsets topic partition won't be blocked.
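> >
> > To see the mapping concretely: the group coordinator is simply the leader
> > of the __consumer_offsets partition that the group id hashes to, so if
> > that partition is under-replicated there is no other broker left to take
> > over. If I remember the broker-side scheme correctly (a positive modulo of
> > the group id's hash code over the number of offsets topic partitions, 50
> > by default), the mapping is roughly the following; the class is only a
> > made-up illustration, not part of any Kafka API:
> >
> >     // Hypothetical helper, only to illustrate the group -> partition mapping.
> >     public final class CoordinatorPartition {
> >
> >         // Mirrors the broker's positive-modulo hashing of the group id.
> >         static int partitionFor(String groupId, int numOffsetsTopicPartitions) {
> >             return (groupId.hashCode() & 0x7fffffff) % numOffsetsTopicPartitions;
> >         }
> >
> >         public static void main(String[] args) {
> >             // With the default 50 offsets topic partitions:
> >             System.out.println(partitionFor("grp_id", 50));
> >         }
> >     }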
> >
> >
> > Guozhang
> >
> >
> >
> > On Mon, May 15, 2017 at 7:33 PM, Mahendra Kariya <
> > mahendra.kar...@go-jek.com
> > > wrote:
> >
> > > Thanks for the reply Guozhang! But I think we are talking about 2
> > > different issues here. KAFKA-5167 is for LockException. We face this
> > > issue intermittently, but not a lot.
> > >
> > > There is also another issue where a particular broker is marked as dead
> > for
> > > a group id and Streams process never recovers from this exception.
> > >
> > > On Mon, May 15, 2017 at 11:28 PM, Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > >
> > > > I'm wondering if it is possibly due to KAFKA-5167? In that case, the
> > > "other
> > > > thread" will keep retrying on grabbing the lock.
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Sat, May 13, 2017 at 7:30 PM, Mahendra Kariya <
> > > > mahendra.kar...@go-jek.com
> > > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > There is no missing data. But the INFO level logs are infinite and
> > the
> > > > > streams practically stops. For the messages that I posted, we got
> > these
> > > > > INFO logs for around 20 mins. After which we got an alert about no
> > data
> > > > > being produced in the sink topic and we had to restart the streams
> > > > > processes.
> > > > >
> > > > >
> > > > >
> > > > > On Sun, May 14, 2017 at 1:01 AM, Matthias J. Sax <
> > > matth...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I just dug a little bit. The messages are logged at INFO level
> and
> > > thus
> > > > > > should not be a problem if they go away by themselves after some
> > > time.
> > > > > > Compare:
> > > > > > https://groups.google.com/forum/#!topic/confluent-platform/A14dkPlDlv4
> > > > > >
> > > > > > Do you still see missing data?
> > > > > >
> > > > > >
> > > > > > -Matthias
> > > > > >
> > > > > >
> > > > > > On 5/11/17 2:39 AM, Mahendra Kariya wrote:
> > > > > > > Hi Matthias,
> > > > > > >
> > > > > > > We faced the issue again. The logs are below.
> > > > > > >
> > > > > > > 16:13:16.527 [StreamThread-7] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
> > > > > > > group grp_id
> > > > > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
> > > > > > > grp_id.
> > > > > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
> > > > > > > group grp_id
> > > > > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
> > > > > > > grp_id.
> > > > > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
> > > > > > > group grp_id
> > > > > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
> > > > > > > grp_id.
> > > > > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
> > > > > > > group grp_id
> > > > > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
> > > > > > > grp_id.
> > > > > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for
> > > > > > > group grp_id
> > > > > > > 16:13:16.573 [StreamThread-2] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > > Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group
> > > > > > > grp_id.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, May 9, 2017 at 3:40 AM, Matthias J. Sax <
> > > > matth...@confluent.io
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Great! Glad 0.10.2.1 fixes it for you!
> > > > > > >>
> > > > > > >> -Matthias
> > > > > > >>
> > > > > > >> On 5/7/17 8:57 PM, Mahendra Kariya wrote:
> > > > > > >>> Upgrading to 0.10.2.1 seems to have fixed the issue.
> > > > > > >>>
> > > > > > >>> Until now, we were looking at random 1 hour data to analyse
> the
> > > > > issue.
> > > > > > >> Over
> > > > > > >>> the weekend, we have written a simple test that will
> > continuously
> > > > > check
> > > > > > >> for
> > > > > > >>> inconsistencies in real time and report if there is any
> issue.
> > > > > > >>>
> > > > > > >>> No issues have been reported for the last 24 hours. Will
> update
> > > > this
> > > > > > >> thread
> > > > > > >>> if we find any issue.
> > > > > > >>>
> > > > > > >>> Thanks for all the support!
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On Fri, May 5, 2017 at 3:55 AM, Matthias J. Sax <
> > > > > matth...@confluent.io
> > > > > > >
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>>> About
> > > > > > >>>>
> > > > > > >>>>> 07:44:08.493 [StreamThread-10] INFO o.a.k.c.c.i.AbstractCoordinator -
> > > > > > >>>>> Discovered coordinator broker-05:6667 for group group-2.
> > > > > > >>>>
> > > > > > >>>> Please upgrade to Streams 0.10.2.1 -- we fixed a couple of bugs
> > > > > > >>>> and I would assume this issue is fixed, too. If not, please
> > > > > > >>>> report back.
> > > > > > >>>>
> > > > > > >>>>> Another question that I have is, is there a way for us to
> > > > > > >>>>> detect how many messages have come out of order? And if
> > > > > > >>>>> possible, what is the delay?
> > > > > > >>>>
> > > > > > >>>> There is no metric or API for this. What you could do, though,
> > > > > > >>>> is use #transform() to only forward each record and, as a side
> > > > > > >>>> task, extract the timestamp via `context#timestamp()` and do
> > > > > > >>>> some bookkeeping to compute whether it is out-of-order and what
> > > > > > >>>> the delay was.
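> > > > > > >>>>
> > > > > > >>>> Roughly something like the following -- only a sketch, not
> > > > > > >>>> compiled or tested; the class name is made up and it uses the
> > > > > > >>>> 0.10.2 Transformer interface:
> > > > > > >>>>
> > > > > > >>>> import org.apache.kafka.streams.KeyValue;
> > > > > > >>>> import org.apache.kafka.streams.kstream.Transformer;
> > > > > > >>>> import org.apache.kafka.streams.processor.ProcessorContext;
> > > > > > >>>>
> > > > > > >>>> // Forwards every record unchanged and counts out-of-order ones.
> > > > > > >>>> class OutOfOrderTracker<K, V> implements Transformer<K, V, KeyValue<K, V>> {
> > > > > > >>>>     private ProcessorContext context;
> > > > > > >>>>     private long maxTimestampSeen = Long.MIN_VALUE;
> > > > > > >>>>     private long outOfOrderCount = 0L;
> > > > > > >>>>
> > > > > > >>>>     @Override
> > > > > > >>>>     public void init(ProcessorContext context) {
> > > > > > >>>>         this.context = context;
> > > > > > >>>>     }
> > > > > > >>>>
> > > > > > >>>>     @Override
> > > > > > >>>>     public KeyValue<K, V> transform(K key, V value) {
> > > > > > >>>>         long ts = context.timestamp();
> > > > > > >>>>         if (ts < maxTimestampSeen) {
> > > > > > >>>>             outOfOrderCount++;
> > > > > > >>>>             long delay = maxTimestampSeen - ts;  // how late this record is
> > > > > > >>>>             // do whatever bookkeeping / logging you need here
> > > > > > >>>>         } else {
> > > > > > >>>>             maxTimestampSeen = ts;
> > > > > > >>>>         }
> > > > > > >>>>         return KeyValue.pair(key, value);  // forward unchanged
> > > > > > >>>>     }
> > > > > > >>>>
> > > > > > >>>>     @Override
> > > > > > >>>>     public KeyValue<K, V> punctuate(long timestamp) {
> > > > > > >>>>         return null;  // nothing to emit periodically
> > > > > > >>>>     }
> > > > > > >>>>
> > > > > > >>>>     @Override
> > > > > > >>>>     public void close() {}
> > > > > > >>>> }
> > > > > > >>>>
> > > > > > >>>> You would hook it in via stream.transform(OutOfOrderTracker::new)
> > > > > > >>>> so the records themselves flow through unchanged.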
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>>>>  - same for .mapValues()
> > > > > > >>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>> I am not sure how to check this.
> > > > > > >>>>
> > > > > > >>>> The same way as you do for filter()?
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> -Matthias
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On 5/4/17 10:29 AM, Mahendra Kariya wrote:
> > > > > > >>>>> Hi Matthias,
> > > > > > >>>>>
> > > > > > >>>>> Please find the answers below.
> > > > > > >>>>>
> > > > > > >>>>> I would recommend to double check the following:
> > > > > > >>>>>>
> > > > > > >>>>>>  - can you confirm that the filter does not remove all
> data
> > > for
> > > > > > those
> > > > > > >>>>>> time periods?
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>> Filter does not remove all data. There is a lot of data
> > coming
> > > in
> > > > > > even
> > > > > > >>>>> after the filter stage.
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>>  - I would also check input for your AggregatorFunction()
> --
> > > > does
> > > > > it
> > > > > > >>>>>> receive everything?
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>> Yes. Aggregate function seems to be receiving everything.
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>>  - same for .mapValues()
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>> I am not sure how to check this.
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang
