Sorry, I misread your email and confused it with another thread. As for the issue you observed: it seems "broker-05:6667", which is the group coordinator for this Streams app with app id (i.e. group id) "grp_id", is in an unstable state. Since the Streams app can no longer commit offsets while the group coordinator is unavailable, it cannot make progress and instead repeatedly re-discovers the coordinator.
This is not really a Streams issue, but a consumer group membership management issue. In practice you need to make sure that the offsets topic is replicated (I think by default it is 3 replicas), so that whenever the leader of a given offsets topic partition, and hence the group coordinator, fails, another broker can take over and the consumer groups mapped to that offsets topic partition are not blocked. A rough sketch below shows one way to check the current leader and replica count of the offsets topic partitions.
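To be clear, the sketch is only an illustration: it uses the AdminClient that ships with 0.11+ clients (on 0.10.x you can get the same information from kafka-topics.sh --describe --topic __consumer_offsets), and the class name and bootstrap address are placeholders, not something from your setup.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class OffsetsTopicCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap address -- point this at your own cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-05:6667");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription offsets = admin
                    .describeTopics(Collections.singletonList("__consumer_offsets"))
                    .all().get()
                    .get("__consumer_offsets");

            for (TopicPartitionInfo p : offsets.partitions()) {
                // the leader of an __consumer_offsets partition acts as the group
                // coordinator for all groups that hash to that partition
                System.out.printf("partition %d leader=%s replicas=%d isr=%d%n",
                        p.partition(),
                        p.leader() == null ? "none" : p.leader().idString(),
                        p.replicas().size(),
                        p.isr().size());
            }
        }
    }
}

If that prints a single replica per partition, note that (as far as I know) offsets.topic.replication.factor only applies when the topic is first created, so an existing under-replicated __consumer_offsets needs a partition reassignment to gain more replicas.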
Guozhang

On Mon, May 15, 2017 at 7:33 PM, Mahendra Kariya <mahendra.kar...@go-jek.com> wrote:
> Thanks for the reply Guozhang! But I think we are talking about 2 different
> issues here. KAFKA-5167 is for LockException. We face this issue
> intermittently, but not a lot.
>
> There is also another issue where a particular broker is marked as dead for
> a group id and the Streams process never recovers from this exception.
>
> On Mon, May 15, 2017 at 11:28 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > I'm wondering if it is possibly due to KAFKA-5167? In that case, the
> > "other thread" will keep retrying on grabbing the lock.
> >
> > Guozhang
> >
> > On Sat, May 13, 2017 at 7:30 PM, Mahendra Kariya
> > <mahendra.kar...@go-jek.com> wrote:
> > > Hi,
> > >
> > > There is no missing data. But the INFO level logs are infinite and the
> > > streams practically stops. For the messages that I posted, we got these
> > > INFO logs for around 20 mins. After which we got an alert about no data
> > > being produced in the sink topic and we had to restart the streams
> > > processes.
> > >
> > > On Sun, May 14, 2017 at 1:01 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > > Hi,
> > > >
> > > > I just dug a little bit. The messages are logged at INFO level and
> > > > thus should not be a problem if they go away by themselves after some
> > > > time. Compare:
> > > > https://groups.google.com/forum/#!topic/confluent-platform/A14dkPlDlv4
> > > >
> > > > Do you still see missing data?
> > > >
> > > > -Matthias
> > > >
> > > > On 5/11/17 2:39 AM, Mahendra Kariya wrote:
> > > > > Hi Matthias,
> > > > >
> > > > > We faced the issue again. The logs are below.
> > > > >
> > > > > 16:13:16.527 [StreamThread-7] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > 16:13:16.543 [StreamThread-3] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > 16:13:16.547 [StreamThread-6] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > 16:13:16.551 [StreamThread-1] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > > 16:13:16.572 [StreamThread-4] INFO o.a.k.c.c.i.AbstractCoordinator - Marking the coordinator broker-05:6667 (id: 2147483642 rack: null) dead for group grp_id
> > > > > 16:13:16.573 [StreamThread-2] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 (id: 2147483642 rack: null) for group grp_id.
> > > > >
> > > > > On Tue, May 9, 2017 at 3:40 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > > >> Great! Glad 0.10.2.1 fixes it for you!
> > > > >>
> > > > >> -Matthias
> > > > >>
> > > > >> On 5/7/17 8:57 PM, Mahendra Kariya wrote:
> > > > >>> Upgrading to 0.10.2.1 seems to have fixed the issue.
> > > > >>>
> > > > >>> Until now, we were looking at random 1-hour chunks of data to analyse
> > > > >>> the issue. Over the weekend, we have written a simple test that will
> > > > >>> continuously check for inconsistencies in real time and report if
> > > > >>> there is any issue.
> > > > >>>
> > > > >>> No issues have been reported for the last 24 hours. Will update this
> > > > >>> thread if we find any issue.
> > > > >>>
> > > > >>> Thanks for all the support!
> > > > >>>
> > > > >>> On Fri, May 5, 2017 at 3:55 AM, Matthias J. Sax <matth...@confluent.io> wrote:
> > > > >>>> About
> > > > >>>>
> > > > >>>>> 07:44:08.493 [StreamThread-10] INFO o.a.k.c.c.i.AbstractCoordinator - Discovered coordinator broker-05:6667 for group group-2.
> > > > >>>>
> > > > >>>> Please upgrade to Streams 0.10.2.1 -- we fixed a couple of bugs and
> > > > >>>> I would assume this issue is fixed, too. If not, please report back.
> > > > >>>>
> > > > >>>>> Another question that I have is, is there a way for us to detect
> > > > >>>>> how many messages have come out of order? And if possible, what is
> > > > >>>>> the delay?
> > > > >>>>
> > > > >>>> There is no metric or API for this. What you could do, though, is
> > > > >>>> use #transform() that only forwards each record and, as a side task,
> > > > >>>> extracts the timestamp via `context#timestamp()` and does some
> > > > >>>> bookkeeping to compute whether it is out-of-order and what the delay
> > > > >>>> was.
> > > > >>>>
> > > > >>>>>>> - same for .mapValues()
> > > > >>>>>>
> > > > >>>>>> I am not sure how to check this.
> > > > >>>>
> > > > >>>> The same way as you do for filter()?
> > > > >>>>
> > > > >>>> -Matthias
> > > > >>>>
> > > > >>>> On 5/4/17 10:29 AM, Mahendra Kariya wrote:
> > > > >>>>> Hi Matthias,
> > > > >>>>>
> > > > >>>>> Please find the answers below.
> > > > >>>>>
> > > > >>>>>> I would recommend to double check the following:
> > > > >>>>>>
> > > > >>>>>> - can you confirm that the filter does not remove all data for
> > > > >>>>>> those time periods?
> > > > >>>>>
> > > > >>>>> Filter does not remove all data. There is a lot of data coming in
> > > > >>>>> even after the filter stage.
> > > > >>>>>
> > > > >>>>>> - I would also check input for your AggregatorFunction() -- does
> > > > >>>>>> it receive everything?
> > > > >>>>>
> > > > >>>>> Yes. Aggregate function seems to be receiving everything.
> > > > >>>>>
> > > > >>>>>> - same for .mapValues()
> > > > >>>>>
> > > > >>>>> I am not sure how to check this.
> >
> > --
> > -- Guozhang


--
-- Guozhang
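P.S. Regarding the out-of-order question earlier in this thread: a pass-through Transformer along the lines Matthias described could look roughly like the sketch below. This is only an illustration against the 0.10.2-era Transformer interface (newer versions dropped punctuate()); the class name and the one-minute punctuation interval are made up.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Forwards every record unchanged while keeping per-task counters of
// out-of-order arrivals and the largest observed delay.
public class OutOfOrderTracker<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    private ProcessorContext context;
    private long maxTimestampSeen = Long.MIN_VALUE;
    private long outOfOrderCount = 0L;
    private long maxDelayMs = 0L;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        context.schedule(60_000L); // request punctuate() roughly every minute (stream time, 0.10.x API)
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        long ts = context.timestamp(); // record timestamp, as suggested
        if (ts < maxTimestampSeen) {
            outOfOrderCount++;
            maxDelayMs = Math.max(maxDelayMs, maxTimestampSeen - ts);
        } else {
            maxTimestampSeen = ts;
        }
        return KeyValue.pair(key, value); // pass the record through unchanged
    }

    @Override
    public KeyValue<K, V> punctuate(long timestamp) {
        // report the bookkeeping; a real version might expose a metric instead
        System.out.printf("out-of-order so far: %d, max delay ms: %d%n",
                outOfOrderCount, maxDelayMs);
        return null; // nothing to forward from punctuation
    }

    @Override
    public void close() {}
}

You would hook it in with something like stream.transform(OutOfOrderTracker::new) right before the aggregation.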
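P.P.S. On the question of how to check what .mapValues() actually receives: the same trick as for filter() works, i.e. a pass-through with a side effect. A minimal sketch on the 0.10.x DSL follows; the topic names, serdes, and the probe app id are made up for illustration, and newer Streams versions have peek() for exactly this.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class MapValuesProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mapvalues-probe");   // made-up app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-05:6667"); // placeholder address
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        KStreamBuilder builder = new KStreamBuilder();                       // 0.10.x builder API
        KStream<String, String> input = builder.stream("input-topic");       // made-up topic

        input.filter((key, value) -> value != null)
             // pass-through with a side effect: log (or count) whatever reaches this stage
             .mapValues(value -> {
                 System.out.println("mapValues saw: " + value);
                 return value;
             })
             .to("output-topic");                                            // made-up topic

        new KafkaStreams(builder, props).start();
    }
}

The same pattern, logging and then returning the value unchanged, can be dropped in front of the aggregation step as well.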