Hey Vinay,

Thanks, that's really helpful. It does seem like there might be a problem with the heartbeat trigger logic. I'll see if I can reproduce what you're seeing locally. It might also help if you could share a snippet of your poll loop.
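To make sure we're comparing the same thing, here is roughly the shape of loop I have in mind from your description. It's only a sketch: the class name, topic, group id and processing step are placeholders rather than your actual code; only the 15-second proactive commit, session.timeout.ms=30000 and heartbeat.interval.ms=3000 are taken from what you described.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ProactiveCommitLoop {                            // placeholder name
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // placeholder
        props.put("group.id", "test-group");                  // placeholder
        props.put("enable.auto.commit", "false");
        props.put("session.timeout.ms", "30000");
        props.put("heartbeat.interval.ms", "3000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("test-topic"));      // placeholder topic

        // Offsets of records processed since the last commit, per partition.
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        long lastCommit = System.currentTimeMillis();

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                // ... per-record processing here (the part that can take a long time) ...
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
                // Proactive commit roughly every 15 seconds to try to keep the session alive.
                if (System.currentTimeMillis() - lastCommit > 15000) {
                    consumer.commitSync(offsets);
                    offsets.clear();
                    lastCommit = System.currentTimeMillis();
                }
            }
        }
    }
}

If your loop differs from this in any meaningful way, especially around where commitSync is called relative to poll and the metadata refresh, that's exactly the part I'd like to see.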
Thanks,
Jason

On Thu, Apr 28, 2016 at 11:55 AM, vinay sharma <vinsharma.t...@gmail.com> wrote:

> Hi Jason,
>
> I reverted back to KAFKA-3149. The producer still had issues related to schema, but my consumer worked.
>
> Now the consumer worked as expected. Although I did not encounter an error and the generation was not marked dead by the coordinator, I still see that successful heartbeat responses are not logged as expected. My observations are the following:
> 1) A meta refresh also triggers a heartbeat request. I say this because sometimes I see 2 heartbeat responses logged just a few milliseconds apart where a meta refresh and a proactive commit happened almost simultaneously.
> 2) I still see that some commitSync requests do not have a heartbeat logged before or after the commit, although the next proactive commit happened just in time and that time the heartbeat request was successful, which saved the session. In the attached log you can see that a poll was done at 14:17:41, a commit happened at 14:17:56 and another commit happened at 14:18:14. The only heartbeat response logged during this time is at 14:18:14, which is 29 seconds after the poll, whereas a commit was performed 15 seconds after the poll. The heartbeat interval was 3000.
> 3) There are long pauses between heartbeat responses in the logs which should cause the session to time out, but that is not happening. This implies that commits not only trigger a heartbeat but also act as one.
>
> Regards,
> Vinay
>
> On Thu, Apr 28, 2016 at 12:29 PM, Jason Gustafson <ja...@confluent.io> wrote:
>
>> Ah, yeah. That's probably caused by the new topic metadata version, which isn't supported on 0.9 brokers. To test on trunk, you'd have to upgrade the brokers as well. Either that or you can rewind to before KAFKA-3306 (which was just committed the day before yesterday)?
>>
>> -Jason
>>
>> On Thu, Apr 28, 2016 at 9:01 AM, vinay sharma <vinsharma.t...@gmail.com> wrote:
>>
>> > Hi Jason,
>> >
>> > I built the kafka-client and tried using it, but my producers and consumers started throwing the exception below. Is 0.10 not going to be compatible with brokers on version 0.9.0.1? Or do I need to make some config changes to the producers/consumers to make them compatible with brokers on the old version? Or do I need to upgrade the brokers to the new version as well?
>> >
>> > org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading field 'host': Error reading string of length 17995, only 145 bytes available
>> >     at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:75)
>> >     at org.apache.kafka.clients.NetworkClient.parseResponse(NetworkClient.java:380)
>> >     at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:449)
>> >
>> > Regards,
>> > Vinay Sharma
>> >
>> > On Thu, Apr 28, 2016 at 12:32 AM, Jason Gustafson <ja...@confluent.io> wrote:
>> >
>> > > Hey Vinay,
>> > >
>> > > Any chance you can run the same test against trunk? I'm guessing this might be caused by a bug in the 0.9 consumer which basically causes some requests to fail when a bunch of them are sent to the broker at the same time.
>> > >
>> > > -Jason
>> > >
>> > > On Wed, Apr 27, 2016 at 1:02 PM, vinay sharma <vinsharma.t...@gmail.com> wrote:
>> > >
>> > > > Hi Jason,
>> > > >
>> > > > This makes sense. We use 0.9.0.1 and we do have the session timeout set a bit high, but nothing can guarantee that processing will never run longer than the session timeout. I am trying to test a proactive commit approach to handle cases where processing takes an unusually long time. To keep the consumer's session alive during long processing, I proactively commitSync the processed records every 15 seconds. The session timeout I kept is 30000.
>> > > >
>> > > > *Problem:*
>> > > > With a heartbeat interval of 3000, I expect a heartbeat request to be sent on each proactive commit, which happens every 15 seconds. In my tests I see that this does not always happen. I see a time window greater than 30 seconds in which no heartbeat is sent even though there were commits in that duration. After this window I see a couple of successful heartbeat responses until the end of the poll, but as soon as I poll again and call commitSync in the next poll I get an "ILLEGAL_GENERATION" error. This error always happens just after a meta refresh or in the next poll's processing after a meta refresh. I am attaching logs where I kept the meta refresh interval at 40000, 90000 and 500000.
>> > > >
>> > > > *Test results:*
>> > > > Test with meta refresh 40000 ms ran around 70 seconds from the 1st poll.
>> > > > Test with meta refresh 90000 ms ran around 120 seconds from the 1st poll.
>> > > > Test with meta refresh 500000 ms ran around 564 seconds from the 1st poll.
>> > > >
>> > > > Every test falls in line with the cases above, where the generation is marked dead some time after a meta refresh. A meta refresh before the 1st poll does not create any issue, but the ones after a poll and during long processing do.
>> > > >
>> > > > *Environment:*
>> > > > My setup has 3 brokers and 1 ZooKeeper. The topic has 3 partitions and replication factor 3. Messages are already published to the topic.
>> > > >
>> > > > *Logic used in the test cases:*
>> > > > On each poll I initialize a map with the current committed offset position of the partitions being consumed. I update this map after each record is processed and use it to proactively commit every 15 seconds. The map is initialized again after a proactive commit.
>> > > >
>> > > > I am not sure what is wrong here, but I do not see any issue in the code or the offset commits. Log files and a class with a main method are attached for your reference.
>> > > >
>> > > > Regards,
>> > > > Vinay Sharma
>> > > >
>> > > > On Wed, Apr 27, 2016 at 2:46 PM, Jason Gustafson <ja...@confluent.io> wrote:
>> > > >
>> > > >> Hi Vinay,
>> > > >>
>> > > >> Answers below:
>> > > >>
>> > > >> > 1) Is it correct to say that each commitSync will trigger a HeartbeatTask? If no heartbeat has been sent within the specified heartbeat interval, should I see a successful heartbeat response or a failure message in the logs near the commitSync success log?
>> > > >>
>> > > >> Not quite. Heartbeats are sent periodically according to the heartbeat.interval.ms configuration.
>> > > >> However, since the consumer has no background thread, they can only be sent in API calls such as poll() or commitSync(). So calling commitSync() may or may not result in a heartbeat, depending only on whether one is "due."
>> > > >>
>> > > >> > 2) Is it correct to say that a metadata refresh will not act as a heartbeat, will not trigger the HeartbeatTask and will not reset the HeartbeatTask?
>> > > >>
>> > > >> That is correct. Metadata refreshes are not related to heartbeats.
>> > > >>
>> > > >> > 3) Where is a consumer session maintained? Let's say my consumer is listening to 3 partitions on a 3-broker cluster where each broker is the leader of 1 partition. Will each of the brokers have a session for my consumer, or is there just 1 session maintained somewhere in common, like ZooKeeper?
>> > > >>
>> > > >> One of the brokers serves as the "group coordinator." When the consumer starts up, it sends a GroupCoordinator request to one of the brokers to find out who the coordinator is. Currently, coordinators are chosen from among the leaders of the partitions of the __consumer_offsets topic. This lets us take advantage of the leader election process to also handle coordinator failures. The coordinator of each group maintains state for the group and keeps track of session timeouts.
>> > > >>
>> > > >> > 4) In the above setup, if during long processing I commit a record through commitSync, which triggers a heartbeat request, and a successful response is received for it, what does that response mean? Does it mean that my session with each broker is renewed? Or does it mean that just the leader for the committed record's partition knows that my consumer is alive, and the consumer's session on the other brokers will still time out?
>> > > >>
>> > > >> The coordinator is the only broker that is aware of a consumer's session, and all offset commits are sent to it. Successful heartbeats mean that the session is still active. Heartbeats are also used to let the consumer discover when a rebalance has begun. If a new member joins the group, then the coordinator returns an error code in the heartbeat responses of the active members to let them know that they need to rejoin the group so that partitions can be rebalanced.
>> > > >>
>> > > >> I wouldn't get too hung up on commit/heartbeat behavior. The crux of the issue is that you need to call poll() often enough to avoid getting timed out by the coordinator. If you find this happening frequently, you probably need to increase session.timeout.ms. There's not really any downside to doing so, other than that hard failures (in which the consumer can't be shut down cleanly) will take a little longer to detect. Normal shutdown doesn't have this problem. It can be difficult in 0.9 to ensure that poll() is called often enough since you don't have direct control over the amount of data returned in poll(), but we're adding an option (max.poll.records) in 0.10 which hopefully can be set conservatively enough to make this problem go away.
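>> > > >>
>> > > >> Purely as an illustration, the kind of consumer settings I mean would look something like the fragment below. The values are examples only, not tuned recommendations; max.poll.records requires the 0.10 client, and session.timeout.ms has to stay within the broker-side group.max.session.timeout.ms, which you may need to raise as well.
>> > > >>
>> > > >> Properties props = new Properties();
>> > > >> // More headroom for slow processing between polls (example value only;
>> > > >> // must be <= the brokers' group.max.session.timeout.ms).
>> > > >> props.put("session.timeout.ms", "60000");
>> > > >> props.put("heartbeat.interval.ms", "3000");
>> > > >> // 0.10 only: cap how many records a single poll() can return,
>> > > >> // so each poll's processing stays well under the session timeout.
>> > > >> props.put("max.poll.records", "100");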
>> > > >>
>> > > >> -Jason
>> > > >>
>> > > >> On Wed, Apr 27, 2016 at 7:11 AM, vinay sharma <vinsharma.t...@gmail.com> wrote:
>> > > >>
>> > > >> > Hey,
>> > > >> >
>> > > >> > I am working on a simplified test case to check whether there is any issue in my code. Just to make sure that none of my assumptions are wrong, it would be great if you could help me find answers to the following queries:
>> > > >> >
>> > > >> > 1) Is it correct to say that each commitSync will trigger a HeartbeatTask? If no heartbeat has been sent within the specified heartbeat interval, should I see a successful heartbeat response or a failure message in the logs near the commitSync success log?
>> > > >> > 2) Is it correct to say that a metadata refresh will not act as a heartbeat, will not trigger the HeartbeatTask and will not reset the HeartbeatTask?
>> > > >> > 3) Where is a consumer session maintained? Let's say my consumer is listening to 3 partitions on a 3-broker cluster where each broker is the leader of 1 partition. Will each of the brokers have a session for my consumer, or is there just 1 session maintained somewhere in common, like ZooKeeper?
>> > > >> > 4) In the above setup, if during long processing I commit a record through commitSync, which triggers a heartbeat request, and a successful response is received for it, what does that response mean? Does it mean that my session with each broker is renewed? Or does it mean that just the leader for the committed record's partition knows that my consumer is alive, and the consumer's session on the other brokers will still time out?
>> > > >> >
>> > > >> > Regards,
>> > > >> > Vinay Sharma
>> > > >> >
>> > > >> > On Tue, Apr 26, 2016 at 2:38 PM, Jason Gustafson <ja...@confluent.io> wrote:
>> > > >> >
>> > > >> > > Hey Vinay,
>> > > >> > >
>> > > >> > > Are you saying that heartbeats are not sent while a metadata refresh is in progress? Do you have any logs which show us the apparent problem?
>> > > >> > >
>> > > >> > > Thanks,
>> > > >> > > Jason
>> > > >> > >
>> > > >> > > On Tue, Apr 26, 2016 at 8:18 AM, vinay sharma <vinsharma.t...@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > Hi Ismael,
>> > > >> > > >
>> > > >> > > > Treating commitSync as a heartbeat would definitely resolve the issue I am facing, but the reason behind my issue does not seem to be what is mentioned in the defect (i.e. frequent commitSync requests).
>> > > >> > > >
>> > > >> > > > I am sending commitSync requests periodically only to keep my session alive when my consumer is still processing records and is close to the session timeout (I tried the 10th / 12th / 15th / 20th second after poll was called, with a session timeout of 30). I see a heartbeat response received in the logs along with each commitSync call, but this stops after a metadata refresh request is issued. After the meta refresh and until the next poll, the logs show that commits succeed but there is no "heartbeat response received" message.
>> > > >> > > >
>> > > >> > > > Regards,
>> > > >> > > > Vinay Sharma
>> > > >> > > >
>> > > >> > > > On Mon, Apr 25, 2016 at 5:06 PM, Ismael Juma <ism...@juma.me.uk> wrote:
>> > > >> > > >
>> > > >> > > > > Hi Vinay,
>> > > >> > > > >
>> > > >> > > > > This was fixed via https://issues.apache.org/jira/browse/KAFKA-3470 (it will be part of 0.10.0.0).
>> > > >> > > > >
>> > > >> > > > > Ismael
>> > > >> > > > >
>> > > >> > > > > On Mon, Apr 25, 2016 at 1:52 PM, vinay sharma <vinsharma.t...@gmail.com> wrote:
>> > > >> > > > >
>> > > >> > > > > > Hello,
>> > > >> > > > > >
>> > > >> > > > > > I am using client API 0.9.0.1 and facing an issue. As per my logs, it seems that on each commitSync(offsets) a heartbeat request is sent, but after a metadata refresh request and until the next poll(), commits do not send any heartbeat request.
>> > > >> > > > > >
>> > > >> > > > > > The KafkaConsumers I create sometimes get a session timeout due to no heartbeat, especially during longer processing times. I call commitSync(offsets) at regular intervals to keep the session alive when processing takes longer than usual. Everything works fine if the commit intervals are very small or if I commit after each record, but if I commit, say, every 12 seconds with a session timeout of 30 seconds, then I can see the consumer getting timed out sometimes.
>> > > >> > > > > >
>> > > >> > > > > > Any help or pointers will be much appreciated. Thanks in advance.
>> > > >> > > > > >
>> > > >> > > > > > Regards,
>> > > >> > > > > > Vinay Sharma