[ https://issues.apache.org/jira/browse/KAFKA-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236348#comment-15236348 ]
Jason Gustafson commented on KAFKA-3488:
----------------------------------------

[~guozhang] Sorry for the late response. I thought about that issue too. I don't see any problems for fetches, since we currently only allow one pending request per connection. There may be cases where we end up unnecessarily retrying a failed fetch after a position has been updated, but usually we'd expect the fetch to be sent quickly or to fail from a disconnect, so it doesn't feel like a case worth optimizing for. For requests to the coordinator, it might be possible for a bunch of asynchronous commits to pile up and prevent a needed join group from being sent. This situation is hard to avoid in general even without this patch, because offset commits are so small and many can be sent at once. I think it's unlikely to hit this in practice unless you're sending commits at a high rate. I was more concerned about requests to the old coordinator being sent after we had already failed over to the new coordinator, but the patch included logic to cancel unsent requests whenever the coordinator is marked dead. Are there any other concerns?

> commitAsync() fails if metadata update creates new SASL/SSL connection
> ----------------------------------------------------------------------
>
>                 Key: KAFKA-3488
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3488
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.9.0.1
>            Reporter: Rajini Sivaram
>            Assignee: Rajini Sivaram
>             Fix For: 0.10.1.0, 0.10.0.0
>
>
> Sasl/SslConsumerTest.testSimpleConsumption() fails intermittently with a
> failure in {{commitAsync()}}.
> The exception stack trace shows:
> {quote}
> kafka.api.SaslPlaintextConsumerTest.testSimpleConsumption FAILED
>     java.lang.AssertionError: expected:<1> but was:<0>
>         at org.junit.Assert.fail(Assert.java:88)
>         at org.junit.Assert.failNotEquals(Assert.java:834)
>         at org.junit.Assert.assertEquals(Assert.java:645)
>         at org.junit.Assert.assertEquals(Assert.java:631)
>         at kafka.api.BaseConsumerTest.awaitCommitCallback(BaseConsumerTest.scala:340)
>         at kafka.api.BaseConsumerTest.testSimpleConsumption(BaseConsumerTest.scala:85)
> {quote}
> I have recreated this with some additional trace. The tests run with a very
> small metadata expiry interval, triggering metadata updates quite often. If a
> metadata request immediately following a {{commitAsync()}} call creates a new
> SSL/SASL connection, {{ConsumerNetworkClient.poll}} returns to process the
> connection handshake packets. Since {{ConsumerNetworkClient.poll}} discards
> all unsent packets before returning from poll, this can result in the failure
> of the commit: the callback is invoked with {{SendFailedException}}.
> I understand that {{ConsumerNetworkClient.poll()}} discards unsent packets
> rather than buffering them in order to keep the code simple. And perhaps it is
> acceptable to fail {{commitAsync}} occasionally, since the callback does
> indicate that the caller should retry. But it feels like an unnecessary
> limitation that requires error handling in client applications when there are
> no real failures, and it makes the behaviour much harder to test reliably.
> Since special handling to fix issues like KAFKA-3412 and KAFKA-2672 adds more
> complexity to the code anyway, and because failures that affect only SSL/SASL
> are much harder to debug, it may be worth considering improving this
> behaviour.
> I will see if I can submit a PR for the specific issue I was seeing with the
> impact of handshakes on {{commitAsync()}}, but I will be interested in views
> on improving the logic in {{ConsumerNetworkClient}}.
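The description above notes that the failure delivered to the commit callback is retriable, so one workaround on the application side is a bounded retry. The sketch below illustrates that pattern only; it is not Kafka code. The {{commitAsync}} method here is a simulated stand-in (using only the JDK) that fails the first two attempts with a retriable error and then succeeds, standing in for the real {{KafkaConsumer.commitAsync(OffsetCommitCallback)}}.

```java
import java.util.function.BiConsumer;

// Hypothetical sketch of retrying an asynchronous offset commit when the
// callback reports a retriable failure (e.g. the SendFailedException case
// described in this issue). The commitAsync below is a simulation, not the
// Kafka consumer API.
public class CommitRetrySketch {
    static final int MAX_RETRIES = 3;
    static int attempts = 0;

    // Stand-in for consumer.commitAsync(callback): the first two attempts
    // fail with a simulated retriable error, the third succeeds.
    static void commitAsync(BiConsumer<Boolean, Exception> callback) {
        attempts++;
        if (attempts < 3) {
            callback.accept(false, new RuntimeException("simulated retriable send failure"));
        } else {
            callback.accept(true, null);
        }
    }

    static void commitWithRetry(int retriesLeft) {
        commitAsync((ok, err) -> {
            if (!ok && retriesLeft > 0) {
                // The callback indicated a retriable failure: try again.
                commitWithRetry(retriesLeft - 1);
            } else if (!ok) {
                System.out.println("commit failed after exhausting retries");
            } else {
                System.out.println("commit succeeded after " + attempts + " attempts");
            }
        });
    }

    public static void main(String[] args) {
        commitWithRetry(MAX_RETRIES);
    }
}
```

In a real application the retry would re-issue {{commitAsync}} with the same offsets, and care is needed around ordering if newer commits have been sent in the meantime.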
-- This message was sent by Atlassian JIRA (v6.3.4#6332)