----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29379/#review66942 -----------------------------------------------------------
Thinking about this more, I think this patch only covers one very specific case of the bug. Since we only check for timeouts when the leader lookup returns null, this only handles the case where we're unable to update metadata (which is the only reason we would consistently be missing leader info). It doesn't handle other important cases, e.g. we know the leader but can't send for a long time because we got disconnected and can't reconnect or we're still connected but requests are getting through too slowly so the batch sits in the queue for a long time. I think it might make sense to restructure this a bit to address those cases and handle the issue with the missing stats/completeBatch call. How about always checking expiration as we're iterating through all these items, including it in the ReadyCheckResult return value as a collection like we do with readyNodes, and then the caller, Sender, can use completeBatch() to clean up? This avoids some duplicate code, gets the error stats right, and makes sure batches are always considered for expiration so it should completely solve the problem. You might need to add a while() loop to pull off all the expired batches and then use the existing code to process the next remaining batch (if there is one left). Another issue: the current patch relies on ready() being called frequently for the batches to be removed promptly after they expire. However, there are conditions where poll() will be called with large timeouts, potentially up to 5 minutes using the default settings. If we followed the approach described above, we'd probably need to track another value similar to nextReadyCheckDelayMs which would indicate when we would next need to wake up to expire the batch with the earliest expiration time. The fixes you made did address the problems. It looks like this no longer applies to trunk due to 50b734690, so it'll need rebasing. - Ewen Cheslack-Postava On Jan. 6, 2015, 6:44 p.m., Parth Brahmbhatt wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29379/ > ----------------------------------------------------------- > > (Updated Jan. 6, 2015, 6:44 p.m.) > > > Review request for kafka. > > > Bugs: KAFKA-1788 > https://issues.apache.org/jira/browse/KAFKA-1788 > > > Repository: kafka > > > Description > ------- > > Merge remote-tracking branch 'origin/trunk' into KAFKA-1788 > > > KAFKA-1788: addressed Ewen's comments. > > > Diffs > ----- > > clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java > f61efb35db7e0de590556e6a94a7b5cb850cdae9 > clients/src/main/java/org/apache/kafka/clients/producer/ProducerConfig.java > a893d88c2f4e21509b6c70d6817b4b2cdd0fd657 > > clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java > c15485d1af304ef53691d478f113f332fe67af77 > > clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordBatch.java > dd0af8aee98abed5d4a0dc50989e37888bb353fe > > clients/src/test/java/org/apache/kafka/clients/producer/RecordAccumulatorTest.java > 2c9932401d573549c40f16fda8c4e3e11309cb85 > clients/src/test/java/org/apache/kafka/clients/producer/SenderTest.java > ef2ca65cabe97b909f17b62027a1bb06827e88fe > > Diff: https://reviews.apache.org/r/29379/diff/ > > > Testing > ------- > > Unit test added. > > > Thanks, > > Parth Brahmbhatt > >