Hey Becket,

1. I agree that this is a problem. I think it ended up this way because there are two ways a TimeoutException can be thrown: the server can time out, or we can time out waiting for memory. The complaint at the time was that it was annoying to have to write the same exception-handling logic in two places. That is annoying, but I think you are right that we should probably guarantee that callbacks are always invoked in order by partition, since that is the bigger problem.
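
To make the ordering hazard in (1) concrete, here is a minimal sketch against the public producer API; the broker address, topic, keys, and the LoggingCallback helper are placeholders for illustration only. The point is that if the second send() times out waiting for accumulator memory, its callback runs synchronously on the caller thread and can complete before the callback of an earlier record for the same partition.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class CallbackOrderingSketch {

        // Simple callback that just prints completion order.
        static class LoggingCallback implements Callback {
            private final String id;

            LoggingCallback(String id) {
                this.id = id;
            }

            @Override
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                System.out.println("completed " + id
                        + (exception == null ? "" : " with " + exception));
            }
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Record A is appended to the accumulator; its callback fires later,
                // on the I/O thread, once the broker responds.
                producer.send(new ProducerRecord<>("topic", 0, "key", "A"),
                        new LoggingCallback("A"));

                // If this call hits a TimeoutException while waiting for accumulator
                // memory, its callback is invoked synchronously inside send() on the
                // caller thread -- so "completed B" can print before "completed A",
                // even though both records target partition 0.
                producer.send(new ProducerRecord<>("topic", 0, "key", "B"),
                        new LoggingCallback("B"));
            }
        }
    }
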
2. This is by design, but we argued it both ways. The question is: what are you batching for? If you are batching to avoid small network sends and amortize request overhead, then the current approach makes the most sense; if you are batching to reduce file writes, then your more conservative approach makes more sense. By opportunistically piggybacking on an existing request you get to send these records essentially for free, but if you waited you might have accumulated more data for those partitions and justified a send anyway, with more data per write. Philosophically you could argue for either. I think the case where the current behavior might be suboptimal is when you have lots of partitions, a really high linger.ms, and a fair amount of skew, so that each full partition triggers a send for the smaller partitions. A starting point for optimization would be to measure this bad case and figure out how much opportunity there is for improvement (say 500 partitions with strong skew toward one or two of them). Our benchmarks tend to miss this case because they distribute load evenly. If the opportunity is large here, we can discuss whether it is worth trying to fix and what would get worse as a result. A rough sketch of the drain behavior in question follows the quoted message below.

-Jay

On Mon, May 18, 2015 at 3:44 PM, Jiangjie Qin <j...@linkedin.com.invalid> wrote:

> Hi,
>
> I have two questions while writing the patch for KAFKA-2142 and need some help:
>
> 1. In send(), we might actually call callback.onCompletion(), which
> could break the guarantee that callbacks are fired in the sending order of a
> particular partition.
> 2. In RecordAccumulator, the current logic is that if one batch to a broker
> is ready (full or reaches linger.ms), we drain all the batches to the
> same broker even if the other batches are not full or have not reached linger.ms yet.
> Guozhang mentioned that this is by design, but it seems different from the
> Java doc and might have a negative effect on batching. Was there any concern
> about this?
>
> Thanks,
>
> Jiangjie (Becket) Qin
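
For reference, here is a minimal sketch of the drain policy described in (2), not the actual RecordAccumulator code: a node becomes sendable as soon as any one of its partition batches is full or has waited linger.ms, and the drain then pulls every non-empty batch destined for that node. The Batch class, the drain() method, and the batchSizeBytes/lingerMs parameters are illustrative stand-ins, not the real internals.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified model of the drain policy discussed above: a broker ("node")
    // becomes sendable as soon as ANY one of its partition batches is ready,
    // and the drain then takes every non-empty batch for that node, ready or not.
    public class DrainSketch {

        static class Batch {
            final String topicPartition;
            final long createdMs;
            int sizeBytes;

            Batch(String topicPartition, int sizeBytes, long createdMs) {
                this.topicPartition = topicPartition;
                this.sizeBytes = sizeBytes;
                this.createdMs = createdMs;
            }

            // A batch is "ready" when it is full or has waited at least linger.ms.
            boolean ready(long nowMs, int batchSizeBytes, long lingerMs) {
                return sizeBytes >= batchSizeBytes || nowMs - createdMs >= lingerMs;
            }
        }

        // node id -> batches for the partitions whose leader is that node
        private final Map<String, List<Batch>> batchesByNode = new HashMap<>();

        void append(String node, Batch batch) {
            batchesByNode.computeIfAbsent(node, n -> new ArrayList<>()).add(batch);
        }

        // Drain everything for any node that has at least one ready batch.
        List<Batch> drain(long nowMs, int batchSizeBytes, long lingerMs) {
            List<Batch> drained = new ArrayList<>();
            for (List<Batch> nodeBatches : batchesByNode.values()) {
                boolean anyReady = false;
                for (Batch b : nodeBatches) {
                    if (b.ready(nowMs, batchSizeBytes, lingerMs)) {
                        anyReady = true;
                        break;
                    }
                }
                if (anyReady) {
                    // One full batch on a hot partition drags along the small,
                    // not-yet-ready batches of every other partition on this node,
                    // which is the skew scenario described above.
                    drained.addAll(nodeBatches);
                    nodeBatches.clear();
                }
            }
            return drained;
        }
    }
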