guozhangwang commented on pull request #11340:
URL: https://github.com/apache/kafka/pull/11340#issuecomment-948136951


   @RivenSun2 I talked to @hachikuji offline about the best options to fix it 
in the near term, and we feel the async-commit approach may be more appropriate 
here. But you'd need to be careful not to just try once and give up 
immediately to continue the rebalance. Here's the status quo:
   
   * We would try to commit for up to the configured rebalance.timeout, and if we 
exhaust that timeout without succeeding (like in this case, where we keep getting 
the unknown topic partition error), we would just log it and continue the rebalance.
   * Note that we have a flag `needsJoinPrepare` in AbstractCoordinator which 
is set to `false` before the `onJoinPrepare` call, which means that if the call 
itself throws an error, upon the next `poll` we would not try to trigger 
`onJoinPrepare` again.
   
   So to make async-commit work, here's a rough sketch of what we'd need to do:
   
   * We keep a reference to the response future of the last commit request sent 
as part of `onJoinPrepare`.
   * In `maybeAutoCommitOffsetsSync`, which we would rename to 
`maybeAutoCommitOffsetsAsync`, we check whether the response future is `null`; 
if it is `null` we just send out the request and hold on to the `future`. 
Then we call networkClient.poll once and see if the `future` is 
completed. If it is and there's no error, we return `true` from 
`maybeAutoCommitOffsetsAsync` to indicate that it has succeeded; otherwise we 
return `false`.
   * When `maybeAutoCommitOffsetsAsync` returns `false`, `onJoinPrepare` 
would return `false` immediately as well, and the caller would then reset the 
`needsJoinPrepare` flag so that the next `poll` would still trigger 
`onJoinPrepare`, and then return to the `poll` call.
   
   By doing that, the `poll` call would not block on the commit, but would 
return immediately after just one attempt at the commit request. The user may 
need to call `poll` multiple times to complete the commit as part of 
`onJoinPrepare` before the rebalance continues, but this would help resolve 
the issue of blocking for longer than the `poll` timeout. As for backing off, 
let's delegate that to KIP-580.
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


