Thanks Vahid. To clarify the impact of this issue: since we have no way to send an error code in the OffsetFetchResponse when requesting all offsets, we cannot detect when the coordinator has moved to another broker or when it is still in the process of loading the offsets. This means we cannot tell whether there was an error or whether there were simply no offsets stored for the group. We've considered a few options:

1. Include an error code at the top level of the response. This seems like the cleanest approach. The downside is that clients need to check two locations for errors. One small benefit is that many OffsetFetch errors are group-level, so in that case we avoid having to return responses for all the requested partitions.
2. Sort of hacky, but we could insert a "dummy" partition into the response so that we have somewhere to return an error code.
3. Include no error code, but use a null array in the response to indicate that there was some error. If there was no error, and the group simply had no partitions, then we return an empty array. I guess in this case, if the client receives a null array in the response, it should assume the worst and rediscover the coordinator and try again.

My preference is the first one. Not sure if there are any other ideas?

-Jason

On Thu, Dec 15, 2016 at 3:02 PM, Vahid S Hashemian <vahidhashem...@us.ibm.com> wrote:

> Hi all,
>
> Even though KIP-88 was recently approved, due to a limitation that comes
> with the proposed protocol change in KIP-88 I'll have to re-open it to
> address the problem.
> I'd like to thank Jason Gustafson for catching this issue.
>
> I'll explain this in the KIP as well, but to summarize, KIP-88 suggests
> adding the option of passing a "null" array in FetchOffset request to
> query all existing offsets for a consumer group. It does not suggest any
> modification to FetchOffset response.
>
> In the existing protocol, group or coordinator related errors are reported
> along with each partition in the OffsetFetch response.
>
> If there are partitions in the request, they are guaranteed to appear in
> the response (there could be an error code associated with each). So if
> there is an error, it is reported back by being attached to some partition
> in the request.
> If an empty array is passed, no error is reported (no matter what the
> group or coordinator status is). The response comes back with an empty
> list.
>
> With the proposed change in KIP-88 we could have a scenario in which a
> null array is sent in FetchOffset request, and due to some errors (for
> example if coordinator just started and hasn't caught up yet with the
> offset topic), an empty list is returned in the FetchOffset response (the
> group may or may not actually be empty). The issue is in situations like
> this no error can be returned in the response because there is no
> partition to attach the error to.
>
> I'll update the KIP with more details and propose to add to OffsetFetch
> response schema an "error_code" at the top level that can be used to
> report group related errors (instead of reporting those errors with each
> individual partition).
>
> I apologize if this causes any inconvenience.
>
> Feedback and comments are always welcome.
>
> Thanks.
> --Vahid
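
For illustration, here is a rough sketch of how a client might tell "error" apart from "no offsets" under option 1 (top-level error code) and option 3 (null array). This is not Kafka client code: the OffsetFetchResponse class, its fields, and the interpret_* helpers are hypothetical names made up for this example, and the error code values should be treated as illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Illustrative error codes (assumed values, for the sketch only).
NONE = 0
GROUP_LOAD_IN_PROGRESS = 14
NOT_COORDINATOR_FOR_GROUP = 16

# Errors for which the client would rediscover the coordinator and retry.
RETRIABLE = {GROUP_LOAD_IN_PROGRESS, NOT_COORDINATOR_FOR_GROUP}


@dataclass
class OffsetFetchResponse:
    """Hypothetical response model, not the real wire format."""
    # (topic, partition) -> committed offset.
    # None stands in for the "null array" of option 3.
    partitions: Optional[Dict[Tuple[str, int], int]]
    # Top-level error code, as proposed in option 1.
    error_code: int = NONE


def interpret_top_level(resp: OffsetFetchResponse) -> str:
    """Option 1: a group-level error is reported once, at the top level."""
    if resp.error_code != NONE:
        return "retry" if resp.error_code in RETRIABLE else "fail"
    # No error, so an empty result really means the group has no offsets.
    return "offsets" if resp.partitions else "no offsets"


def interpret_null_array(resp: OffsetFetchResponse) -> str:
    """Option 3: null means "some error", empty means "no offsets"."""
    if resp.partitions is None:
        # The cause is unknown, so assume the worst and retry.
        return "retry"
    return "offsets" if resp.partitions else "no offsets"


if __name__ == "__main__":
    loading = OffsetFetchResponse(partitions={}, error_code=GROUP_LOAD_IN_PROGRESS)
    empty = OffsetFetchResponse(partitions={})
    print(interpret_top_level(loading))
    print(interpret_top_level(empty))
    print(interpret_null_array(OffsetFetchResponse(partitions=None)))
```

Under option 1 the client can still distinguish a retriable error from a fatal one, whereas under option 3 every failure collapses into the same "retry" path. A real client using option 1 would also still have to check the per-partition error codes, which this sketch leaves out.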