[ https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980895#comment-14980895 ]
Jiangjie Qin commented on KAFKA-2017:
-------------------------------------

[~guozhang] I think the approach works. It might be a little tight on schedule, though.

We may also need to enforce that the session timeout is greater than the request timeout. If the coordinator has a hard failure, the consumers will keep sending HeartbeatRequest to the failed coordinator but won't receive any HeartbeatResponse. This continues until the request timeout expires. So if the session timeout is smaller than the request timeout (which is the case with the current settings), consumers might be kicked out of the group and still have issues committing offsets.

Just to make sure we have considered all the alternatives: in terms of (1), my original understanding is that it effectively persists the group information in the consumers themselves. The idea is that when the coordinator fails over, the consumers will eventually talk to the new coordinator through some kind of request, so the new coordinator just needs to silently collect the information from the consumers. If the coordinator receives a Heartbeat or OffsetCommit from an unknown group id or unknown consumer, it infers that the group is in a stable state. We simply accept the request if the group is unknown and record the consumer id, group id and generation id. For subsequent requests from consumers, as long as the generation id matches, the coordinator just adds them to the group. (That makes the consumer id essentially less useful, but this is a problem we already have today, i.e. the user will always receive either UnknownConsumerIdException or IllegalGenerationIdException.)

We might need to think a bit more about what happens if the new coordinator receives a JoinGroupRequest or SyncGroupRequest as the first request from an unknown group or consumer. I am not sure whether this would work, but it might be an option.
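To make the idea in (1) a bit more concrete, here is a rough sketch of how a new coordinator could silently rebuild group state from Heartbeat and OffsetCommit requests. It is only an illustration in Python, not Kafka code: the class names, the method name and the "OK" / "ILLEGAL_GENERATION" return strings are all made up for this example, and it ignores JoinGroupRequest and SyncGroupRequest entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RecoveredGroup:
    """Group state rebuilt only from what consumers report (hypothetical)."""
    generation_id: int
    members: Set[str] = field(default_factory=set)


class LazyRecoveryCoordinator:
    """Hypothetical new coordinator that starts with no persisted group state."""

    def __init__(self) -> None:
        self.groups: Dict[str, RecoveredGroup] = {}

    def handle_heartbeat_or_commit(
        self, group_id: str, consumer_id: str, generation_id: int
    ) -> str:
        group = self.groups.get(group_id)
        if group is None:
            # Unknown group: assume it was stable before the failover and
            # adopt the generation id reported by the first consumer.
            self.groups[group_id] = RecoveredGroup(generation_id, {consumer_id})
            return "OK"
        if generation_id != group.generation_id:
            # Known group but mismatched generation: reject the request.
            return "ILLEGAL_GENERATION"
        # Matching generation: add the consumer, even if it was unknown.
        group.members.add(consumer_id)
        return "OK"


coordinator = LazyRecoveryCoordinator()
coordinator.handle_heartbeat_or_commit("group-a", "consumer-1", 5)  # adopted
coordinator.handle_heartbeat_or_commit("group-a", "consumer-2", 5)  # added
coordinator.handle_heartbeat_or_commit("group-a", "consumer-3", 4)  # rejected
```

Note that in this sketch the first request for a group decides the generation id the coordinator adopts, and a consumer only becomes a member once it has sent a request to the new coordinator.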
The caveat is that if the coordinator and a consumer fail at the same time, the new coordinator will not trigger a rebalance, because it depends on the consumers' periodic requests to recover the group information. Also, describe group won't work, because the assignment information is not available unless we let the consumers send their metadata again.

> Persist Coordinator State for Coordinator Failover
> --------------------------------------------------
>
>                 Key: KAFKA-2017
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2017
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>    Affects Versions: 0.9.0.0
>            Reporter: Onur Karaman
>            Assignee: Guozhang Wang
>            Priority: Blocker
>             Fix For: 0.9.0.0
>
>         Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch,
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to fail over to
> a new coordinator without forcing all the consumers to rejoin their groups. This
> is possible if the coordinator persists its state so that the state can be
> transferred during coordinator failover. This state consists of most of the
> information in GroupRegistry and ConsumerRegistry.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)