[ 
https://issues.apache.org/jira/browse/KAFKA-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980895#comment-14980895
 ] 

Jiangjie Qin commented on KAFKA-2017:
-------------------------------------

[~guozhang] I think the approach works. The schedule might be a little tight, 
though.

We may also need to enforce that the session timeout is greater than the 
request timeout. If the coordinator has a hard failure, the consumers will 
keep sending HeartbeatRequest to the failed coordinator but won't receive any 
HeartbeatResponse. This continues until the request timeout expires. So if the 
session timeout is smaller than the request timeout (which is the current 
setting), consumers might be kicked out of the group and still have issues 
committing offsets.
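
To make the proposed constraint concrete, here is a minimal sketch of such a 
check. The class and method names (TimeoutConfigCheck, validate) are 
hypothetical and are not part of the Kafka code base; this only illustrates 
the rule suggested above, not how Kafka validates its configuration.

```java
final class TimeoutConfigCheck {
    private TimeoutConfigCheck() {}

    // Hypothetical helper: enforces the rule proposed in this comment,
    // i.e. the session timeout must be strictly greater than the request timeout.
    static void validate(long sessionTimeoutMs, long requestTimeoutMs) {
        if (sessionTimeoutMs <= requestTimeoutMs) {
            throw new IllegalArgumentException(
                "session timeout (" + sessionTimeoutMs + " ms) must be greater than "
                    + "request timeout (" + requestTimeoutMs + " ms)");
        }
    }
}
```

With this check, validate(30000, 10000) passes, while validate(10000, 30000) 
is rejected.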

Just to make sure we have considered all the alternatives: regarding (1), my 
original understanding is that it effectively persists group information in 
the consumers themselves. The idea is that when the coordinator fails over, 
the consumers will eventually talk to the new coordinator through some kind of 
request, so the new coordinator just needs to silently collect the information 
from the consumers. If the coordinator receives a Heartbeat or OffsetCommit 
from an unknown group id or unknown consumer, it infers that the group is in a 
stable state. We simply accept the request if the group is unknown and record 
the consumer id, group id and generation id. For subsequent requests from 
consumers, as long as the generation id matches, the coordinator just adds 
them to the group. (That makes the consumer id essentially less useful, but 
this is a problem we already have today, i.e. the user will always receive 
either UnknownConsumerIdException or IllegalGenerationIdException.) We might 
need to think a bit more about what happens if the new coordinator receives a 
JoinGroupRequest or SyncGroupRequest as the first request from an unknown 
group or consumer. I am not sure whether this would work, but it might be an 
option.
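
The recovery idea above could be sketched roughly as follows. This is only an 
illustration under assumptions: LazyGroupRecovery, onHeartbeatOrCommit, Result 
and memberCount are hypothetical names, not existing Kafka classes, and the 
sketch ignores JoinGroupRequest/SyncGroupRequest, timeouts and concurrency.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class LazyGroupRecovery {
    enum Result { ACCEPTED, ILLEGAL_GENERATION }

    // In-memory view of one group, rebuilt from consumer requests.
    private static final class GroupState {
        final int generationId;
        final Set<String> members = new HashSet<>();

        GroupState(int generationId) {
            this.generationId = generationId;
        }
    }

    private final Map<String, GroupState> groups = new HashMap<>();

    // Handles a heartbeat or offset commit on the new coordinator.
    Result onHeartbeatOrCommit(String groupId, String consumerId, int generationId) {
        GroupState state = groups.get(groupId);
        if (state == null) {
            // Unknown group: assume it is stable and adopt the reported generation.
            state = new GroupState(generationId);
            groups.put(groupId, state);
        }
        if (state.generationId != generationId) {
            return Result.ILLEGAL_GENERATION;
        }
        // Matching generation: record the consumer as a member (idempotent).
        state.members.add(consumerId);
        return Result.ACCEPTED;
    }

    // Number of members recorded so far; 0 for a group that has not been seen.
    int memberCount(String groupId) {
        GroupState state = groups.get(groupId);
        return state == null ? 0 : state.members.size();
    }
}
```

In this sketch the first request seen for a group fixes its generation id, and 
a consumer that never sends a request is never recorded.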

The caveat is that if the coordinator and a consumer fail at the same time, no 
rebalance will be triggered by the new coordinator, because the new 
coordinator depends on the consumers' periodic requests to recover group 
information. Also, describe group won't work, because the assignment 
information is not available unless we have the consumers send their metadata 
again.


> Persist Coordinator State for Coordinator Failover
> --------------------------------------------------
>
>                 Key: KAFKA-2017
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2017
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>    Affects Versions: 0.9.0.0
>            Reporter: Onur Karaman
>            Assignee: Guozhang Wang
>            Priority: Blocker
>             Fix For: 0.9.0.0
>
>         Attachments: KAFKA-2017.patch, KAFKA-2017_2015-05-20_09:13:39.patch, 
> KAFKA-2017_2015-05-21_19:02:47.patch
>
>
> When a coordinator fails, the group membership protocol tries to failover to 
> a new coordinator without forcing all the consumers rejoin their groups. This 
> is possible if the coordinator persists its state so that the state can be 
> transferred during coordinator failover. This state consists of most of the 
> information in GroupRegistry and ConsumerRegistry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
