[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696602#comment-14696602
 ] 

Ewen Cheslack-Postava commented on KAFKA-2397:
----------------------------------------------

[~becket_qin] I was more worried about figuring out what behavior was 
preferable first, then figuring out how to make it work with our code. I 
realize we'd need to expose some events in the lower level up to the code 
layered on it, but I don't see anything wrong with doing that; it just requires 
tracking some more state and relaying events as you described. Kicking the 
member out based on TCP disconnection seemed to cover more cases, so unless 
there was a problem with it, I figured it's worth the effort to try to make it 
work that way.

Any system tests that forcibly kill Copycat workers are going to have the same 
issues I'm running into now. That isn't a huge problem since it's ok for some 
tests to take a long time, but it does have other impacts as well. For example, 
a crashed process will hold on to any assignments it has for up to the full 
session timeout, during which those assignments will not be processed (which, 
for Copycat, could potentially mean 30s worth of data lost if the source data 
is ephemeral, such as metrics).

[~hachikuji] I thought about proxies, but I couldn't come up with a scenario 
where the TCP connection to the coordinator would be closed due to a very short 
transient issue. But I definitely won't claim I know that it will never be the 
case or that I know all the weird things proxies might do under a variety of 
scenarios or configurations...

One problem with requiring an explicit leave group request/flag is that any 
crash still takes a lot of time to free up assigned partitions and keeps any 
members who are behaving properly from continuing to process their assigned 
work (since they discover the need for rebalance, invoke the rebalance revoked 
callback, and join group immediately). This means any process that crashes can 
gum up the works for all the other processes. And some people prefer the 
fail-fast approach (crash, then recover by restarting the process), so while 
they would obviously prefer crashes not happen, they also might expect to 
encounter this scenario semi-regularly and then find things grinding to a halt 
for 30s at a time.
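
To make the comparison concrete, here is a rough sketch of the worst-case 
arithmetic for the three detection policies discussed in this thread. It is a 
toy model, not Kafka code: the function names and policy labels are made up for 
illustration, the rebalance window is taken to equal the session timeout s (as 
in the issue description), and TCP-disconnect detection is assumed to be 
immediate, which only holds when the OS actually closes the socket (a process 
crash, not a host or network failure).

```python
# Toy model of worst-case group stabilization time. All names here are
# hypothetical and exist only for this illustration.

def detection_delay(event, policy, session_timeout):
    """Worst-case seconds before the coordinator notices the member is gone."""
    if policy == "leave_request" and event == "clean_shutdown":
        # The member announced its departure explicitly.
        return 0
    if policy == "tcp_disconnect":
        # Assumption: the socket close reaches the coordinator right away.
        return 0
    # Otherwise the coordinator has to wait out the full session timeout.
    return session_timeout


def worst_case_stabilization(event, policy, session_timeout=30):
    """Detection delay plus a rebalance window equal to the session timeout."""
    rebalance_window = session_timeout
    return detection_delay(event, policy, session_timeout) + rebalance_window


if __name__ == "__main__":
    for event in ("clean_shutdown", "crash"):
        for policy in ("timeout_only", "leave_request", "tcp_disconnect"):
            total = worst_case_stabilization(event, policy)
            print(f"{event:15s} {policy:15s} {total}s")
```

With s = 30, a crash costs 60s under both the timeout-only and the explicit 
leave request policies, and 30s under the disconnect-based policy. The explicit 
request only helps on a clean shutdown, which is the gap described above.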

> leave group request
> -------------------
>
>                 Key: KAFKA-2397
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2397
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>            Priority: Minor
>             Fix For: 0.8.3
>
>
> Let's say every consumer in a group has session timeout s. Currently, if a 
> consumer leaves the group, the worst case time to stabilize the group is 2s 
> (s to detect the consumer failure + s for the rebalance window). If a 
> consumer instead can declare they are leaving the group, the worst case time 
> to stabilize the group would just be the s associated with the rebalance 
> window.
> This is a low priority optimization!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
