[ https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696602#comment-14696602 ]
Ewen Cheslack-Postava commented on KAFKA-2397:
----------------------------------------------

[~becket_qin] I was more worried about figuring out what behavior was preferable first, then figuring out how to make it work with our code. I realize we'd need to expose some events in the lower level up to the code layered on it, but I don't see anything wrong with doing that; it just requires tracking some more state and relaying events as you described. Kicking the member out based on TCP disconnection seemed to cover more cases, so unless there was a problem with it, I figured it's worth the effort to try to make it work that way. Any system tests that forcibly kill Copycat workers are going to have the same issues I'm running into now. That isn't a huge problem since it's ok for some tests to take a long time, but it does have other impacts as well; for example, a crashed process will hold on to any assignments it has for up to the full session timeout, during which those assignments will not be processed (which, for Copycat, could potentially mean 30s worth of data lost if the source data is ephemeral, such as metrics).

[~hachikuji] I thought about proxies, but I couldn't come up with a scenario where the TCP connection to the coordinator would be closed due to a very short transient issue. But I definitely won't claim I know that it will never be the case or that I know all the weird things proxies might do under a variety of scenarios or configurations...

One problem with requiring an explicit leave group request/flag is that any crash still takes a lot of time to free up assigned partitions and keeps any members who are behaving properly from continuing to process their assigned work (since they discover the need for rebalance, invoke the rebalance revoked callback, and join group immediately). This means any process that crashes can gum up the works for all the other processes.
And some people prefer the fail-fast approach of crashing and recovering by restarting the process, so while they would obviously prefer crashes not happen, they also might expect to encounter this scenario semi-regularly and then find things grinding to a halt for 30s at a time.

> leave group request
> -------------------
>
>                 Key: KAFKA-2397
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2397
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>            Priority: Minor
>             Fix For: 0.8.3
>
>
> Let's say every consumer in a group has session timeout s. Currently, if a
> consumer leaves the group, the worst case time to stabilize the group is 2s
> (s to detect the consumer failure + s for the rebalance window). If a
> consumer instead can declare they are leaving the group, the worst case time
> to stabilize the group would just be the s associated with the rebalance
> window.
> This is a low priority optimization!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
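
For reference, the worst-case figures from the issue description can be worked through with a small sketch. This is only an illustration of the arithmetic (s to detect the failure + s for the rebalance window, versus s alone when the consumer announces its departure). The class and method names are hypothetical and not part of Kafka, and the 30s session timeout is an assumption taken from the "30s" figure mentioned in the comment.

```java
// Hypothetical helper illustrating the worst-case arithmetic from KAFKA-2397.
// None of these names exist in Kafka; they are for illustration only.
final class StabilizationTime {

    // No leave-group request: the coordinator needs up to one session timeout
    // to detect the failure, then up to one more for the rebalance window.
    static long worstCaseWithoutLeaveMs(long sessionTimeoutMs) {
        return 2 * sessionTimeoutMs;
    }

    // With an explicit leave-group request: detection is immediate, so only
    // the rebalance window remains.
    static long worstCaseWithLeaveMs(long sessionTimeoutMs) {
        return sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 30_000L; // assumed 30s session timeout
        System.out.println("without leave group: "
                + worstCaseWithoutLeaveMs(sessionTimeoutMs) + " ms");
        System.out.println("with leave group:    "
                + worstCaseWithLeaveMs(sessionTimeoutMs) + " ms");
    }
}
```

Note that a crashed process never sends a leave-group request, so under this model it still falls into the first case, which is the concern raised in the comment about crashes.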