[ 
https://issues.apache.org/jira/browse/KAFKA-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117039#comment-17117039
 ] 

Raman Gupta edited comment on KAFKA-10007 at 5/26/20, 8:59 PM:
---------------------------------------------------------------

Just happened to me again, on a completely different 2.4.1 broker. The cluster 
was recently downscaled from 4 brokers to 1, and today when a client restarted, 
it had lost its offsets for 3 of its 100 partitions. So it seems that simply 
shutting down brokers is enough to trigger this. This is a really serious issue 
and needs some attention.


was (Author: rocketraman):
Just happened to me again on a completely different 2.4.1 broker. The cluster 
was recently downscaled from 4 brokers to 1 and today when a client restarted, 
it had lost its offsets from 3 partitions out of 100.  This is a really serious 
issue and needs some attention.

> Kafka consumer offset reset despite recent group activity
> ---------------------------------------------------------
>
>                 Key: KAFKA-10007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10007
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Raman Gupta
>            Priority: Major
>
> I was running a Kafka 2.3.0 broker with the default value for 
> `offsets.retention.minutes` (which should be 7 days as of 2.0.0). I deployed a 
> 2.4.1 broker, along with a change setting `offsets.retention.minutes` to 14 
> days, as I have several low-traffic topics for which exactly-once processing 
> is desired.
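> For reference, a minimal sketch of that broker-side setting as I would apply 
> it (assuming it goes into server.properties; the concrete deployment mechanism 
> is not shown here):
> {code}
> # server.properties -- illustrative sketch, not the actual deployment config
> # 14 days expressed in minutes: 14 * 24 * 60 = 20160
> offsets.retention.minutes=20160
> {code}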
> As I understand it, with https://issues.apache.org/jira/browse/KAFKA-4682 and 
> KIP-211, offsets should no longer be expired based on the last commit 
> timestamp, but instead based on the last time the group transitioned into the 
> Empty state.
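> To make that expectation concrete, here is an illustrative sketch of the 
> KIP-211 rule as I understand it (this is not the actual broker source; the 
> types and names below are invented for clarity):
> {code}
> // Illustrative sketch of the KIP-211 expiration rule as I understand it.
> // Not the actual GroupMetadataManager code; names are invented for clarity.
> enum GroupState { STABLE, PREPARING_REBALANCE, COMPLETING_REBALANCE, EMPTY, DEAD }
> 
> final class OffsetExpirationSketch {
>     static boolean offsetsEligibleForExpiration(GroupState state,
>                                                 long emptySinceTimestampMs,
>                                                 long nowMs,
>                                                 long offsetsRetentionMs) {
>         // Per KIP-211, offsets should only expire once the group has been Empty
>         // for the full retention period -- not based on the last commit time.
>         return state == GroupState.EMPTY
>                 && nowMs - emptySinceTimestampMs >= offsetsRetentionMs;
>     }
> }
> {code}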
> However, the behavior I saw from Kafka upon broker shutdown was that the 
> offsets for a group were expired when, as far as I can tell, they should not 
> have been. See these logs from during the cluster recycle -- during this 
> entire window the consumer, configured with static group membership (see the 
> configuration sketch after the logs), is running:
> {code}
> <<Running Kafka 2.3.0, 4 brokers, all on 2.3, protocol version 2.3, 
> offsets.retention.minutes using default value>>
> [2020-05-10 05:37:01,070] <<Shutting down kafka-0>>
> << Starting broker-0 on 2.4.1 with protocol version 2.3, 
> offsets.retention.minutes = 10080 >>
> kafka-0       [2020-05-10 05:37:39,682] INFO starting 
> (kafka.server.KafkaServer)
> kafka-0       [2020-05-10 05:39:42,680] INFO [GroupCoordinator 0]: Loading 
> group metadata for produs-cis-CisFileEventConsumer with generation 27 
> (kafka.coordinator.group.GroupCoordinator)
> << Recycling broker-1 on 2.4.1, protocol version 2.3, 
> offsets.retention.minutes = 10080, looks like the consumer fails because of 
> the broker going down, and kafka-0 reports: >>
> kafka-0       [2020-05-10 05:45:14,121] INFO [GroupCoordinator 0]: Member 
> cis-9c5d994c5-7hpqt-efced5ca-0b81-4720-992d-bdd8612519b3 in group 
> produs-cis-CisFileEventConsumer has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
> kafka-0       [2020-05-10 05:45:14,124] INFO [GroupCoordinator 0]: Preparing 
> to rebalance group produs-cis-CisFileEventConsumer in state 
> PreparingRebalance with old generation 27 (__consumer_offsets-17) (reason: 
> removing member cis-9c5d994c5-7hpqt-efced5ca-0b81-4720-992d-bdd8612519b3 on 
> heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
> kafka-0       [2020-05-10 05:45:19,479] INFO [GroupCoordinator 0]: Member 
> cis-9c5d994c5-sknlk-2b9ed8bf-348c-4a10-97d3-5f2caccce7df in group 
> produs-cis-CisFileEventConsumer has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
> kafka-0       [2020-05-10 05:45:19,482] INFO [GroupCoordinator 0]: Group 
> produs-cis-CisFileEventConsumer with generation 28 is now empty 
> (__consumer_offsets-17) (kafka.coordinator.group.GroupCoordinator)
> << and now kafka-1 starts up again, the offsets are expired >>
> kafka-1       [2020-05-10 05:46:11,229] INFO starting 
> (kafka.server.KafkaServer)
> ...
> kafka-0       [2020-05-10 05:47:42,303] INFO [GroupCoordinator 0]: Preparing 
> to rebalance group produs-cis-CisFileEventConsumer in state 
> PreparingRebalance with old generation 28 (__consumer_offsets-17) (reason: 
> Adding new member cis-9c5d994c5-sknlk-1194b4b6-81ae-4a78-89a7-c610cf8c65be 
> with group instanceid Some(cis-9c5d994c5-sknlk)) 
> (kafka.coordinator.group.GroupCoordinator)
> kafka-0       [2020-05-10 05:47:47,611] INFO [GroupMetadataManager 
> brokerId=0] Removed 43 expired offsets in 13 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)
> kafka-0       [2020-05-10 05:48:12,308] INFO [GroupCoordinator 0]: Stabilized 
> group produs-cis-CisFileEventConsumer generation 29 (__consumer_offsets-17) 
> (kafka.coordinator.group.GroupCoordinator)
> kafka-0       [2020-05-10 05:48:12,311] INFO [GroupCoordinator 0]: Assignment 
> received from leader for group produs-cis-CisFileEventConsumer for generation 
> 29 (kafka.coordinator.group.GroupCoordinator)
> {code}
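> For context, the consumer in these logs uses static group membership. A 
> minimal sketch of the client-side configuration in play (Java consumer; the 
> bootstrap address and session timeout below are illustrative, only the group 
> id and group.instance.id come from the logs):
> {code}
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.serialization.StringDeserializer;
> 
> public final class StaticMemberConsumer {
>     // Builds a consumer with static group membership, as used by the group in
>     // these logs. Values marked "illustrative" are assumptions, not from the report.
>     public static KafkaConsumer<String, String> create() {
>         Properties props = new Properties();
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-0:9092"); // illustrative
>         props.put(ConsumerConfig.GROUP_ID_CONFIG, "produs-cis-CisFileEventConsumer");
>         props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "cis-9c5d994c5-sknlk"); // static member id from the logs
>         props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000"); // illustrative
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
>         return new KafkaConsumer<>(props);
>     }
> }
> {code}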
> The group becomes empty at 2020-05-10 05:45:19,482, and the offsets are then 
> expired about two minutes later at 05:47:47,611. Based on my understanding of 
> how offset expiration is supposed to work, I can't see any reason for this to 
> have happened, other than it being a bug of some kind.
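> For anyone trying to reproduce this, the committed offsets for the group can 
> be compared before and after a broker recycle with the standard tooling (the 
> bootstrap address below is illustrative):
> {code}
> # Describe the group's committed offsets; run before and after the recycle
> # and diff the CURRENT-OFFSET column (bootstrap address is illustrative).
> bin/kafka-consumer-groups.sh --bootstrap-server kafka-0:9092 \
>   --describe --group produs-cis-CisFileEventConsumer
> {code}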



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
