[ https://issues.apache.org/jira/browse/KAFKA-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17117039#comment-17117039 ]
Raman Gupta edited comment on KAFKA-10007 at 5/26/20, 8:59 PM:
---------------------------------------------------------------

Just happened to me again on a completely different 2.4.1 broker. The cluster was recently downscaled from 4 brokers to 1 and today when a client restarted, it had lost its offsets from 3 partitions out of 100. So it seems like just shutting down brokers is enough to cause this. This is a really serious issue and needs some attention.

was (Author: rocketraman):
    Just happened to me again on a completely different 2.4.1 broker. The cluster was recently downscaled from 4 brokers to 1 and today when a client restarted, it had lost its offsets from 3 partitions out of 100. This is a really serious issue and needs some attention.

> Kafka consumer offset reset despite recent group activity
> ---------------------------------------------------------
>
>                 Key: KAFKA-10007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10007
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Raman Gupta
>            Priority: Major
>
> I was running a Kafka 2.3.0 broker with the default values for
> `offsets.retention.minutes` (which should be 7 days as of 2.0.0). I deployed a
> 2.4.1 broker, along with a change in setting `offsets.retention.minutes` to 14
> days, as I have several low-traffic topics in which exactly-once processing
> is desired.
>
> As I understand it, with https://issues.apache.org/jira/browse/KAFKA-4682 and
> KIP-211, offsets should no longer be expired based on the last commit
> timestamp, but instead on the last time the group transitioned into an Empty
> state.
>
> However, the behavior I saw from Kafka upon broker shutdown was that the
> offsets were expired for a group when, as far as I can tell, they should not
> have been.
> See these logs from during the cluster recycle -- during this time
> the consumer, configured with the static group membership protocol, is always
> running:
> {code}
> <<Running Kafka 2.3.0, 4 brokers, all on 2.3, protocol version 2.3, offsets.retention.minutes using default value>>
> [2020-05-10 05:37:01,070] <<Shutting down kafka-0>>
> << Starting broker-0 on 2.4.1 with protocol version 2.3, offsets.retention.minutes = 10080 >>
> kafka-0 [2020-05-10 05:37:39,682] INFO starting (kafka.server.KafkaServer)
> kafka-0 [2020-05-10 05:39:42,680] INFO [GroupCoordinator 0]: Loading group metadata for produs-cis-CisFileEventConsumer with generation 27 (kafka.coordinator.group.GroupCoordinator)
> << Recycling broker-1 on 2.4.1, protocol version 2.3, offsets.retention.minutes = 10080, looks like the consumer fails because of the broker going down, and kafka-0 reports: >>
> kafka-0 [2020-05-10 05:45:14,121] INFO [GroupCoordinator 0]: Member cis-9c5d994c5-7hpqt-efced5ca-0b81-4720-992d-bdd8612519b3 in group produs-cis-CisFileEventConsumer has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
> kafka-0 [2020-05-10 05:45:14,124] INFO [GroupCoordinator 0]: Preparing to rebalance group produs-cis-CisFileEventConsumer in state PreparingRebalance with old generation 27 (__consumer_offsets-17) (reason: removing member cis-9c5d994c5-7hpqt-efced5ca-0b81-4720-992d-bdd8612519b3 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
> kafka-0 [2020-05-10 05:45:19,479] INFO [GroupCoordinator 0]: Member cis-9c5d994c5-sknlk-2b9ed8bf-348c-4a10-97d3-5f2caccce7df in group produs-cis-CisFileEventConsumer has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
> kafka-0 [2020-05-10 05:45:19,482] INFO [GroupCoordinator 0]: Group produs-cis-CisFileEventConsumer with generation 28 is now empty (__consumer_offsets-17) (kafka.coordinator.group.GroupCoordinator)
> << and now kafka-1 starts up again, the offsets are expired >>
> kafka-1 [2020-05-10 05:46:11,229] INFO starting (kafka.server.KafkaServer)
> ...
> kafka-0 [2020-05-10 05:47:42,303] INFO [GroupCoordinator 0]: Preparing to rebalance group produs-cis-CisFileEventConsumer in state PreparingRebalance with old generation 28 (__consumer_offsets-17) (reason: Adding new member cis-9c5d994c5-sknlk-1194b4b6-81ae-4a78-89a7-c610cf8c65be with group instanceid Some(cis-9c5d994c5-sknlk)) (kafka.coordinator.group.GroupCoordinator)
> kafka-0 [2020-05-10 05:47:47,611] INFO [GroupMetadataManager brokerId=0] Removed 43 expired offsets in 13 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
> kafka-0 [2020-05-10 05:48:12,308] INFO [GroupCoordinator 0]: Stabilized group produs-cis-CisFileEventConsumer generation 29 (__consumer_offsets-17) (kafka.coordinator.group.GroupCoordinator)
> kafka-0 [2020-05-10 05:48:12,311] INFO [GroupCoordinator 0]: Assignment received from leader for group produs-cis-CisFileEventConsumer for generation 29 (kafka.coordinator.group.GroupCoordinator)
> {code}
> The group becomes empty at 2020-05-10 05:45:19,482, and then the offsets are
> expired about two minutes later at 05:47:47,611. I can't see any reason based
> on my understanding of how things work for this to have happened, other than
> it being a bug of some type?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
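
As a rough check of the timeline in the logs above, the following sketch compares the gap between the "is now empty" log line and the "Removed 43 expired offsets" log line against the configured retention. It encodes only the reporter's reading of KIP-211 (expiry is measured from the moment the group became Empty), not Kafka's actual implementation. `parse_log_time` and `should_expire` are hypothetical helper names, and the 10080-minute retention is the value shown in the broker annotations.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the broker log lines quoted above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Value taken from the "offsets.retention.minutes = 10080" annotations.
RETENTION_MINUTES = 10080


def parse_log_time(text):
    """Parse a broker log timestamp such as '2020-05-10 05:45:19,482'."""
    return datetime.strptime(text, LOG_TIME_FORMAT)


def should_expire(empty_since, now, retention_minutes):
    """Assumed rule (the reporter's reading of KIP-211, not Kafka's code):
    offsets are eligible for expiry only once the group has been Empty
    for at least the retention period."""
    return now - empty_since >= timedelta(minutes=retention_minutes)


# Timestamps copied from the quoted log lines.
group_empty_at = parse_log_time("2020-05-10 05:45:19,482")
offsets_removed_at = parse_log_time("2020-05-10 05:47:47,611")

elapsed = offsets_removed_at - group_empty_at
retention = timedelta(minutes=RETENTION_MINUTES)

print("time spent Empty before removal:", elapsed)
print("configured retention:", retention)
print("expiry expected under the assumed rule:",
      should_expire(group_empty_at, offsets_removed_at, RETENTION_MINUTES))
```

Under that reading, the group had been empty for roughly two and a half minutes, far short of the 7 days that 10080 minutes represents, which is why the removal looks unexpected.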