Carrie Hazelton created KAFKA-15808:
---------------------------------------

             Summary: kstreams app consumer threads all fenced due to outdated 
static membership ids during group coordinator migration
                 Key: KAFKA-15808
                 URL: https://issues.apache.org/jira/browse/KAFKA-15808
             Project: Kafka
          Issue Type: Bug
          Components: group-coordinator, streams
    Affects Versions: 3.5.0
            Reporter: Carrie Hazelton
         Attachments: broker-50003-kafka.log, broker-55002-kafka.log, 
kstreams-55002-kafka.log

I am wondering if my team encountered a bug with static membership in a KStream 
application (river.stream-filter in attached logs). There was an abrupt 
shutdown of all of it's consumer threads after they all became fenced on 
membership id conflicts. It occurred during a rolling deployment to the Kafka 
broker fleet which triggered a migration of the group coordinator to a new 
broker (from kafka-broker-55002 to kafka-broker-50003).

My understanding is that a new group coordinator shouldn't start interacting 
with consumer threads until it has finished loading membership metadata during 
the migration. In this case, the new group coordinator started interacting with 
the thread consumers before it finished loading the latest membership 
information in the most recent generation of the membership metadata. As the 
migration was incomplete, the new group coordinator had outdated consumers 
membership IDs, which caused all the streaming app consumer threads to get 
fenced due to conflicting membership ids.

Major timeline in the attached logs:
 * 07:27:04 - 07:32:25 UTC: Deployment to broker kafka-broker-55002.
 * 07:32:07 - 07:32:10 UTC: 36 unavailable broker messages seen for 
kafka-broker-55002 and 3 for kafka-broker-50003 (overlapping in time).
 * 07:32:21,505 UTC: Controller kafka-broker-51005 adds broker 
kafka-broker-55002 back as a live broker.
 * 07:32:23,276 UTC: kafka-broker-50003 last mention of loading membership data.
 * 07:32:34,175 - 07:32:34,278 UTC - kstream app threads (consumers in group) 
using latest member IDs interact with group coordinator kafka-broker-50003 
which had outdated member IDs. Consumers threads from kstream app 
(river.stream-filter) are fenced and ejected from the consumer group.

When kafka-broker-50003 was assigned group coordinator it was loading the 
latest (737) generation membership metadata, but appears to have never 
finished. Review of a previous successful migration shows multiple log lines 
for a new generation, but only one was seen in this deployment. Below are some 
log lines that I found from a kstream host kafka.log that highlights the member 
id difference in for generation 737:

First generation and last seen log line on while the app was talking to group 
coordinator kafka-broker-50003:
{quote}
{{{{[2023-08-04 07:32:16,564] INFO Loaded member 
MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79,
 groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), 
clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer,
 clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, 
supportedProtocols=List(stream)) in group river.stream-filter.v1 with 
generation 686. (kafka.coordinator.group.GroupMetadata$) }}}}

{{{}{{}}{}}}{{{}{{...}}{}}}

{{{{[2023-08-04 07:32:23,276] INFO Loaded member 
MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79,
 groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), 
clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer,
 clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, 
supportedProtocols=List(stream)) in group river.stream-filter.v1 with 
generation 737. (kafka.coordinator.group.GroupMetadata$)}}}}
{quote}
The memberId was the same in both generations (ending with {{615b538a0e79)}}

The consumers got fenced after the controller sent a message around the time of 
the above log line.

Later kafka-broker-55002 gets the group coordinator assignment back. These log 
lines look similar to a previous successful migration, and the memberId changes 
(from ending with {{615b538a0e79}} to {{{}4ba4c7f17eea{}}})

 
{quote}{{[2023-08-04 07:32:44,388] INFO Loaded member 
MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-8a5b4e68-7065-4125-9ba0-615b538a0e79,
 groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), 
clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer,
 clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, 
supportedProtocols=List(stream)) in group river.stream-filter.v1 with 
generation 686. (kafka.coordinator.group.GroupMetadata$)}}

{{...}}

{{[2023-08-04 07:33:08,824] INFO Loaded member 
MemberMetadata(memberId=kafka-streams-55007.somewhere.com-2-e1dd9fd3-5887-4cd3-96a7-4ba4c7f17eea,
 groupInstanceId=Some(kafka-streams-55007.somewhere.com-2), 
clientId=river.stream-filter.v1-d3473182-6af5-410d-82ef-4b85ea1b9a66-StreamThread-2-consumer,
 clientHost=/XX.XX.XX.XX, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, 
supportedProtocols=List(stream)) in group river.stream-filter.v1 with 
generation 737. (kafka.coordinator.group.GroupMetadata$)}}
{quote}
I've attached more complete INFO kafka logs from the two group coordinators and 
a kstream app host.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to