Good morning, I'm hoping for some help understanding the expected behavior of an offset commit request and why such a request might fail on the broker.
*Context:*

My configuration looks like this:

- Three brokers
- Consumer offsets topic replication factor set to 3
- Auto commit enabled
- The user application topic, which I will call "my_topic", also has a replication factor of 3 and has 800 partitions
- 4000 consumers attached in consumer group "my_group"

(A minimal sketch of how each consumer is configured is in the P.S. below.)

*Issue:*

When I attach the consumers, the coordinator repeatedly logs the following error for each generation:

ERROR [Group Metadata Manager on Broker 0]: Appending metadata message for group my_group generation 2066 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.GroupMetadataManager)

*Observed behavior:*

The consumer group does not stay connected long enough to consume messages. It is effectively stuck in a rebalance loop, and the "my_topic" data has become unavailable.

*Investigation:*

Following the Group Metadata Manager code, it looks like the broker writes to a cache after it writes an Offset Commit Request to the log file. If this cache write fails, the broker logs the error above and returns an error code in the response. In this case, the error from the cache is MESSAGE_TOO_LARGE, which is logged as a RecordTooLargeException, but the broker then sets the error code to UNKNOWN on the Offset Commit Response. (I have paraphrased this flow as a second sketch in the P.P.S. below.)

It seems the issue is the size of the metadata in the Offset Commit Request, so I have the following questions:

1. What is the size limit for this request? Are we exceeding that limit, and is that why the request fails?
2. If this is a metadata size issue, what would cause abnormally large metadata?
3. How is this cache used within the broker?

Thanks in advance for any insights you can provide.

Regards,
Robert Quinlivan
Software Engineer, Signal
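P.S. In case it helps, here is a minimal sketch of how each consumer is configured and attached. The group id, topic name, and auto commit setting are as described above; the broker list, deserializers, commit interval, and poll loop are illustrative placeholders, not our exact application code.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MyTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative broker list; we run three brokers.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("group.id", "my_group");
        // Auto commit is enabled, so offsets are committed on the commit interval.
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "5000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    // Application processing happens here.
                }
            }
        }
    }
}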
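P.P.S. To make my reading of the coordinator code concrete, here is the flow I described in the Investigation section as a small self-contained sketch. Every name here (class, methods, enum) is a placeholder of my own, not the actual broker source; please correct me if this paraphrase is wrong.

// Paraphrase of my reading of the append path; all names are placeholders.
public class AppendFlowSketch {

    enum LogAppendError { NONE, MESSAGE_TOO_LARGE }

    // Stand-in for whatever the real append to the offsets topic log returns.
    static LogAppendError appendToOffsetsTopicLog(String group, byte[] payload) {
        // Pretend the record is rejected for being too large.
        return LogAppendError.MESSAGE_TOO_LARGE;
    }

    static void updateGroupCache(String group, byte[] payload) {
        // In my reading, the broker updates its in-memory group/offset cache here.
    }

    /** Returns the error code that would be placed in the client response. */
    static String handleAppend(String group, int generation, byte[] payload) {
        LogAppendError error = appendToOffsetsTopicLog(group, payload);
        if (error == LogAppendError.NONE) {
            updateGroupCache(group, payload);
            return "NONE";
        }
        // The append error (MESSAGE_TOO_LARGE in our case) is logged...
        System.err.printf(
            "Appending metadata message for group %s generation %d failed due to %s, "
            + "returning UNKNOWN error code to the client%n", group, generation, error);
        // ...but the code returned to the client is rewritten to UNKNOWN.
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(handleAppend("my_group", 2066, new byte[0]));
    }
}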