I should also mention that this error was seen on broker version 0.10.1.1. This condition sounds somewhat similar to KAFKA-4362 <https://issues.apache.org/jira/browse/KAFKA-4362>, but that issue was fixed in 0.10.1.1, so they appear to be different issues.
On Wed, Mar 15, 2017 at 11:11 AM, Robert Quinlivan <rquinli...@signal.co> wrote:

> Good morning,
>
> I'm hoping for some help understanding the expected behavior for an
> offset commit request and why this request might fail on the broker.
>
> *Context:*
>
> For context, my configuration looks like this:
>
>    - Three brokers
>    - Consumer offsets topic replication factor set to 3
>    - Auto commit enabled
>    - The user application topic, which I will call "my_topic", has a
>      replication factor of 3 as well and 800 partitions
>    - 4000 consumers attached in consumer group "my_group"
>
> *Issue:*
>
> When I attach the consumers, the coordinator logs the following error
> message repeatedly for each generation:
>
> ERROR [Group Metadata Manager on Broker 0]: Appending metadata message
> for group my_group generation 2066 failed due to
> org.apache.kafka.common.errors.RecordTooLargeException, returning
> UNKNOWN error code to the client
> (kafka.coordinator.GroupMetadataManager)
>
> *Observed behavior:*
>
> The consumer group does not stay connected long enough to consume
> messages. It is effectively stuck in a rebalance loop, and the
> "my_topic" data has become unavailable.
>
> *Investigation:*
>
> Following the Group Metadata Manager code, it looks like the broker
> writes to a cache after it writes an Offset Commit Request to the log
> file. If this cache write fails, the broker logs this error and returns
> an error code in the response. In this case, the error from the cache
> is MESSAGE_TOO_LARGE, which is logged as a RecordTooLargeException.
> However, the broker then sets the error code to UNKNOWN on the Offset
> Commit Response.
>
> It seems that the issue is the size of the metadata in the Offset
> Commit Request. I have the following questions:
>
>    1. What is the size limit for this request? Are we exceeding that
>       limit, causing the request to fail?
>    2. If this is an issue with metadata size, what would cause
>       abnormally large metadata?
>    3. How is this cache used within the broker?
>
> Thanks in advance for any insights you can provide.
>
> Regards,
> Robert Quinlivan
> Software Engineer, Signal

--
Robert Quinlivan
Software Engineer, Signal
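
For concreteness, here is a minimal sketch of a consumer set up as described in the quoted message. The group id, topic name, and auto-commit setting come from that description; the broker addresses and string deserializers are assumptions added for illustration:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class MyTopicConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Hypothetical broker list; the description only says there are three brokers.
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
            props.put("group.id", "my_group");        // 4000 of these consumers join this group
            props.put("enable.auto.commit", "true");  // auto commit, interval left at its default
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("my_topic"));
            while (true) {
                // Offsets are committed automatically in the background while polling.
                ConsumerRecords<String, String> records = consumer.poll(100);
                // ... process records ...
            }
        }
    }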
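
On question 2 above: the per-partition metadata field in an Offset Commit Request is whatever string the client attaches when committing. The auto-commit path sends an empty string, but an application committing manually can attach an arbitrary metadata string, which is stored alongside the offset in the __consumer_offsets message. A sketch using the standard consumer API (the offset value and metadata string here are made up):

    import java.util.Collections;

    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    // Given a consumer as above, commit offset 42 for partition 0 of
    // "my_topic" with an application-defined metadata string attached.
    TopicPartition partition = new TopicPartition("my_topic", 0);
    OffsetAndMetadata committed = new OffsetAndMetadata(42L, "app-defined metadata");
    consumer.commitSync(Collections.singletonMap(partition, committed));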