[ https://issues.apache.org/jira/browse/KAFKA-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17916235#comment-17916235 ]
Luke Kirby (JIRAUSER302550) edited comment on KAFKA-8202 at 1/23/25 2:29 AM:
------------------------------------------------------------
Following the PR trail there, it appears there was some thought that [this commit|https://github.com/apache/kafka/commit/e4215c17846d8790712ee1764f3f852d99d3fc3a] might have fixed this issue, though the ticket was left open. We appear to have just encountered this in the wild in 3.4.0.
{code:java}
java.lang.StackOverflowError:
    org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:91)
    org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:91)
    org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:91)
    org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:91)
    ...
{code}
The logs revealed a similar pattern: MESSAGE_TOO_LARGE errors on incrementing correlation IDs, all with the same number of retries.

On further review, the root issue seems to be the one identified in KAFKA-8350: messages that will never actually be splittable, perhaps owing to an impossibly small max message size on the destination topic. That this manifests as a stack overflow is an additional problem, however.

> StackOverflowError on producer when splitting batches
> -----------------------------------------------------
>
> Key: KAFKA-8202
> URL: https://issues.apache.org/jira/browse/KAFKA-8202
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Daniel Krawczyk
> Assignee: Zhanxiang (Patrick) Huang
> Priority: Major
>
> Hello,
> recently we came across a StackOverflowError in the Kafka producer Java library. The error caused the Kafka producer to stop (we had to restart our service due to: IllegalStateException: Cannot perform operation after producer has been closed).
> The stack trace was as follows:
> {code:java}
> java.lang.StackOverflowError: null
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
>     at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.chain(FutureRecordMetadata.java:89)
> // […]
> {code}
> The piece of code responsible for the error:
> {code:java}
> /**
>  * This method is used when we have to split a large batch in smaller ones. A chained metadata will allow the
>  * future that has already returned to the users to wait on the newly created split batches even after the
>  * old big batch has been deemed as done.
>  */
> void chain(FutureRecordMetadata futureRecordMetadata) {
>     if (nextRecordMetadata == null)
>         nextRecordMetadata = futureRecordMetadata;
>     else
>         nextRecordMetadata.chain(futureRecordMetadata);
> }
> {code}
> Before the error occurred, we observed a large number of log entries about record batches being split (caused by MESSAGE_TOO_LARGE errors) on one of our topics (logged by org.apache.kafka.clients.producer.internals.Sender):
> {code:java}
> [Producer clientId=producer-1] Got error produce response in correlation id 158621342 on topic-partition <topic name>, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE
> {code}
> All log entries had different correlation IDs but the same attempts-left counter (2147483647), so they appeared to relate to different requests, each of which succeeded with no further retries.
> We are using the kafka-clients Java library version 2.0.0; the brokers are 2.1.1.
> Thanks in advance.
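The recursive {{chain()}} quoted in the issue uses one stack frame for every link already in the chain. A future that has been chained through a long run of splits can therefore presumably exhaust the thread stack, which would match the repeated {{chain}} frames in both stack traces. As an illustration only, the following minimal Java sketch performs the same append with a loop, so stack depth stays constant regardless of chain length. {{ChainedFuture}}, {{chainIterative}}, {{buildChain}} and {{ChainSketch}} are hypothetical names. They are not the kafka-clients API, and this is not necessarily how the linked commit approached the problem.

{code:java}
/**
 * Hypothetical stand-in for the "next" pointer inside FutureRecordMetadata.
 * This is NOT the real Kafka class; it only models the linked chain.
 */
final class ChainedFuture {
    final int id;        // label used only by this sketch
    ChainedFuture next;  // plays the role of nextRecordMetadata

    ChainedFuture(int id) {
        this.id = id;
    }

    /** Recursive form, shaped like the method quoted in the issue: one frame per existing link. */
    void chainRecursive(ChainedFuture future) {
        if (next == null)
            next = future;
        else
            next.chainRecursive(future);
    }

    /** Iterative form: walks to the tail in a loop, so stack usage does not grow with the chain. */
    void chainIterative(ChainedFuture future) {
        ChainedFuture current = this;
        while (current.next != null)
            current = current.next;
        current.next = future;
    }

    /** Counts the nodes reachable from this one, also without recursion. */
    int length() {
        int n = 0;
        for (ChainedFuture c = this; c != null; c = c.next)
            n++;
        return n;
    }
}

public class ChainSketch {
    /** Builds a chain of the given depth by appending at the tail (one assignment per link). */
    static ChainedFuture buildChain(int depth) {
        ChainedFuture head = new ChainedFuture(0);
        ChainedFuture tail = head;
        for (int i = 1; i < depth; i++) {
            ChainedFuture f = new ChainedFuture(i);
            tail.chainIterative(f); // tail.next is null here, so no walking happens
            tail = f;
        }
        return head;
    }

    public static void main(String[] args) {
        ChainedFuture head = buildChain(1_000_000);
        // Appending via the head walks all 1,000,000 links in a loop;
        // chainRecursive would need one stack frame per link here.
        head.chainIterative(new ChainedFuture(-1));
        System.out.println(head.length()); // prints 1000001
    }
}
{code}

A loop only removes the StackOverflowError symptom. If a record can never fit within the destination topic's maximum message size (the KAFKA-8350 scenario described in the comment), the producer would presumably still keep splitting and retrying, just without exhausting the stack.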