[ https://issues.apache.org/jira/browse/KAFKA-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988063#comment-16988063 ]
Guozhang Wang commented on KAFKA-9268:
--------------------------------------

[~ableegoldman] While reviewing the PR for KAFKA-9231, I had some thoughts about the possible ways in which `unknown producer id` can kill a thread (more details in the PR comment). I now think it is not a regression, since 1) the producer transaction manager code has not changed since EOS was first introduced, and 2) the Streams code has never tried to handle `unknown producer id`. With KIP-360 the broker would no longer return this error, but only in newer versions (2.5+), so I think it is still on Streams to catch and handle it gracefully when talking to older brokers.

> Follow-on: Streams Threads may die from recoverable errors with EOS enabled
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-9268
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9268
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.2.0
>            Reporter: John Roesler
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: 2.2-eos-failures-1.txt, 2.2-eos-failures-2.txt
>
>
> While testing Streams in EOS mode under frequent and heavy network
> partitions, I've encountered exceptions leading to thread death in both 2.2
> and 2.3 (although different exceptions).
> I believe this problem is addressed in 2.4+ by
> https://issues.apache.org/jira/browse/KAFKA-9231 . However, if you look at
> the ticket and corresponding PR, you will see that the solution there
> introduced some tech debt around UnknownProducerId that needs to be cleaned
> up. Therefore, I'm not backporting that fix to older branches. Rather, I'm
> opening a new ticket to make more conservative changes in older branches to
> improve resilience, if desired.
> These failures are relatively rare, so I don't think that a system or
> integration test could reliably reproduce them. The steps to reproduce would be:
> 1. Set up a long-running Streams application with EOS enabled (I used three
> Streams instances).
> 2. Inject periodic network partitions. (I had each Streams instance schedule
> an interruption at a random time between 0 and 3 hours, then schedule the
> interruption to last a random duration between 0 and 5 minutes. The
> interruptions are accomplished by using iptables to drop all traffic to/from
> all three brokers.)
> As for the actual errors I've observed, I'm attaching the logs of two
> incidents in which a thread was caused to shut down.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)