[ https://issues.apache.org/jira/browse/KAFKA-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988022#comment-16988022 ]
Sophie Blee-Goldman commented on KAFKA-9268:
--------------------------------------------

Do you have any thoughts (or guesses) as to how far back this issue is present? It seems to affect every version we've tested going backwards, and it sounds more like an issue that has been present from the beginning than a regression that was introduced. I'm not saying we should fix it for all versions going back forever, but we should at least be clear on the ticket about which versions we expect to be affected, if it does seem highly likely to affect all versions.

> Follow-on: Streams Threads may die from recoverable errors with EOS enabled
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-9268
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9268
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.2.0
>            Reporter: John Roesler
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: 2.2-eos-failures-1.txt, 2.2-eos-failures-2.txt
>
>
> While testing Streams in EOS mode under frequent and heavy network
> partitions, I've encountered exceptions leading to thread death in both 2.2
> and 2.3 (although different exceptions).
> I believe this problem is addressed in 2.4+ by
> https://issues.apache.org/jira/browse/KAFKA-9231 . However, if you look at
> the ticket and corresponding PR, you will see that the solution there
> introduced some tech debt around UnknownProducerId that needs to be cleaned
> up. Therefore, I'm not backporting that fix to older branches. Rather, I'm
> opening a new ticket to make more conservative changes in older branches to
> improve resilience, if desired.
> These failures are relatively rare, so I don't think that a system or
> integration test could reliably reproduce them. The steps to reproduce would be:
> 1. set up a long-running Streams application with EOS enabled (I used three
> Streams instances)
> 2. inject periodic network partitions (I had each Streams instance schedule
> an interruption at a random time between 0 and 3 hours, then schedule the
> interruption to last a random duration between 0 and 5 minutes. The
> interruptions are accomplished by using iptables to drop all traffic to/from
> all three brokers)
> As far as the actual errors I've observed, I'm attaching the logs of two
> incidents in which a thread was caused to shut down.
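
For anyone trying to reproduce this, the interruption schedule in step 2 of the quoted description could be sketched roughly as below. This is only an illustration, not the harness used for the ticket: the class `PartitionPlan`, its fields, and the exact iptables rules are assumptions, since the description says only that iptables was used to drop all traffic to and from the brokers. The sketch builds the commands as strings and does not run them.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: plans one network interruption per the reproduction steps. */
final class PartitionPlan {
    // Bounds taken from the ticket: start within 0-3 hours, last 0-5 minutes.
    static final Duration MAX_DELAY = Duration.ofHours(3);
    static final Duration MAX_LENGTH = Duration.ofMinutes(5);

    final Duration delay;   // how long to wait before the interruption starts
    final Duration length;  // how long the interruption lasts

    PartitionPlan(Duration delay, Duration length) {
        this.delay = delay;
        this.length = length;
    }

    /** Picks a uniformly random delay in [0, 3h) and length in [0, 5min). */
    static PartitionPlan random(Random rng) {
        long delayMs = (long) (rng.nextDouble() * MAX_DELAY.toMillis());
        long lengthMs = (long) (rng.nextDouble() * MAX_LENGTH.toMillis());
        return new PartitionPlan(Duration.ofMillis(delayMs), Duration.ofMillis(lengthMs));
    }

    /**
     * Builds iptables commands for one broker address (assumed rule shape).
     * Pass "-A" to start dropping traffic and "-D" to remove the same rules.
     */
    static List<String> dropRules(String action, String brokerIp) {
        return Arrays.asList(
            "iptables " + action + " INPUT -s " + brokerIp + " -j DROP",
            "iptables " + action + " OUTPUT -d " + brokerIp + " -j DROP");
    }
}
```

Each Streams instance would draw a new `PartitionPlan.random(...)` after the previous interruption ends, apply the `-A` rules for all three brokers once `delay` has elapsed, and remove them with `-D` after `length`.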