[ https://issues.apache.org/jira/browse/KAFKA-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Roesler updated KAFKA-9268:
--------------------------------
    Attachment: 2.2-eos-failures-1.txt
                2.2-eos-failures-2.txt

> Follow-on: Streams Threads may die from recoverable errors with EOS enabled
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-9268
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9268
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.2.0
>            Reporter: John Roesler
>            Assignee: John Roesler
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: 2.2-eos-failures-1.txt, 2.2-eos-failures-2.txt
>
>
> While testing Streams in EOS mode under frequent and heavy network
> partitions, I've encountered exceptions leading to thread death in both 2.2
> and 2.3 (although the exceptions differ between the two versions).
> I believe this problem is addressed in 2.4+ by
> https://issues.apache.org/jira/browse/KAFKA-9231. However, if you look at
> the ticket and corresponding PR, you will see that the solution there
> introduced some tech debt around UnknownProducerId that needs to be cleaned
> up. Therefore, I'm not backporting that fix to older branches. Rather, I'm
> opening a new ticket to make more conservative changes in older branches to
> improve resilience, if desired.
> These failures are relatively rare, so I don't think that a system or
> integration test could reliably reproduce them. The steps to reproduce would be:
> 1. Set up a long-running Streams application with EOS enabled (I used three
> Streams instances).
> 2. Inject periodic network partitions. (I had each Streams instance schedule
> an interruption at a random time between 0 and 3 hours, then schedule the
> interruption to last a random duration between 0 and 5 minutes. The
> interruptions are accomplished by using iptables to drop all traffic to/from
> all three brokers.)
> As for the actual errors I've observed, I'm attaching the logs of two
> incidents in which a thread shut down.
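
For anyone trying to reproduce this, the partition schedule in step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used for the original testing: the broker addresses, the function names, and the choice of a uniform distribution are all hypothetical, and the code only builds the iptables argument lists without running them.

```python
import random

# Hypothetical broker addresses; the ticket does not list the real ones.
BROKER_IPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

MAX_DELAY_S = 3 * 60 * 60   # interruption starts 0 to 3 hours from now
MAX_DURATION_S = 5 * 60     # interruption lasts 0 to 5 minutes


def plan_interruption(rng):
    """Pick a random start delay and duration, both in seconds.

    A uniform distribution is an assumption; the ticket only says "random".
    """
    delay_s = rng.uniform(0, MAX_DELAY_S)
    duration_s = rng.uniform(0, MAX_DURATION_S)
    return delay_s, duration_s


def iptables_rules(action, broker_ips):
    """Build iptables commands that drop traffic to/from each broker.

    action is "-A" to append the DROP rules (start the partition) or
    "-D" to delete them again (heal the partition).
    """
    rules = []
    for ip in broker_ips:
        # Drop packets arriving from the broker.
        rules.append(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"])
        # Drop packets sent to the broker.
        rules.append(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"])
    return rules


if __name__ == "__main__":
    delay_s, duration_s = plan_interruption(random.Random())
    print(f"partition in {delay_s:.0f}s, lasting {duration_s:.0f}s")
    for cmd in iptables_rules("-A", BROKER_IPS):
        print(" ".join(cmd))
```

To actually inject a partition, each instance would sleep for the delay, run the "-A" commands as root (for example via subprocess.run), sleep for the duration, run the matching "-D" commands, and then plan the next interruption.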