[ 
https://issues.apache.org/jira/browse/KAFKA-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988063#comment-16988063
 ] 

Guozhang Wang commented on KAFKA-9268:
--------------------------------------

[~ableegoldman] while reviewing the PR for KAFKA-9231 I had some thoughts 
about how `unknown producer id` can end up killing a thread (more details in 
the PR comment). I now think this is not a regression, since 1) the producer 
transaction manager code has not changed since EOS was first introduced, and 
2) the Streams code has never tried to handle `unknown producer id`. 

With KIP-360 the broker will no longer return this error, but only in newer 
versions (2.5+), so I think it is still on Streams to catch and handle it 
gracefully when talking to older brokers.

> Follow-on: Streams Threads may die from recoverable errors with EOS enabled
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-9268
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9268
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.2.0
>            Reporter: John Roesler
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: 2.2-eos-failures-1.txt, 2.2-eos-failures-2.txt
>
>
> While testing Streams in EOS mode under frequent and heavy network 
> partitions, I've encountered exceptions leading to thread death in both 2.2 
> and 2.3 (although different exceptions).
> I believe https://issues.apache.org/jira/browse/KAFKA-9231 addresses this 
> problem in 2.4+. However, if you look at 
> the ticket and corresponding PR, you will see that the solution there 
> introduced some tech debt around UnknownProducerId that needs to be cleaned 
> up. Therefore, I'm not backporting that fix to older branches. Rather, I'm 
> opening a new ticket to make more conservative changes in older branches to 
> improve resilience, if desired.
> These failures are relatively rare, so I don't think that a system or 
> integration test could reliably reproduce them. The steps to reproduce would be:
> 1. set up a long-running Streams application with EOS enabled (I used three 
> Streams instances)
> 2. inject periodic network partitions (I had each Streams instance schedule 
> an interruption at a random time between 0 and 3 hours, then schedule the 
> interruption to last a random duration between 0 and 5 minutes. The 
> interruptions are accomplished by using iptables to drop all traffic to/from 
> all three brokers)
> As for the actual errors I've observed, I'm attaching the logs of two 
> incidents in which a thread shut down.
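
The fault-injection step in the description (a random start within 0 to 3 
hours, a random duration of 0 to 5 minutes, iptables dropping all traffic 
to/from the brokers) could be sketched roughly as below. This is only an 
illustration, not the reporter's actual tooling: the function names, the 
broker addresses, and the choice of the INPUT/OUTPUT chains are assumptions. 
The code only builds the commands and does not run them.

```python
import random

MAX_DELAY_S = 3 * 60 * 60   # interruption starts within 0 to 3 hours
MAX_DURATION_S = 5 * 60     # interruption lasts 0 to 5 minutes


def plan_interruption(rng):
    """Pick a random start delay and duration, both in seconds."""
    delay = rng.uniform(0, MAX_DELAY_S)
    duration = rng.uniform(0, MAX_DURATION_S)
    return delay, duration


def iptables_rules(broker_ips, action):
    """Build iptables commands for every broker (hypothetical helper).

    action is "-A" to append the DROP rules (start the partition)
    or "-D" to delete them again (heal the partition).
    """
    cmds = []
    for ip in broker_ips:
        # Drop traffic arriving from the broker and traffic sent to it.
        cmds.append(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"])
        cmds.append(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"])
    return cmds
```

A driver on each Streams host would sleep for `delay`, run the "-A" commands 
(root is required), sleep for `duration`, and then run the matching "-D" 
commands before planning the next interruption.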



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
