[ 
https://issues.apache.org/jira/browse/KAFKA-18654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-18654:
-----------------------------------
    Description: 
https://issues.apache.org/jira/browse/KAFKA-18575 solved a critical race 
condition by returning with CONCURRENT_TRANSACTIONS early when the transaction 
was still completing. 
In testing, it was discovered that this early return could cause performance 
regressions.

Prior to KIP-890, the AddPartitionsToTxn call was a separate request sent by 
the producer. An earlier change, 
https://issues.apache.org/jira/browse/KAFKA-5477, decreased the retry backoff 
for that request. With KIP-890 the call is made through the produce path, so 
retries fall back to the default retry backoff, which is longer. Before 
KAFKA-18575 there was a slight delay when sending to the coordinator, so the 
broker was less likely to return quickly and leave the producer stuck in this 
backoff. 

There are two ways to address this regression:
1. Solve KAFKA-18575 via the other solution proposed for that ticket: don't 
return early, and check the epoch to avoid the verification guard race.
2. With the bumped produce version, return CONCURRENT_TRANSACTIONS and change 
produce handling to use a shorter backoff for this error. 

 

We ended up taking approach 1 and including a server-side backoff/retry 
mechanism rather than relying on the client to back off and retry. We also 
introduced new configurations in the KIP to control the backoff mechanisms.
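
As a rough illustration of what a server-side backoff/retry can look like, here 
is a minimal Java sketch. It is not the actual broker code: the class, the 
method, the Result enum, and the backoffMs/maxTimeoutMs parameters are all 
hypothetical stand-ins for the real implementation and for the configurations 
added in the KIP. The idea shown is only that the server keeps retrying the 
verification for a bounded time instead of returning CONCURRENT_TRANSACTIONS to 
the client on the first attempt.

```java
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Kafka codebase.
final class ServerSideRetrySketch {
    enum Result { NONE, CONCURRENT_TRANSACTIONS }

    /**
     * Runs the verification attempt repeatedly on the server side. Returns as
     * soon as the attempt stops reporting CONCURRENT_TRANSACTIONS, or returns
     * CONCURRENT_TRANSACTIONS once another backoff would exceed the time budget.
     */
    static Result verifyWithRetry(Supplier<Result> attempt, long backoffMs, long maxTimeoutMs) {
        long deadlineNanos = System.nanoTime() + maxTimeoutMs * 1_000_000L;
        while (true) {
            Result result = attempt.get();
            if (result != Result.CONCURRENT_TRANSACTIONS) {
                return result; // previous transaction finished completing
            }
            if (System.nanoTime() + backoffMs * 1_000_000L > deadlineNanos) {
                return result; // budget spent: let the client handle the error
            }
            try {
                Thread.sleep(backoffMs); // short server-side backoff
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return result;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated transaction that is still completing for the first two attempts.
        Result r = verifyWithRetry(
            () -> ++calls[0] < 3 ? Result.CONCURRENT_TRANSACTIONS : Result.NONE, 5, 100);
        System.out.println(r + " after " + calls[0] + " attempts");
    }
}
```

A real broker would presumably schedule the retry asynchronously rather than 
sleep on a request thread; the blocking loop is used here only to keep the 
sketch short.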

  was:
https://issues.apache.org/jira/browse/KAFKA-18575 solved a critical race 
condition by returning with CONCURRENT_TRANSACTIONS early when the transaction 
was still completing. 
In testing, it was discovered that this early return could cause performance 
regressions.

Prior to KIP-890 the addpartitions call was a separate call from the producer. 
There was a previous change https://issues.apache.org/jira/browse/KAFKA-5477 
that decreased the retry backoff. With KIP-890 and making the call through the 
produce path, we go back to the default retry backoff which takes longer. Prior 
to 18575 we introduce a slight delay when sending to the coordinator, so prior 
to 18575, we are less likely to return quickly and get stuck in this backoff. 



There are two ways to address this regression:
1. Solve 18575 via the other proposed solution for that ticket, don't return 
early and check the epoch to avoid the verification guard race
2. With the bumped produce version, return concurrent transactions and change 
produce handling to have a shorter backoff for this error. 


> Transaction Version 2 performance regression due to early return
> ----------------------------------------------------------------
>
>                 Key: KAFKA-18654
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18654
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Justine Olshan
>            Assignee: Justine Olshan
>            Priority: Blocker
>             Fix For: 4.0.0
>
>
> https://issues.apache.org/jira/browse/KAFKA-18575 solved a critical race 
> condition by returning with CONCURRENT_TRANSACTIONS early when the 
> transaction was still completing. 
> In testing, it was discovered that this early return could cause performance 
> regressions.
> Prior to KIP-890, the AddPartitionsToTxn call was a separate request sent by 
> the producer. An earlier change, 
> https://issues.apache.org/jira/browse/KAFKA-5477, decreased the retry backoff 
> for that request. With KIP-890 the call is made through the produce path, so 
> retries fall back to the default retry backoff, which is longer. Before 
> KAFKA-18575 there was a slight delay when sending to the coordinator, so the 
> broker was less likely to return quickly and leave the producer stuck in this 
> backoff. 
> There are two ways to address this regression:
> 1. Solve KAFKA-18575 via the other solution proposed for that ticket: don't 
> return early, and check the epoch to avoid the verification guard race.
> 2. With the bumped produce version, return CONCURRENT_TRANSACTIONS and change 
> produce handling to use a shorter backoff for this error. 
>  
> We ended up taking approach 1 and including a server-side backoff/retry 
> mechanism rather than relying on the client to back off and retry. We also 
> introduced new configurations in the KIP to control the backoff mechanisms.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
