[ https://issues.apache.org/jira/browse/KAFKA-18654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justine Olshan updated KAFKA-18654:
-----------------------------------
    Description:
https://issues.apache.org/jira/browse/KAFKA-18575 solved a critical race condition by returning with CONCURRENT_TRANSACTIONS early when the transaction was still completing.

In testing, it was discovered that this early return could cause performance regressions.

Prior to KIP-890 the addpartitions call was a separate call from the producer. There was a previous change https://issues.apache.org/jira/browse/KAFKA-5477 that decreased the retry backoff. With KIP-890 and making the call through the produce path, we go back to the default retry backoff, which takes longer. Prior to KAFKA-18575 there was a slight delay when sending to the coordinator, so we were less likely to return quickly and get stuck in this backoff.

There are two ways to address this regression:
1. Solve KAFKA-18575 via the other proposed solution for that ticket: don't return early, and check the epoch to avoid the verification guard race.
2. With the bumped produce version, return CONCURRENT_TRANSACTIONS and change produce handling to have a shorter backoff for this error.

We ended up taking approach 1 and including a server-side backoff/retry mechanism rather than relying on the client to back off and retry. We also introduced new configurations in the KIP to control the backoff mechanisms.

  was:
https://issues.apache.org/jira/browse/KAFKA-18575 solved a critical race condition by returning with CONCURRENT_TRANSACTIONS early when the transaction was still completing.

In testing, it was discovered that this early return could cause performance regressions.

Prior to KIP-890 the addpartitions call was a separate call from the producer. There was a previous change https://issues.apache.org/jira/browse/KAFKA-5477 that decreased the retry backoff. With KIP-890 and making the call through the produce path, we go back to the default retry backoff, which takes longer.
Prior to KAFKA-18575 there was a slight delay when sending to the coordinator, so we were less likely to return quickly and get stuck in this backoff.

There are two ways to address this regression:
1. Solve KAFKA-18575 via the other proposed solution for that ticket: don't return early, and check the epoch to avoid the verification guard race.
2. With the bumped produce version, return CONCURRENT_TRANSACTIONS and change produce handling to have a shorter backoff for this error.


> Transaction Version 2 performance regression due to early return
> ----------------------------------------------------------------
>
>                 Key: KAFKA-18654
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18654
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Justine Olshan
>            Assignee: Justine Olshan
>            Priority: Blocker
>             Fix For: 4.0.0
>
>
> https://issues.apache.org/jira/browse/KAFKA-18575 solved a critical race
> condition by returning with CONCURRENT_TRANSACTIONS early when the
> transaction was still completing.
> In testing, it was discovered that this early return could cause performance
> regressions.
> Prior to KIP-890 the addpartitions call was a separate call from the
> producer. There was a previous change
> https://issues.apache.org/jira/browse/KAFKA-5477 that decreased the retry
> backoff. With KIP-890 and making the call through the produce path, we go
> back to the default retry backoff, which takes longer. Prior to KAFKA-18575
> there was a slight delay when sending to the coordinator, so we were less
> likely to return quickly and get stuck in this backoff.
> There are two ways to address this regression:
> 1. Solve KAFKA-18575 via the other proposed solution for that ticket: don't
> return early, and check the epoch to avoid the verification guard race.
> 2. With the bumped produce version, return CONCURRENT_TRANSACTIONS and change
> produce handling to have a shorter backoff for this error.
>
> We ended up taking approach 1 and including a server-side backoff/retry
> mechanism rather than relying on the client to back off and retry. We also
> introduced new configurations in the KIP to control the backoff mechanisms.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
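
For readers following the ticket, the server-side backoff/retry idea from approach 1 can be sketched roughly as below. This is not the actual Kafka broker code: the class name, the Result enum, the verifyWithRetry method, and the backoffMs/maxTimeoutMs parameters are all hypothetical names chosen for illustration, and only the CONCURRENT_TRANSACTIONS error comes from the ticket itself. The sketch assumes the server re-checks while the previous transaction is still completing, and only surfaces the error to the client once a bounded retry window is used up.

```java
import java.util.function.Supplier;

public class ServerSideRetrySketch {

    /** Hypothetical result codes; only CONCURRENT_TRANSACTIONS is named in the ticket. */
    enum Result { NONE, CONCURRENT_TRANSACTIONS }

    /**
     * Re-runs `attempt` while it reports CONCURRENT_TRANSACTIONS, sleeping
     * backoffMs between tries. Gives up and returns the error once another
     * sleep would exceed maxTimeoutMs (both parameters are illustrative).
     */
    static Result verifyWithRetry(Supplier<Result> attempt, long backoffMs, long maxTimeoutMs) {
        long deadline = System.nanoTime() + maxTimeoutMs * 1_000_000L;
        while (true) {
            Result r = attempt.get();
            if (r != Result.CONCURRENT_TRANSACTIONS) {
                return r; // previous transaction finished completing (or another outcome)
            }
            if (System.nanoTime() + backoffMs * 1_000_000L > deadline) {
                return r; // retry window exhausted: let the client see the error
            }
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return r;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated check: the previous transaction is still completing for the
        // first two attempts, then the third attempt succeeds.
        Result r = verifyWithRetry(
            () -> ++calls[0] < 3 ? Result.CONCURRENT_TRANSACTIONS : Result.NONE,
            10,   // backoff between server-side retries, in ms
            100); // upper bound on total time spent retrying, in ms
        System.out.println(r + " after " + calls[0] + " attempts");
    }
}
```

The point of retrying on the server is that the client never enters its own, longer default retry backoff for the common case where the previous transaction completes within a few milliseconds.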