FrancisGodinho opened a new pull request, #21161:
URL: https://github.com/apache/kafka/pull/21161

   # Problem 
   During broker upgrades, the `sendOffsetsToTransaction` call would sometimes 
hang. Logs showed that it continuously returned `errorCode=51`, which is 
`CONCURRENT_TRANSACTIONS`. The test would eventually hit its timeout and fail. 
This happened for every version upgrade and occurred in around 30% of 
the runs.
   
   # Resolution 
   The problem above left the producer in a broken state that did not resolve 
itself even after 5-10 minutes of waiting (a few minutes past the 
transaction.max.ms time). I tried multiple solutions, including waiting for 
extended periods of time and retrying `sendOffsetsToTransaction` each time a 
timeout occurred.
   
   Unfortunately, the producer was just permanently stuck and always receiving 
the `errorCode=51`. In this case, the recommended resolution in the Kafka docs 
is to close the previous producer and create a new producer. 
https://kafka.apache.org/documentation/#usingtransactions 
   <img width="652" height="59" alt="image" 
src="https://github.com/user-attachments/assets/e95500d6-f1b6-44fa-b6a2-5c1800448d32"
 />
   
   Using the old `transactional.id` would continue to lead to a stuck state, so 
this fix creates a brand new producer with a new ID and then rewinds the 
consumer to its last committed offsets to ensure exactly-once delivery (EOD).
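
   For reviewers, here is a rough sketch of the recovery flow described above. 
It is a simulation only, not the code in this PR: `FakeProducer`, 
`FakeConsumer`, `StuckTransactionError`, and the ID `copier-0-retry-1` are 
hypothetical in-memory stand-ins and do not correspond to Kafka client APIs.

```python
class StuckTransactionError(Exception):
    """Stand-in for sendOffsetsToTransaction timing out on errorCode=51."""


class FakeProducer:
    """Hypothetical in-memory stand-in for a transactional producer."""

    def __init__(self, transactional_id, stuck=False):
        self.transactional_id = transactional_id
        self.stuck = stuck
        self.closed = False

    def send_offsets_to_transaction(self, offsets):
        # A stuck producer never succeeds, mirroring the behavior seen in the logs.
        if self.stuck:
            raise StuckTransactionError(self.transactional_id)
        return dict(offsets)

    def close(self):
        self.closed = True


class FakeConsumer:
    """Hypothetical stand-in holding committed offsets and read positions per partition."""

    def __init__(self, committed, position):
        self.committed = dict(committed)
        self.position = dict(position)

    def seek(self, partition, offset):
        self.position[partition] = offset


def recover(producer, consumer, new_transactional_id):
    """Replace a stuck producer and rewind the consumer to its committed offsets."""
    producer.close()
    # Use a fresh ID: per this PR, reusing the old one stayed stuck.
    new_producer = FakeProducer(new_transactional_id)
    # Records read during the failed transaction were never committed,
    # so move each partition back to its committed offset and re-read them.
    for partition, offset in consumer.committed.items():
        consumer.seek(partition, offset)
    return new_producer


def commit_offsets(producer, consumer, new_transactional_id):
    """Return (producer_to_use_next, succeeded)."""
    try:
        producer.send_offsets_to_transaction(consumer.position)
        return producer, True
    except StuckTransactionError:
        return recover(producer, consumer, new_transactional_id), False


if __name__ == "__main__":
    old = FakeProducer("copier-0", stuck=True)
    consumer = FakeConsumer(committed={0: 100}, position={0: 140})
    new, ok = commit_offsets(old, consumer, "copier-0-retry-1")
    print(ok, old.closed, new.transactional_id, consumer.position)
```

   The rewind is what keeps the copy exactly-once in this sketch: offsets 
100-139 were consumed but never committed, so they have to be read again and 
sent through the new producer.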
   
   # Testing and Validation
   Previously, I was able to run the test for a single version upgrade and have 
it fail within the first 5-10 runs. After the fix, I was able to run it 40 
times continuously with 0 failures. I also ran the full test (all versions) ~5 
times with 9/9 cases passing. 

