[ https://issues.apache.org/jira/browse/KAFKA-18957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948960#comment-17948960 ]
Travis Bischel commented on KAFKA-18957: ---------------------------------------- As a point of consolation, I have not ran into this in Kafka 4 with KIP-890 fully implemented. > Impossible to recover hanging transactions in Kafka 3.9 in KRaft mode using > EOS > ------------------------------------------------------------------------------- > > Key: KAFKA-18957 > URL: https://issues.apache.org/jira/browse/KAFKA-18957 > Project: Kafka > Issue Type: Bug > Components: kraft > Affects Versions: 3.8.1, 3.9.0 > Reporter: Travis Bischel > Priority: Major > Attachments: docker-compose.yml, failures.tar.gz, log4j.properties, > script > > > h1. The problem > I am able to semi-reliably reproduce hanging unrecoverable transactions on > Kafka 3.8 using the bitnami container in KRaft mode with a cluster size of > both 1 and 3. In the failure scenario, a transaction enters and permanently > hangs in either PrepareCommit or PrepareAbort. To try to fix the hanging > transaction, I have tried to: > * Wait until after the transactional timeout. Even after waiting hours just > to see, the transaction hung. > * Send a InitProducerID request with the LastProducerID and > LastProducerEpoch fields set. I receive the CONCURRENT_TRANSACTIONS error > code back, permanently (i.e. even after retrying this request for minutes) > * I have tried to initialize a _new_ transaction – that is, send > InitProducerID with LastProducerID=-1 and LastProducerEpoch=-1. Again, I > permanently receive CONCURRENT_TRANSACTIONS. > * I have tried to use kafka-transactions.sh to try to forcefully abort the > transaction. Nothing happens; the transaction still shows as PrepareCommit or > PrepareAbort. > I encountered this error first while trying to add support for KIP-848 to my > franz-go client. I first thought this was an error with the branch, but to > double check, I looped my integration tests on the main branch against Kafka > 3.9 and periodically encountered the same permanent-hang failure. This is a > new failure; my tests have reliably passed against Kafka for years. I *do* > encounter this failure with eager consumer rebalancers, but it is less > common. Where the cooperative-sticky test suite may fail after three or four > runs, the range test suite (same tests, just using a different balancer) > fails after ten or so. > h1. Description of the test suite > franz-go has an integration test that > * Starts one transactional producer that produces 500K records to t1 (the > name in the tests is random), ending and beginning transactions every 10K > records > ([here|https://github.com/twmb/franz-go/blob/master/pkg/kgo/txn_test.go#L29]) > * Starts a chain of EOS consumer/producers. The first level (c1) consumes > from t1 and produces to t2, the second level (c2) consumes from t2 and > produces to t3, and the third level (c3) consumes from t3 and produces to t4. > * Starts a second set of EOS consumer/producers. One tries to join c1, but > aborts the first batch consumed and then leaves the group. Another also joins > c1 and aborts and leaves after two successful commits. The third joins c2 and > also aborts and leaves after two successful commits. > * After another 5 seconds, a second consumer that survives until the end > joins each of c1, c2, and c3 (to be clear: one new consumer joins c1, another > new consumer joins c2, a third new consumer joins c3). > * Another consumer joins c2 and aborts on the first transaction and leaves > the group. Another consumer joins c1 and aborts after one successful > transaction and leaves the group. > * The test waits. All groups stop processing once they have consumed 500K > records _and then_ timeout after 100ms of waiting for more records. > * Finally, the test verifies a bunch of things about what was produced & > seen. For example, all records must be seen in order, they cannot be seen > twice, and they must be modified at each step in the chain appropriately. > The chaining setup can be seen > [here|https://github.com/twmb/franz-go/blob/f30c518d6b727b9169a90b8c10e2127301822a3a/pkg/kgo/helpers_test.go#L482-L513]. > The helpers_test.go file and txn_test.go file are the two to look at to read > everything that is happening. > The point is, the test is _somewhat_ rigorous in trying to trigger a few odd > edge conditions. > h1. Logs > I have captured both client and server logs to show the client is sending > requests in the proper order and to show that the failure is happening > entirely broker side, not client side (i.e., if my understanding is correct, > I am not sending requests improperly that would cause this). > I have attached client and container logs in the .tar.gz attachment below, > with a README in each directory giving a high level overview of what we're > seeing in each directory. I'm including the README for the single broker > setup here: > * client_15_logs: these are the isolated logs for the client in question > that entered a hanging state. You can see EndTxn sent on l290 and we receive > a successful reply. We then send ListTransactions to verify the client has > exited the Prepare state. We loop issuing ListTransactions for >10s, until we > stop trying to verify on l400. We try to force bumping the epoch by one to > fix it in InitProducerID, but that permanently loops receiving a > CONCURRENT_TRANSACTIONS error. This continues for the rest of the file. > * client_raw_logs: these are raw logs for all clients combined in the > TestTxn/cooperative-sticky output. > * container_logs: these are raw logs for the container, the output of > `docker container logs` > * container_logs_for_client: this is container_logs but with all lines that > do not mention the hung transactional ID deleted. The hung transactional ID > is 314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60. On l89, > we can see __consumer_offsets being brought into the transaction, and then > you see an ADD_OFFSETS_TO_TXN request being finished on l92 and then a > TXN_OFFSET_COMMIT request being finished on l93. On l94, the broker prepares > to transition from Ongoing to PrepareCommit. On l97, the broker prepares to > transition from PrepareCommit to CompleteCommit, and on l98 the broker > replies to the client's END_TXN request with a success. However, immediately > after it, you can see LIST_TRANSACTIONS being looped with state filters > PrepareCommit or PrepareAbort, and you can see the transactions never > transitions out of PrepareCommit. LIST_TRANSACTIONS loops for a while, then > you see the INIT_PRODUCER_ID loop for a long time, permanently receiving > CONCURRENT_TRANSACTIONS. > h1. Files > * I have attached the docker-compose.yml I used for both a single broker > setup and a three broker setup. The single broker setup is commented out; the > setups can be easily swapped. > * I've attached the log4j.properties that I used to opt into debug logging > on the request logger as well as the transaction logger. > * I've attached the script I used to run the tests in a loop. The script > contains my local directory paths but should be easy enough to modify. > Essentially, it takes down the cluster and restarts it to have a clean slate > of logs, changes into my kgo test directory, and then runs the test piping > client log output to `logs`. *NOTE* The following environment variables are > required to be set: `KGO_TEST_STABLE_FETCH=true`, `KGO_LOG_LEVEL=debug`, and > when running with a single broker, `KGO_TEST_RF=1`. The test setup does not > always hang, but for me, it hangs quite frequently (every few runs). > * I've attached a .tar.gz file containing two directories, one containing > client & broker logs for a single broker setup, another containing broker > only logs for the three broker setup. > h1. franz-go branch > The ListTransactions verification is not in franz-go main; I have added it to > a branch and to not plan to merge it to main. If you are trying to run this > setup locally, the code is here: > [https://github.com/twmb/franz-go/pull/928/commits/474557f754df27abe0d417fdfb3b4857d135ded8] > I can also change stuff on my side to help debug what's going on here. -- This message was sent by Atlassian Jira (v8.20.10#820010)