[ 
https://issues.apache.org/jira/browse/KAFKA-18957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948960#comment-17948960
 ] 

Travis Bischel commented on KAFKA-18957:
----------------------------------------

As a point of consolation, I have not ran into this in Kafka 4 with KIP-890 
fully implemented.

> Impossible to recover hanging transactions in Kafka 3.9 in KRaft mode using 
> EOS
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-18957
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18957
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.8.1, 3.9.0
>            Reporter: Travis Bischel
>            Priority: Major
>         Attachments: docker-compose.yml, failures.tar.gz, log4j.properties, 
> script
>
>
> h1. The problem
> I am able to semi-reliably reproduce hanging unrecoverable transactions on 
> Kafka 3.8 using the bitnami container in KRaft mode with a cluster size of 
> both 1 and 3. In the failure scenario, a transaction enters and permanently 
> hangs in either PrepareCommit or PrepareAbort. To try to fix the hanging 
> transaction, I have tried to:
>  * Wait until after the transactional timeout. Even after waiting hours just 
> to see, the transaction hung.
>  * Send a InitProducerID request with the LastProducerID and 
> LastProducerEpoch fields set. I receive the CONCURRENT_TRANSACTIONS error 
> code back, permanently (i.e. even after retrying this request for minutes)
>  * I have tried to initialize a _new_ transaction – that is, send 
> InitProducerID with LastProducerID=-1 and LastProducerEpoch=-1. Again, I 
> permanently receive CONCURRENT_TRANSACTIONS.
>  * I have tried to use kafka-transactions.sh to try to forcefully abort the 
> transaction. Nothing happens; the transaction still shows as PrepareCommit or 
> PrepareAbort.
> I encountered this error first while trying to add support for KIP-848 to my 
> franz-go client. I first thought this was an error with the branch, but to 
> double check, I looped my integration tests on the main branch against Kafka 
> 3.9 and periodically encountered the same permanent-hang failure. This is a 
> new failure; my tests have reliably passed against Kafka for years. I *do* 
> encounter this failure with eager consumer rebalancers, but it is less 
> common. Where the cooperative-sticky test suite may fail after three or four 
> runs, the range test suite (same tests, just using a different balancer) 
> fails after ten or so.
> h1. Description of the test suite
> franz-go has an integration test that
>  * Starts one transactional producer that produces 500K records to t1 (the 
> name in the tests is random), ending and beginning transactions every 10K 
> records 
> ([here|https://github.com/twmb/franz-go/blob/master/pkg/kgo/txn_test.go#L29])
>  * Starts a chain of EOS consumer/producers. The first level (c1) consumes 
> from t1 and produces to t2, the second level (c2) consumes from t2 and 
> produces to t3, and the third level (c3) consumes from t3 and produces to t4.
>  * Starts a second set of EOS consumer/producers. One tries to join c1, but 
> aborts the first batch consumed and then leaves the group. Another also joins 
> c1 and aborts and leaves after two successful commits. The third joins c2 and 
> also aborts and leaves after two successful commits.
>  * After another 5 seconds, a second consumer that survives until the end 
> joins each of c1, c2, and c3 (to be clear: one new consumer joins c1, another 
> new consumer joins c2, a third new consumer joins c3).
>  * Another consumer joins c2 and aborts on the first transaction and leaves 
> the group. Another consumer joins c1 and aborts after one successful 
> transaction and leaves the group.
>  * The test waits. All groups stop processing once they have consumed 500K 
> records _and then_ timeout after 100ms of waiting for more records.
>  * Finally, the test verifies a bunch of things about what was produced & 
> seen. For example, all records must be seen in order, they cannot be seen 
> twice, and they must be modified at each step in the chain appropriately.
> The chaining setup can be seen 
> [here|https://github.com/twmb/franz-go/blob/f30c518d6b727b9169a90b8c10e2127301822a3a/pkg/kgo/helpers_test.go#L482-L513].
>  The helpers_test.go file and txn_test.go file are the two to look at to read 
> everything that is happening.
> The point is, the test is _somewhat_ rigorous in trying to trigger a few odd 
> edge conditions.
> h1. Logs
> I have captured both client and server logs to show the client is sending 
> requests in the proper order and to show that the failure is happening 
> entirely broker side, not client side (i.e., if my understanding is correct, 
> I am not sending requests improperly that would cause this).
> I have attached client and container logs in the .tar.gz attachment below, 
> with a README in each directory giving a high level overview of what we're 
> seeing in each directory. I'm including the README for the single broker 
> setup here:
>  * client_15_logs: these are the isolated logs for the client in question 
> that entered a hanging state. You can see EndTxn sent on l290 and we receive 
> a successful reply. We then send ListTransactions to verify the client has 
> exited the Prepare state. We loop issuing ListTransactions for >10s, until we 
> stop trying to verify on l400. We try to force bumping the epoch by one to 
> fix it in InitProducerID, but that permanently loops receiving a 
> CONCURRENT_TRANSACTIONS error. This continues for the rest of the file.
>  * client_raw_logs: these are raw logs for all clients combined in the 
> TestTxn/cooperative-sticky output.
>  * container_logs: these are raw logs for the container, the output of 
> `docker container logs`
>  * container_logs_for_client: this is container_logs but with all lines that 
> do not mention the hung transactional ID deleted. The hung transactional ID 
> is 314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60. On l89, 
> we can see __consumer_offsets being brought into the transaction, and then 
> you see an ADD_OFFSETS_TO_TXN request being finished on l92 and then a 
> TXN_OFFSET_COMMIT request being finished on l93. On l94, the broker prepares 
> to transition from Ongoing to PrepareCommit. On l97, the broker prepares to 
> transition from PrepareCommit to CompleteCommit, and on l98 the broker 
> replies to the client's END_TXN request with a success. However, immediately 
> after it, you can see LIST_TRANSACTIONS being looped with state filters 
> PrepareCommit or PrepareAbort, and you can see the transactions never 
> transitions out of PrepareCommit. LIST_TRANSACTIONS loops for a while, then 
> you see the INIT_PRODUCER_ID loop for a long time, permanently receiving 
> CONCURRENT_TRANSACTIONS.
> h1. Files
>  * I have attached the docker-compose.yml I used for both a single broker 
> setup and a three broker setup. The single broker setup is commented out; the 
> setups can be easily swapped.
>  * I've attached the log4j.properties that I used to opt into debug logging 
> on the request logger as well as the transaction logger.
>  * I've attached the script I used to run the tests in a loop. The script 
> contains my local directory paths but should be easy enough to modify. 
> Essentially, it takes down the cluster and restarts it to have a clean slate 
> of logs, changes into my kgo test directory, and then runs the test piping 
> client log output to `logs`. *NOTE* The following environment variables are 
> required to be set: `KGO_TEST_STABLE_FETCH=true`, `KGO_LOG_LEVEL=debug`, and 
> when running with a single broker, `KGO_TEST_RF=1`. The test setup does not 
> always hang, but for me, it hangs quite frequently (every few runs).
>  * I've attached a .tar.gz file containing two directories, one containing 
> client & broker logs for a single broker setup, another containing broker 
> only logs for the three broker setup.
> h1. franz-go branch
> The ListTransactions verification is not in franz-go main; I have added it to 
> a branch and to not plan to merge it to main. If you are trying to run this 
> setup locally, the code is here: 
> [https://github.com/twmb/franz-go/pull/928/commits/474557f754df27abe0d417fdfb3b4857d135ded8]
> I can also change stuff on my side to help debug what's going on here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to