[ https://issues.apache.org/jira/browse/KAFKA-18957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17934038#comment-17934038 ]

Justine Olshan edited comment on KAFKA-18957 at 3/10/25 10:00 PM:
------------------------------------------------------------------

Hey there, thanks for reporting.
This appears to be a transaction where the marker write is not completing. This
is slightly different from the "hanging transaction" terminology the Kafka
project has typically used for cases where the transaction coordinator was not
aware of the transaction and thus couldn't time out and abort it.

Here, it seems there are many signals (your manual attempts + the producer
timeout) trying to abort the transaction from the coordinator, but the
operation is not completing. I have seen this in practice a few times, and it
usually involved either the marker request being lost and never retried (i.e., a
dropped WriteMarkerRequest) or an inability to write the marker at all
(max.message.bytes for the partition was 0). I will try to take a look at this
specific case.
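
To make that concrete: the verification described in this report boils down to
polling ListTransactions and watching the transactional ID stay in
PrepareCommit, roughly like the following franz-go sketch (the request and
field names here are from memory and may not match the kmsg package exactly):

{code:go}
// Rough sketch: poll ListTransactions until the given transactional ID is no
// longer in PrepareCommit/PrepareAbort, i.e. until the markers are written.
// Request/field names follow my recollection of franz-go's kmsg package.
package txncheck

import (
	"context"
	"fmt"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/kmsg"
)

func waitForMarkers(ctx context.Context, cl *kgo.Client, txnID string) error {
	for {
		req := kmsg.NewPtrListTransactionsRequest()
		req.StateFilters = []string{"PrepareCommit", "PrepareAbort"}

		// Note: ListTransactions is answered per broker; in a multi-broker
		// cluster you would shard this request across all brokers.
		resp, err := req.RequestWith(ctx, cl)
		if err != nil {
			return err
		}

		stuck := false
		for _, ts := range resp.TransactionStates {
			if ts.TransactionalID == txnID {
				stuck = true
				fmt.Printf("txn %s still in %s\n", txnID, ts.TransactionState)
			}
		}
		if !stuck {
			return nil // markers written; the transaction completed
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(250 * time.Millisecond):
		}
	}
}
{code}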



> Impossible to recover hanging transactions in Kafka 3.9 in KRaft mode using 
> EOS
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-18957
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18957
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.9.0, 3.8.1
>            Reporter: Travis Bischel
>            Priority: Major
>         Attachments: docker-compose.yml, failures.tar.gz, log4j.properties, 
> script
>
>
> h1. The problem
> I am able to semi-reliably reproduce hanging unrecoverable transactions on 
> Kafka 3.8 using the bitnami container in KRaft mode with cluster sizes of 
> both 1 and 3. In the failure scenario, a transaction enters and permanently 
> hangs in either PrepareCommit or PrepareAbort. To fix the hanging 
> transaction, I have tried to:
>  * Wait until after the transactional timeout. Even after waiting hours just 
> to see, the transaction remained hung.
>  * Send an InitProducerID request with the LastProducerID and 
> LastProducerEpoch fields set. I receive the CONCURRENT_TRANSACTIONS error 
> code back, permanently (i.e., even after retrying this request for minutes); 
> see the sketch after this list.
>  * I have tried to initialize a _new_ transaction – that is, send 
> InitProducerID with LastProducerID=-1 and LastProducerEpoch=-1. Again, I 
> permanently receive CONCURRENT_TRANSACTIONS.
>  * I have tried to use kafka-transactions.sh to forcefully abort the 
> transaction. Nothing happens; the transaction still shows as PrepareCommit or 
> PrepareAbort.
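> For concreteness, the epoch-bump attempt above looks roughly like the 
> following franz-go sketch (the kmsg request and field names here are from 
> memory and may not match the package exactly); in the failing runs this just 
> keeps returning CONCURRENT_TRANSACTIONS:
> {code:go}
> // Rough sketch of the InitProducerID "epoch bump" recovery attempt. The
> // ProducerID/ProducerEpoch fields here correspond to the LastProducerID /
> // LastProducerEpoch mentioned above; exact field names are from memory.
> package txnrecover
>
> import (
> 	"context"
>
> 	"github.com/twmb/franz-go/pkg/kerr"
> 	"github.com/twmb/franz-go/pkg/kgo"
> 	"github.com/twmb/franz-go/pkg/kmsg"
> )
>
> func bumpEpoch(ctx context.Context, cl *kgo.Client, txnID string, pid int64, epoch int16) error {
> 	req := kmsg.NewPtrInitProducerIDRequest()
> 	req.TransactionalID = kmsg.StringPtr(txnID)
> 	req.TransactionTimeoutMillis = 60000
> 	req.ProducerID = pid      // current ("last") producer ID
> 	req.ProducerEpoch = epoch // current ("last") producer epoch
>
> 	resp, err := req.RequestWith(ctx, cl) // kgo should route this to the txn coordinator
> 	if err != nil {
> 		return err
> 	}
> 	// In the failure described here, this never succeeds and instead keeps
> 	// returning CONCURRENT_TRANSACTIONS.
> 	return kerr.ErrorForCode(resp.ErrorCode)
> }
> {code}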
> I encountered this error first while trying to add support for KIP-848 to my 
> franz-go client. I first thought this was an error with the branch, but to 
> double check, I looped my integration tests on the main branch against Kafka 
> 3.9 and periodically encountered the same permanent-hang failure. This is a 
> new failure; my tests have reliably passed against Kafka for years. I *do* 
> encounter this failure with eager consumer rebalancers, but it is less 
> common. Where the cooperative-sticky test suite may fail after three or four 
> runs, the range test suite (same tests, just using a different balancer) 
> fails after ten or so.
> h1. Description of the test suite
> franz-go has an integration test that
>  * Starts one transactional producer that produces 500K records to t1 (the 
> name in the tests is random), ending and beginning transactions every 10K 
> records 
> ([here|https://github.com/twmb/franz-go/blob/master/pkg/kgo/txn_test.go#L29]; 
> a simplified sketch of this producer loop appears at the end of this section)
>  * Starts a chain of EOS consumer/producers. The first level (c1) consumes 
> from t1 and produces to t2, the second level (c2) consumes from t2 and 
> produces to t3, and the third level (c3) consumes from t3 and produces to t4.
>  * Starts a second set of EOS consumer/producers. One tries to join c1, but 
> aborts the first batch consumed and then leaves the group. Another also joins 
> c1 and aborts and leaves after two successful commits. The third joins c2 and 
> also aborts and leaves after two successful commits.
>  * After another 5 seconds, a second consumer that survives until the end 
> joins each of c1, c2, and c3 (to be clear: one new consumer joins c1, another 
> new consumer joins c2, a third new consumer joins c3).
>  * Another consumer joins c2 and aborts on the first transaction and leaves 
> the group. Another consumer joins c1 and aborts after one successful 
> transaction and leaves the group.
>  * The test waits. All groups stop processing once they have consumed 500K 
> records _and then_ time out after 100ms of waiting for more records.
>  * Finally, the test verifies a bunch of things about what was produced & 
> seen. For example, all records must be seen in order, they cannot be seen 
> twice, and they must be modified at each step in the chain appropriately.
> The chaining setup can be seen 
> [here|https://github.com/twmb/franz-go/blob/f30c518d6b727b9169a90b8c10e2127301822a3a/pkg/kgo/helpers_test.go#L482-L513].
>  The helpers_test.go file and txn_test.go file are the two to look at to read 
> everything that is happening.
> The point is, the test is _somewhat_ rigorous in trying to trigger a few odd 
> edge conditions.
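> For reference, the producer side of the first bullet above boils down to 
> roughly the following simplified sketch (the broker address, topic, and 
> transactional ID here are illustrative, not the actual test code):
> {code:go}
> // Simplified sketch of the transactional producer: produce records to t1,
> // committing a transaction every perTxn records. Names are illustrative.
> package txnproducer
>
> import (
> 	"context"
> 	"fmt"
>
> 	"github.com/twmb/franz-go/pkg/kgo"
> )
>
> func produceInTxns(ctx context.Context, total, perTxn int) error {
> 	cl, err := kgo.NewClient(
> 		kgo.SeedBrokers("localhost:9092"),
> 		kgo.TransactionalID("example-producer"),
> 		kgo.DefaultProduceTopic("t1"),
> 	)
> 	if err != nil {
> 		return err
> 	}
> 	defer cl.Close()
>
> 	for i := 0; i < total; i++ {
> 		if i%perTxn == 0 {
> 			if err := cl.BeginTransaction(); err != nil {
> 				return err
> 			}
> 		}
> 		rec := &kgo.Record{Value: []byte(fmt.Sprintf("record-%d", i))}
> 		if err := cl.ProduceSync(ctx, rec).FirstErr(); err != nil {
> 			return err
> 		}
> 		if (i+1)%perTxn == 0 || i == total-1 {
> 			// EndTransaction flushes and then commits; the hang reported here
> 			// happens broker side after a successful EndTxn response.
> 			if err := cl.EndTransaction(ctx, kgo.TryCommit); err != nil {
> 				return err
> 			}
> 		}
> 	}
> 	return nil
> }
> {code}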
> h1. Logs
> I have captured both client and server logs to show the client is sending 
> requests in the proper order and to show that the failure is happening 
> entirely broker side, not client side (i.e., if my understanding is correct, 
> I am not sending requests improperly in a way that would cause this).
> I have attached client and container logs in the .tar.gz attachment below, 
> with a README in each directory giving a high-level overview of what we're 
> seeing. I'm including the README for the single broker 
> setup here:
>  * client_15_logs: these are the isolated logs for the client in question 
> that entered a hanging state. You can see EndTxn sent on l290 and we receive 
> a successful reply. We then send ListTransactions to verify the transaction 
> has exited the Prepare state. We loop issuing ListTransactions for >10s, until 
> we stop trying to verify on l400. We then try to fix it by force-bumping the 
> epoch by one via InitProducerID, but that loops permanently, receiving 
> CONCURRENT_TRANSACTIONS errors. This continues for the rest of the file.
>  * client_raw_logs: these are raw logs for all clients combined in the 
> TestTxn/cooperative-sticky output.
>  * container_logs: these are raw logs for the container, the output of 
> `docker container logs`
>  * container_logs_for_client: this is container_logs but with all lines that 
> do not mention the hung transactional ID deleted. The hung transactional ID 
> is 314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60. On l89, 
> we can see __consumer_offsets being brought into the transaction, and then 
> you see an ADD_OFFSETS_TO_TXN request being finished on l92 and then a 
> TXN_OFFSET_COMMIT request being finished on l93. On l94, the broker prepares 
> to transition from Ongoing to PrepareCommit. On l97, the broker prepares to 
> transition from PrepareCommit to CompleteCommit, and on l98 the broker 
> replies to the client's END_TXN request with a success. However, immediately 
> after it, you can see LIST_TRANSACTIONS being looped with state filters 
> PrepareCommit or PrepareAbort, and the transaction never transitions out of 
> PrepareCommit. LIST_TRANSACTIONS loops for a while, then you see 
> INIT_PRODUCER_ID loop for a long time, permanently receiving 
> CONCURRENT_TRANSACTIONS.
> h1. Files
>  * I have attached the docker-compose.yml I used for both a single broker 
> setup and a three broker setup. The single broker setup is commented out; the 
> setups can be easily swapped.
>  * I've attached the log4j.properties that I used to opt into debug logging 
> on the request logger as well as the transaction logger.
>  * I've attached the script I used to run the tests in a loop. The script 
> contains my local directory paths but should be easy enough to modify. 
> Essentially, it takes down the cluster and restarts it to have a clean slate 
> of logs, changes into my kgo test directory, and then runs the test piping 
> client log output to `logs`. *NOTE* The following environment variables are 
> required to be set: `KGO_TEST_STABLE_FETCH=true`, `KGO_LOG_LEVEL=debug`, and 
> when running with a single broker, `KGO_TEST_RF=1`. The test setup does not 
> always hang, but for me, it hangs quite frequently (every few runs).
>  * I've attached a .tar.gz file containing two directories, one containing 
> client & broker logs for a single broker setup, another containing broker 
> only logs for the three broker setup.
> h1. franz-go branch
> The ListTransactions verification is not in franz-go main; I have added it to 
> a branch and do not plan to merge it to main. If you are trying to run this 
> setup locally, the code is here: 
> [https://github.com/twmb/franz-go/pull/928/commits/474557f754df27abe0d417fdfb3b4857d135ded8]
> I can also change stuff on my side to help debug what's going on here.



