[ https://issues.apache.org/jira/browse/KAFKA-17582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kirk True updated KAFKA-17582:
------------------------------
    Component/s: consumer

> Unpredictable consumer position after transaction abort
> -------------------------------------------------------
>
>                 Key: KAFKA-17582
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17582
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer, documentation
>    Affects Versions: 3.8.0
>            Reporter: Kyle Kingsbury
>            Priority: Critical
>              Labels: abort, offset, transaction
>         Attachments: 20240919T124411.740-0500(1).zip, Screenshot from
> 2024-09-19 18-45-34.png
>
>
> With the official Kafka Java client, version 3.8.0, the position of
> consumers after a transaction aborts appears unpredictable. Sometimes the
> consumer moves on, skipping over the records it polled in the aborted
> transaction. Sometimes it rewinds to read them again. Sometimes it rewinds
> *further* than the most recent transaction.
>
> Since the goal of transactions is to enable "exactly-once semantics", it
> seems sensible that the consumer should rewind on abort, such that any
> subsequent transactions would start at the same offsets. Not rewinding
> leads to data loss, since messages are consumed but their effects are not
> committed. Rewinding too far is... just weird.
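>
> For context, this kind of test drives the standard transactional
> consume-process-produce loop. Here is a minimal sketch of that loop; the
> topic name "events" and class name TxnLoop are hypothetical, and the
> producer and consumer are assumed to be built from the configs below:
>
> {code:java}
> import java.time.Duration;
> import java.util.HashMap;
> import java.util.Map;
> import org.apache.kafka.clients.consumer.*;
> import org.apache.kafka.clients.producer.*;
> import org.apache.kafka.common.KafkaException;
> import org.apache.kafka.common.TopicPartition;
>
> class TxnLoop {
>     static void run(KafkaConsumer<Long, Long> consumer,
>                     KafkaProducer<Long, Long> producer) {
>         producer.initTransactions();
>         while (true) {
>             ConsumerRecords<Long, Long> records = consumer.poll(Duration.ofSeconds(1));
>             if (records.isEmpty()) continue;
>             producer.beginTransaction();
>             try {
>                 Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
>                 for (ConsumerRecord<Long, Long> record : records) {
>                     producer.send(new ProducerRecord<>("events", record.key(), record.value()));
>                     // The offset to resume from is one past this record.
>                     offsets.put(new TopicPartition(record.topic(), record.partition()),
>                                 new OffsetAndMetadata(record.offset() + 1));
>                 }
>                 // Consumed offsets commit atomically with the produced records.
>                 producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
>                 producer.commitTransaction();
>             } catch (KafkaException e) {
>                 // On abort, nothing is committed -- but the consumer's
>                 // in-memory position has already advanced past the polled
>                 // records, which is where the ambiguity arises.
>                 producer.abortTransaction();
>             }
>         }
>     }
> }
> {code}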
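>
> And a minimal sketch of the rewind-on-abort semantics argued for above:
> snapshot each partition's position before polling and, on abort, seek back
> so the next transaction starts at the same offsets. This illustrates the
> proposed behavior, not anything the client is documented to do today:
>
> {code:java}
> // Snapshot positions before polling (assumes partitions are assigned).
> Map<TopicPartition, Long> positions = new HashMap<>();
> for (TopicPartition tp : consumer.assignment()) {
>     positions.put(tp, consumer.position(tp));
> }
> ConsumerRecords<Long, Long> records = consumer.poll(Duration.ofSeconds(1));
> producer.beginTransaction();
> try {
>     // ... send, sendOffsetsToTransaction, commitTransaction as above ...
> } catch (KafkaException e) {
>     producer.abortTransaction();
>     // Rewind so subsequent transactions re-read exactly what this one saw.
>     for (Map.Entry<TopicPartition, Long> entry : positions.entrySet()) {
>         consumer.seek(entry.getKey(), entry.getValue());
>     }
> }
> {code}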
>
> I'm seeing this issue in Jepsen tests of Kafka 3.0.0 and other
> Kafka-compatible systems. It occurs without faults, and with a single
> producer and consumer; no other concurrent processes. Here's the producer
> and consumer config:
>
> {code}
> Producer config: {"socket.connection.setup.timeout.max.ms" 1000,
>  "transactional.id" "jt1", "bootstrap.servers" "n3:9092",
>  "request.timeout.ms" 3000, "enable.idempotence" true, "max.block.ms" 10000,
>  "value.serializer" "org.apache.kafka.common.serialization.LongSerializer",
>  "retries" 1000,
>  "key.serializer" "org.apache.kafka.common.serialization.LongSerializer",
>  "socket.connection.setup.timeout.ms" 500, "reconnect.backoff.max.ms" 1000,
>  "delivery.timeout.ms" 10000, "acks" "all", "transaction.timeout.ms" 1000}
> {code}
>
> {code}
> Consumer config: {"socket.connection.setup.timeout.max.ms" 1000,
>  "bootstrap.servers" "n5:9092", "request.timeout.ms" 10000,
>  "connections.max.idle.ms" 60000, "session.timeout.ms" 6000,
>  "heartbeat.interval.ms" 300,
>  "key.deserializer" "org.apache.kafka.common.serialization.LongDeserializer",
>  "group.id" "jepsen-group", "metadata.max.age.ms" 60000,
>  "auto.offset.reset" "earliest", "isolation.level" "read_committed",
>  "socket.connection.setup.timeout.ms" 500,
>  "value.deserializer" "org.apache.kafka.common.serialization.LongDeserializer",
>  "enable.auto.commit" false, "default.api.timeout.ms" 10000}
> {code}
>
> Attached is a test run that shows this behavior, as well as a visualization
> of the reads (polls) and writes (sends) of a single topic-partition.
>
> In this plot, time flows down, and offsets run left to right. Each
> transaction is a single horizontal line. `w1` denotes a send of value 1,
> and `r2` denotes a poll that read value 2. All operations here are
> performed by the sole process in the system, which has a single Kafka
> producer and a single Kafka consumer. First, a transaction writes 35 and
> commits. Second, a transaction reads 35 and aborts. Third, a transaction
> reads 35 and aborts: the consumer has clearly rewound to show the same
> record twice.
>
> Then a transaction writes 37. Immediately thereafter a transaction reads 37
> and 38. Unlike before, it did *not* rewind. This transaction also aborts.
> Finally, a transaction writes 39 and 40. Then a transaction reads 39 and
> 40. This transaction commits! Values 35, 37, and 38 have been lost!
>
> It doesn't seem possible that this is the effect of a consumer rebalance:
> rebalancing should start the consumer off at the last *committed* offset,
> and the last committed offset in this history was at value 31; it should
> have picked up at 35, 37, etc. This test uses
> {{auto.offset.reset=earliest}}, so if the commit were somehow missing, it
> should have rewound to the start of the topic-partition.
>
> What... *should* Kafka do with respect to consumer offsets when a
> transaction aborts? And is there any sort of documentation for this? I've
> been digging into this problem for almost a week (it manifested as write
> loss in a Jepsen test) and I'm baffled as to how to proceed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)