[
https://issues.apache.org/jira/browse/KAFKA-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904904#comment-17904904
]
Greg Harris commented on KAFKA-18073:
-------------------------------------
Hmm, yeah this is a regression introduced by KAFKA-6738, added to 2.0.0. The PR
and KIP say that it's backwards-compatible, but <2.0 RetriableExceptions caused
tasks to fail, and >= 2.0 RetriableExceptions cause the record to be dropped.
I think you can work-around this with a config:
[https://kafka.apache.org/documentation/#sourceconnectorconfigs_errors.retry.timeout]
which causes the RetriableExceptions to be retried. If you set it to infinite
retries, the record will never be dropped, but also the task will never make
progress.
As for changing this; I think it is reasonable to make it so that
errors.tolerance=none never causes a record to be dropped, as I don't think
someone would want RetriableExceptions to be dropped, but non-retriable
exceptions to throw an error.
> Data loss when exception is raised from Kafka Connect record conversion call
> ----------------------------------------------------------------------------
>
> Key: KAFKA-18073
> URL: https://issues.apache.org/jira/browse/KAFKA-18073
> Project: Kafka
> Issue Type: Bug
> Components: connect
> Affects Versions: 3.1.0, 3.0.0, 3.0.1, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2,
> 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2,
> 3.7.0, 3.6.1, 3.6.2, 3.8.0, 3.7.1, 3.9.0, 3.8.1
> Reporter: Tom Thornton
> Priority: Major
> Labels: connect
>
> We are experiencing data loss when Kafka Connect source connector fails to
> send data to Kafka topic after receiving a retriable exception from a schema
> registry (when converting the record).
> Kafka cluster & connect details:
> # Kafka Connect version: 3.8.1
> # Kafka Version 3.6.1
> # Cluster size: 16 brokers
> # Number of partitions in Kafka data topic: 32
> # Number of partitions in offsets topic:
> # Replication Factor: 4
> # Min In Sync Replicas: 2
> # Unclean leader election disabled
> # Uses Apicurio schema registry to convert to Avro
> # Uses default error tolerance i.e., None
> Producer configs
> * acks = -1
> * max.in.flight.requests.per.connection = 1
> * batch.size = 524288
> Our connector uses the offsets committed by Kafka Connect (does not have any
> internal checkpointing/state storage).
> Data loss is noticed on occasion when a timeout is received during the
> [convertTransformedRecod|https://github.com/apache/kafka/blob/3.8.1/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L398]
> call. The timeout is from connecting to the schema registry for the schema.
> Full stack trace
> [here|https://gist.github.com/twthorn/5764b9dae2664ca72e71c1e6ebd4c792]
> Logs (dropped record is from task 79):
> {code:java}
> 2024-10-22T01:37:41.059Z - INFO - WorkerSourceTask{id=keyspace1-connector-79}
> Committing offsets for 3 acknowledged messages
> 2024-10-22T01:37:41.066Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-232} Committing offsets for 1
> acknowledged messages
> 2024-10-22T01:37:41.540Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-259} Committing offsets for 3
> acknowledged messages
> 2024-10-22T01:37:41.928Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-273} Committing offsets for 2
> acknowledged messages
> 2024-10-22T01:37:44Z - info - Handshake message sent
> 2024-10-22T01:37:51Z - info - Handshake timed out
> 2024-10-22T01:37:52Z - info - Handshake message sent
> 2024-10-22T01:37:58Z - info - Handshake timed out
> 2024-10-22T01:38:04Z - info - Handshake message sent
> 2024-10-22T01:38:10Z - info - Handshake timed out
> 2024-10-22T01:38:11.664Z - INFO - [Producer
> clientId=connector-producer-keyspace1-table1-connector-268] Node 1078
> disconnected.
> 2024-10-22T01:38:15.798Z - INFO - [Producer
> clientId=connector-producer-keyspace1-table1-connector-219] Node 1089
> disconnected.
> 2024-10-22T01:38:20Z - info - Handshake message sent
> 2024-10-22T01:38:23.225Z - INFO - [Producer
> clientId=connector-producer-keyspace1-connector-107] Node 1085 disconnected.
> 2024-10-22T01:38:23.224Z - INFO - [Producer
> clientId=connector-producer-keyspace1-connector-107] Node 1088 disconnected.
> 2024-10-22T01:38:24.536Z - INFO - [Producer
> clientId=connector-producer-keyspace1-table1-connector-13] Node 1076
> disconnected.
> 2024-10-22T01:38:27Z - info - Handshake timed out
> 2024-10-22T01:38:27.491Z - INFO - [Producer
> clientId=connector-producer-keyspace1-table1-connector-237] Node 1082
> disconnected.
> 2024-10-22T01:38:27.492Z - INFO - [Producer
> clientId=connector-producer-keyspace1-table1-connector-237] Node 1089
> disconnected.
> 2024-10-22T01:38:27.493Z - INFO - [Producer
> clientId=connector-producer-keyspace1-table1-connector-237] Node 1085
> disconnected.
> 2024-10-22T01:38:27.832Z - INFO - [Producer
> clientId=connector-producer-keyspace1-table1-connector-237] Node 1088
> disconnected.
> 2024-10-22T01:38:28.470Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-12} Committing offsets for 2
> acknowledged messages
> 2024-10-22T01:38:28.549Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-64} Committing offsets for 14
> acknowledged messages
> 2024-10-22T01:38:36.930Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-53} Committing offsets for 38
> acknowledged messages
> 2024-10-22T01:38:37.701Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-13} Committing offsets for 2
> acknowledged messages
> 2024-10-22T01:38:37.838Z - INFO - [AdminClient
> clientId=dbzium-prod-keyspace1-shared-admin] Node 1089 disconnected.
> 2024-10-22T01:38:38.369Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-25} Committing offsets for 6
> acknowledged messages
> 2024-10-22T01:38:39.212Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-56} Committing offsets for 21
> acknowledged messages
> 2024-10-22T01:38:39.597Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-51} Committing offsets for 11
> acknowledged messages
> 2024-10-22T01:38:39.642Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-219} Committing offsets for 4
> acknowledged messages
> 2024-10-22T01:38:39.647Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-237} Committing offsets for 1
> acknowledged messages
> 2024-10-22T01:38:39.681Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-268} Committing offsets for 8
> acknowledged messages
> 2024-10-22T01:38:41.059Z - INFO -
> WorkerSourceTask{id=keyspace1-connector-107} Committing offsets for 3
> acknowledged messages
> 2024-10-22T01:38:41.540Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-232} Committing offsets for 3
> acknowledged messages
> 2024-10-22T01:38:41.928Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-259} Committing offsets for 2
> acknowledged messages
> 2024-10-22T01:38:42.252Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-273} Committing offsets for 1
> acknowledged messages
> 2024-10-22T01:38:44.913Z - ERROR - Failed to send HTTP request to endpoint:
> http://schema-registry.service.prod-us-east-1-dw1.consul:8080/apis/ccompat/v6/subjects/prod.keyspace1.table1-key/versions?normalize=false
> 2024-10-22T01:38:52Z - info - Handshake message sent
> 2024-10-22T01:39:06.790Z - INFO - [Worker
> clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
> groupId=dbzium-prod-keyspace1] Rebalance started
> 2024-10-22T01:39:07.134Z - INFO - [Worker
> clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
> groupId=dbzium-prod-keyspace1] Successfully joined group with generation
> Generation{generationId=2477,
> memberId='connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083-a174acf5-1b8f-419a-a25a-4c58dab02a8d',
> protocol='sessioned'}
> 2024-10-22T01:39:07.171Z - INFO - [Worker
> clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
> groupId=dbzium-prod-keyspace1] Successfully synced group in generation
> Generation{generationId=2477,
> memberId='connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083-a174acf5-1b8f-419a-a25a-4c58dab02a8d',
> protocol='sessioned'}
> 2024-10-22T01:39:07.171Z - INFO - [Worker
> clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
> groupId=dbzium-prod-keyspace1] Joined group at generation 2477 with protocol
> version 2 and got assignment: Assignment{error=0,
> leader='connect-dbzium-prod-keyspace1-b7cfc9bcc-h8gxb.node.prod-us-east-1-vitess1.consul:8083-dbbc75cb-1de9-4907-bf4c-f3f6cca6eb9a',
>
> leaderUrl='http://dbzium-prod-keyspace1-b7cfc9bcc-h8gxb.node.prod-us-east-1-vitess1.consul:8083/',
> offset=236838, connectorIds=[], taskIds=[keyspace1-table1-connector-12,
> keyspace1-table1-connector-13, keyspace1-table1-connector-25,
> keyspace1-table1-connector-51, keyspace1-table1-connector-53,
> keyspace1-table1-connector-56, keyspace1-table1-connector-64,
> keyspace1-table1-connector-219, keyspace1-table1-connector-232,
> keyspace1-table1-connector-237, keyspace1-table1-connector-259,
> keyspace1-table1-connector-268, keyspace1-table1-connector-273,
> keyspace1-connector-79, keyspace1-connector-107, keyspace1-connector-285],
> revokedConnectorIds=[], revokedTaskIds=[], delay=180000} with rebalance
> delay: 180000
> 2024-10-22T01:39:07.172Z - INFO - [Worker
> clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
> groupId=dbzium-prod-keyspace1] Finished starting connectors and tasks
> 2024-10-22T01:39:07.172Z - INFO - [Worker
> clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
> groupId=dbzium-prod-keyspace1] Starting connectors and tasks using config
> offset 236838
> 2024-10-22T01:39:12Z - info - Handshake message received
> 2024-10-22T01:39:12Z - info - Handshake message sent
> 2024-10-22T01:39:27.534Z - INFO - [Producer
> clientId=connector-producer-keyspace1-connector-107] Node 1074 disconnected.
> 2024-10-22T01:39:27.535Z - INFO - [Producer
> clientId=connector-producer-keyspace1-connector-107] Node 1077 disconnected.
> 2024-10-22T01:39:28.550Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-12} Committing offsets for 4
> acknowledged messages
> 2024-10-22T01:39:28.622Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-64} Committing offsets for 48
> acknowledged messages
> 2024-10-22T01:39:37.102Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-53} Committing offsets for 30
> acknowledged messages
> 2024-10-22T01:39:38.194Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-13} Committing offsets for 3
> acknowledged messages
> 2024-10-22T01:39:38.556Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-25} Committing offsets for 14
> acknowledged messages
> 2024-10-22T01:39:39.598Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-56} Committing offsets for 15
> acknowledged messages
> 2024-10-22T01:39:40.043Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-51} Committing offsets for 8
> acknowledged messages
> 2024-10-22T01:39:40.086Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-219} Committing offsets for 6
> acknowledged messages
> 2024-10-22T01:39:40.096Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-237} Committing offsets for 1
> acknowledged messages
> 2024-10-22T01:39:40.100Z - INFO -
> WorkerSourceTask{id=keyspace1-table1-connector-268} Committing offsets for 8
> acknowledged messages
> 2024-10-22T01:39:41.392Z - INFO - WorkerSourceTask{id=keyspace1-connector-79}
> Committing offsets for 12 acknowledged messages{code}
> Does Kafka Connect commit offsets for a batch even if some of the records
> failed? Or must the whole batch be successful to commit offsets?
> It is odd there is no restart, just a logged error that should have been
> retried.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)