[
https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163142#comment-17163142
]
Sophie Blee-Goldman commented on KAFKA-8037:
--------------------------------------------
Ah, thanks for bringing us back to the question of double-topic vs
restore-time. I don't think I touched on this earlier, and may have taken my
thoughts on this question for granted without explaining them. If we can agree
that the asymmetric/side-effect serdes are not a problem here (and that is a
big "if"), then in the case that we may have corrupt data (non-default DEH) I
think we should just deserialize during restoration instead of adding a second
changelog topic. Since we only have to deserialize and not serialize, the
performance hit might not be as bad. Also, we have a number of improvements to
restoration, both implemented and soon to be implemented, that make restoration
performance less of a pain point. For one thing, with KIP-441 most of the
restoration will occur in the background anyway, as long as there is one
caught-up client. Moving restoration to a separate thread will (hopefully)
speed up restoration, but more importantly it means that the main thread can
continue to process other active tasks rather than being completely blocked on
recovery. On top of that, there are all the RocksDB optimizations being
considered.
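To make the deserialize-during-restore idea concrete, here is a minimal sketch in plain Java (not the actual Kafka Streams restore path or API; the class and method names are hypothetical): the restore loop attempts to deserialize each changelog value and simply skips records that fail, mirroring what a log-and-continue deserialization exception handler does on the read path. No re-serialization is needed, which is why the overhead should be limited to the deserialize step.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Streams API: validate changelog records during
// restoration by attempting deserialization before they reach the store.
public class RestoreValidationSketch {

    // Stand-in for a value deserializer that rejects corrupt bytes.
    static int deserializeInt(byte[] value) {
        String s = new String(value, StandardCharsets.UTF_8);
        return Integer.parseInt(s); // throws NumberFormatException on corrupt data
    }

    // Restore loop with a deserialize-only check: corrupt records are skipped
    // instead of being copied into the store as raw bytes.
    static List<Integer> restore(List<byte[]> changelog) {
        List<Integer> store = new ArrayList<>();
        for (byte[] value : changelog) {
            try {
                store.add(deserializeInt(value)); // no re-serialization needed
            } catch (NumberFormatException corrupt) {
                // analogous to a log-and-continue handler: drop and move on
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<byte[]> changelog = List.of(
            "1".getBytes(StandardCharsets.UTF_8),
            "not-a-number".getBytes(StandardCharsets.UTF_8), // corrupt record
            "3".getBytes(StandardCharsets.UTF_8));
        System.out.println(restore(changelog)); // corrupt record dropped
    }
}
```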
> KTable restore may load bad data
> --------------------------------
>
> Key: KAFKA-8037
> URL: https://issues.apache.org/jira/browse/KAFKA-8037
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: Matthias J. Sax
> Priority: Minor
> Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a
> `deserialization.exception.handler` to drop corrupted records on read.
> However, this mechanism may be bypassed on restore. Assume a
> `builder.table()` call reads and drops a corrupted record. If the table state
> is lost and restored from the changelog topic, the corrupted record may be
> copied into the store, because on restore plain bytes are copied.
> If the KTable is used in a join, an internal `store.get()` call to lookup the
> record would fail with a deserialization exception if the value part cannot
> be deserialized.
> GlobalKTables are affected, too (cf. KAFKA-7663, which may allow a fix for
> the GlobalKTable case). It's unclear to me atm how this issue could be
> addressed for KTables, though.
> Note, that user state stores are not affected, because they always have a
> dedicated changelog topic (and don't reuse an input topic) and thus the
> corrupted record would not be written into the changelog.
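The bypass described in the issue can be illustrated with a small self-contained simulation in plain Java (again, hypothetical names, not the Streams implementation): restore copies raw bytes with no deserialization, so a corrupt record that the read-path handler would have dropped survives restoration and only blows up later, when a join-time `store.get()` has to deserialize the value.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the bug: restore copies plain bytes, so a
// record the deserialization exception handler would have dropped on the
// read path ends up in the store anyway and only fails on lookup.
public class RestoreBypassDemo {

    // Restore as done today: copy raw bytes, no deserialization performed.
    static Map<String, byte[]> rawByteRestore(Map<String, byte[]> changelog) {
        return new HashMap<>(changelog);
    }

    // A later store.get() in a join must deserialize the value, and it is
    // here, not during restore, that the corrupt record finally surfaces.
    static Integer get(Map<String, byte[]> store, String key) {
        byte[] value = store.get(key);
        return Integer.parseInt(new String(value, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Map<String, byte[]> changelog = new HashMap<>();
        changelog.put("good", "42".getBytes(StandardCharsets.UTF_8));
        changelog.put("bad", "corrupt!".getBytes(StandardCharsets.UTF_8));

        Map<String, byte[]> store = rawByteRestore(changelog);
        System.out.println(get(store, "good")); // 42
        try {
            get(store, "bad"); // deserialization exception at join time
        } catch (NumberFormatException e) {
            System.out.println("lookup failed: corrupt record survived restore");
        }
    }
}
```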
--
This message was sent by Atlassian Jira
(v8.3.4#803005)