[
https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790672#comment-17790672
]
Greg Harris commented on KAFKA-15912:
-------------------------------------
Hey [~mimaison] thanks for the ticket. I've only thought briefly about this and
haven't found any obvious blockers, but there are some design restrictions:
# Since the javadocs for Transformation and Predicate don't mention
thread-safety, I think we have to assume that they are not thread-safe
# There is room in the API for a Transformation to be stateful and
order-sensitive (such as packing records together), so I think we would be
unable to instantiate multiple copies of a single transform stage, and all
records would have to pass serially through a stage.
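To make the second restriction concrete, here is a minimal sketch of a stateful, order-sensitive stage. It is plain Java rather than the real Connect API: PackingStage and runSerially are hypothetical names, and a Function over strings stands in for a Transformation over records. The stage returns null while it is still buffering, mirroring how a Transformation filters a record out by returning null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a stateful Transformation (not Connect API).
// It buffers values and emits one packed value per `batchSize` inputs.
final class PackingStage implements Function<String, String> {
    private final int batchSize;
    // Unsynchronized mutable state: the reason this stage is not thread-safe.
    private final List<String> buffer = new ArrayList<>();

    PackingStage(int batchSize) {
        this.batchSize = batchSize;
    }

    @Override
    public String apply(String value) {
        buffer.add(value);
        if (buffer.size() < batchSize) {
            return null; // nothing to emit yet
        }
        String packed = String.join("+", buffer);
        buffer.clear();
        return packed;
    }

    // Feeds the whole input through ONE instance, one value at a time.
    static List<String> runSerially(List<String> input, int batchSize) {
        PackingStage stage = new PackingStage(batchSize);
        List<String> out = new ArrayList<>();
        for (String value : input) {
            String result = stage.apply(value);
            if (result != null) {
                out.add(result);
            }
        }
        return out;
    }
}
```

With a batch size of 2, the input a, b, c, d yields a+b and c+d. If the same input were split across two instances, or fed to one instance from two threads, the pairings would depend on scheduling, which is why a stage like this has to see every record, in order, on a single thread.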
If Transformations and Predicates could declare themselves thread-safe, then we
would be able to do some finer-grained parallelism, or fall back to actor-style
parallelism (a single thread with message queue input).
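A rough sketch of how that choice could look, under stated assumptions: DeclaredThreadSafe, StageRunner and UpperCase are hypothetical names, nothing like this marker exists in the Connect API today, and a plain Function again stands in for a Transformation. A stage that carries the marker gets a thread pool; any other stage gets a single-thread executor, whose internal task queue plays the role of the actor's message queue.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical opt-in marker; not part of the Connect API.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface DeclaredThreadSafe {}

// Example stage that opts in: it has no mutable state.
@DeclaredThreadSafe
final class UpperCase implements Function<String, String> {
    @Override
    public String apply(String value) {
        return value.toUpperCase();
    }
}

final class StageRunner<T> implements AutoCloseable {
    private final Function<T, T> stage;
    private final ExecutorService executor;

    StageRunner(Function<T, T> stage, int poolSize) {
        this.stage = stage;
        boolean threadSafe =
                stage.getClass().isAnnotationPresent(DeclaredThreadSafe.class);
        // Unmarked stages are assumed not thread-safe: one thread, FIFO queue.
        this.executor = threadSafe
                ? Executors.newFixedThreadPool(poolSize)
                : Executors.newSingleThreadExecutor();
    }

    // Submits every record, then collects results in submission order, so the
    // output order matches the input order in both modes.
    List<T> applyAll(List<T> batch) {
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (T record : batch) {
            futures.add(CompletableFuture.supplyAsync(() -> stage.apply(record), executor));
        }
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            results.add(future.join());
        }
        return results;
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}
```

In the single-thread mode a batch gains nothing within one stage, but different stages of a chain could still run concurrently on different records, each on its own thread. An annotation is only one way to express the declaration; a marker interface or a config flag would serve the same purpose.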
I think it would be ineffective/undesirable for Transformations to take this
performance optimization burden upon themselves completely like the Task
implementations do, so we should certainly improve the framework in this area.
> Parallelize conversion and transformation steps in Connect
> ----------------------------------------------------------
>
> Key: KAFKA-15912
> URL: https://issues.apache.org/jira/browse/KAFKA-15912
> Project: Kafka
> Issue Type: Improvement
> Components: connect
> Reporter: Mickael Maison
> Priority: Major
>
> In busy Connect pipelines, the conversion and transformation steps can
> sometimes have a very significant impact on performance. This is especially
> true with large records with complex schemas, for example with CDC connectors
> like Debezium.
> Today in order to always preserve ordering, converters and transformations
> are called on one record at a time in a single thread in the Connect worker.
> Connect usually handles records in batches: up to max.poll.records in sink
> pipelines, while in source pipelines it depends on the connector, though
> most connectors I've seen still tend to return multiple records each loop.
> It could therefore be highly beneficial to run the converters and
> transformation chain in parallel on a pool of processing threads.
> It should be possible to do some of these steps in parallel and still keep
> exact ordering. I'm even considering whether an option to lose ordering but
> allow even faster processing would make sense.