Hi Rabin,

I like this idea but it presents challenges for delivery semantics,
especially (though not exclusively) with sink connectors.

If a sink connector uses the SinkTask::preCommit method [1] to explicitly
notify the Connect runtime about which offsets are safe to commit, it may
mistakenly report that an offset is safe to commit after successfully
writing a single record with that offset to the sink system, even though a
1->many transformation has created multiple records with that offset, some
of which may not have been written yet.
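
To make the failure mode concrete, here's a minimal sketch of that pattern.
The preCommit signature is the real one from SinkTask, but the connector
itself (ExampleSinkTask, writeToSink) is hypothetical:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical sink task, for illustration only.
public class ExampleSinkTask extends SinkTask {

    // Offsets of records that have been durably written to the sink system.
    private final Map<TopicPartition, OffsetAndMetadata> flushed = new HashMap<>();

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            writeToSink(record); // assume this write is synchronous and durable
            // After a 1->many transformation, several SinkRecords can carry the
            // same kafkaOffset(); marking it safe after the first one is premature.
            flushed.put(
                new TopicPartition(record.topic(), record.kafkaPartition()),
                new OffsetAndMetadata(record.kafkaOffset() + 1));
        }
    }

    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Tells the runtime these offsets are safe to commit, even though sibling
        // records derived from the same original record may not have been written yet.
        return new HashMap<>(flushed);
    }

    private void writeToSink(SinkRecord record) {
        // Write to the external system (details irrelevant here).
    }

    @Override public String version() { return "0.0.0"; }
    @Override public void start(Map<String, String> props) { }
    @Override public void stop() { }
}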

If a sink connector doesn't use that method, then we'd also have to do some
extra bookkeeping in the Connect runtime to track progress within a batch
of records that were all derived from a single original record in Kafka. We
currently use consumer group offsets to track sink connector progress
(unless the connector chooses to use an alternative mechanism), which
doesn't support any finer granularity than one record at a time, so there
would have to be some non-trivial design work to figure things out on that
front.
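
To illustrate the kind of bookkeeping I mean, here's a rough, purely
hypothetical sketch (ExpandedRecordTracker and its methods don't exist
anywhere in the runtime) of the per-partition state we'd need before an
offset could be considered committable:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;

// Hypothetical per-partition tracker; nothing like this exists in the runtime
// today, and it glosses over gaps in consumed offsets (e.g. transaction markers).
public class ExpandedRecordTracker {

    // For each original offset: how many derived records are still unwritten.
    private final Map<Long, Integer> pending = new HashMap<>();
    private long committableOffset = -1L; // assumes consumption starts at offset 0

    // Called when the transformation chain turns the record at originalOffset
    // into `count` sink records.
    public synchronized void recordExpanded(long originalOffset, int count) {
        pending.put(originalOffset, count);
    }

    // Called when one derived record has been durably written to the sink.
    public synchronized void recordWritten(long originalOffset) {
        pending.merge(originalOffset, -1, Integer::sum);
        // Advance past every original offset whose derived records are all written.
        while (pending.getOrDefault(committableOffset + 1, -1) == 0) {
            pending.remove(committableOffset + 1);
            committableOffset++;
        }
    }

    // The offset the runtime could safely commit for this partition (one past the
    // last fully-written record), or null if nothing is committable yet.
    public synchronized OffsetAndMetadata committable() {
        return committableOffset < 0 ? null : new OffsetAndMetadata(committableOffset + 1);
    }
}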

There's also a question of how we'd want to apply this across entire chains
of transformations. You mention the possibility of many->many; does this
mean that there would be an opt-in interface for accepting an entire batch
of records that was produced by a prior transformation?
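
For the sake of discussion, an opt-in interface might look something like the
sketch below; the name BatchTransformation and its shape are completely made
up, just to illustrate the question:

import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.connector.ConnectRecord;

// Hypothetical opt-in interface; the name and shape are invented purely to
// illustrate the question and are not an actual proposal.
public interface BatchTransformation<R extends ConnectRecord<R>> {

    void configure(Map<String, ?> configs);

    // Receives the entire batch emitted by the previous step in the chain, and
    // may return fewer, the same number, or more records (many -> many).
    Collection<R> applyBatch(Collection<R> records);

    void close();
}

If that's roughly what you have in mind, the chaining semantics between
batch-aware and single-record transformations would need to be spelled out as
well.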

Overall my sense of this feature so far has been that there are a lot of
edge cases that can bite our users if we're not careful about addressing
them with clever design work, which is at least part of the reason that we
haven't implemented it yet. I'd be interested in your thoughts about how we
can make this feature safe and easy to use without adding footguns for
existing connectors and deployments.

[1] -
https://kafka.apache.org/34/javadoc/org/apache/kafka/connect/sink/SinkTask.html#preCommit(java.util.Map)

Cheers,

Chris

On Tue, May 16, 2023 at 6:26 AM Rabin Banerjee
<rabin_baner...@apple.com.invalid> wrote:

> Hi Team,
>
> We have a few use cases where nested records with arrays need to be
> flattened / exploded into multiple rows and written to a destination like
> Cassandra.
>
> Currently, the Transformation interface in Kafka Connect takes one record and
> returns one or zero records.
> Also, the current Flatten SMT ignores arrays:
> https://issues.apache.org/jira/browse/KAFKA-12305.
>
> We would like to know your thoughts on enhancing the Transformation
> interface to produce more than one record, i.e. 1 -> many or many -> many.
>
> We understand an alternative is to use an upstream Kafka Streams pipeline,
> but that has multiple challenges, like adding an extra hop, having more
> pipelines to maintain, etc.
>
> Enhancing the Transformation interface would allow us to have a generic SMT
> to handle this, like Explode, similar to
> https://github.com/apache/spark/blob/007c4593058537251d83a4cb44efe31e394aee22/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4542
> Thanks
> Rabin
