Supporting 1 Kafka record -> many rows in sink connectors would be really
useful to properly support some data formats without incurring additional
overhead.

As an example, when consuming OpenTelemetry OTLP protobuf data, each record
carries multiple data points, which significantly reduces both bandwidth
and deserialization overhead.
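
To make that concrete, here is a rough sketch of unpacking one such record,
assuming the generated opentelemetry-proto Java bindings (only gauge metrics
handled, other metric types and error handling omitted):

    import com.google.protobuf.InvalidProtocolBufferException;
    import io.opentelemetry.proto.metrics.v1.MetricsData;
    import io.opentelemetry.proto.metrics.v1.NumberDataPoint;

    // One Kafka record value = one serialized MetricsData message,
    // which fans out into many individual data points ("rows").
    static void unpack(byte[] recordValue) throws InvalidProtocolBufferException {
        MetricsData data = MetricsData.parseFrom(recordValue);
        data.getResourceMetricsList().forEach(rm ->
            rm.getScopeMetricsList().forEach(sm ->
                sm.getMetricsList().forEach(metric -> {
                    if (metric.hasGauge()) {
                        for (NumberDataPoint dp : metric.getGauge().getDataPointsList()) {
                            // each data point would become its own row in the sink
                            System.out.println(dp.getAsDouble());
                        }
                    }
                })));
    }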

Today it is not possible to fully support such a format in Connect when a
sink needs to write one data point per row into a database.
The workaround is to produce a single data point per record, but that comes
with significant deserialization and bandwidth overhead on the consumer
side, and extra overhead on the producer side as well.
Adding an intermediate stream-processing step doesn't remove those costs;
it only adds more.

For this particular example, it wouldn't be necessary to track the entire
chain of transformations. Supporting this at the converter level alone
would already be a huge improvement.
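
To illustrate what I mean (this interface is purely hypothetical; today's
Converter can only return a single SchemaAndValue), something along these
lines at the converter level would be enough for the OTLP case:

    import java.util.List;

    import org.apache.kafka.connect.data.SchemaAndValue;
    import org.apache.kafka.connect.storage.Converter;

    // Hypothetical opt-in extension: one serialized Kafka record is
    // deserialized into one Connect value per data point. The runtime
    // would turn each element into its own SinkRecord, all sharing the
    // original record's offset.
    public interface MultiRecordConverter extends Converter {
        List<SchemaAndValue> toConnectDataList(String topic, byte[] value);
    }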

Thanks!
Xavier

On Tue, May 16, 2023 at 4:15 AM Chris Egerton <chr...@aiven.io.invalid>
wrote:

> Hi Rabin,
>
> I like this idea but it presents challenges for delivery semantics,
> especially (though not exclusively) with sink connectors.
>
> If a sink connector uses the SinkTask::preCommit method [1] to explicitly
> notify the Connect runtime about which offsets are safe to commit, it may
> mistakenly report that an offset is safe to commit after successfully
> writing a single record with that offset to the sink system, even if a
> 1->many transformation has created multiple records with that offset.
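>
> To sketch that hazard (the class below is invented for illustration and
> is not a real connector):
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     import org.apache.kafka.clients.consumer.OffsetAndMetadata;
>     import org.apache.kafka.common.TopicPartition;
>     import org.apache.kafka.connect.sink.SinkTask;
>
>     public abstract class NaiveSinkTask extends SinkTask {
>         private final Map<TopicPartition, OffsetAndMetadata> flushed = new HashMap<>();
>
>         // Called after a record is durably written to the sink system.
>         protected void markFlushed(TopicPartition tp, long offset) {
>             // Unsafe under 1->many: the first derived record with this
>             // offset marks it committable, even though sibling records
>             // derived from the same original may not be written yet.
>             flushed.put(tp, new OffsetAndMetadata(offset + 1));
>         }
>
>         @Override
>         public Map<TopicPartition, OffsetAndMetadata> preCommit(
>                 Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
>             return new HashMap<>(flushed);
>         }
>     }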
>
> If a sink connector doesn't use that method, then we'd also have to do some
> extra bookkeeping in the Connect runtime to track progress within a batch
> of records that were all derived from a single original record in Kafka. We
> currently use consumer group offsets to track sink connector progress
> (unless the connector chooses to use an alternative mechanism), which
> doesn't support any finer granularity than one record at a time, so there
> would have to be some non-trivial design work to figure things out on that
> front.
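>
> One conceivable shape for that bookkeeping (names invented, nothing like
> this exists in the runtime today) would be to count down the records
> derived from each source offset:
>
>     import java.util.HashMap;
>     import java.util.Map;
>
>     // A source offset becomes committable only once all records
>     // derived from it have been flushed to the sink system.
>     final class DerivedRecordTracker {
>         private final Map<Long, Integer> pending = new HashMap<>();
>
>         // Called when one original record expands into derivedCount records.
>         void onExpand(long sourceOffset, int derivedCount) {
>             pending.put(sourceOffset, derivedCount);
>         }
>
>         // Called per flushed derived record; true means it is now safe
>         // to commit sourceOffset + 1. Assumes onExpand ran first.
>         boolean onFlush(long sourceOffset) {
>             int remaining = pending.merge(sourceOffset, -1, Integer::sum);
>             if (remaining == 0) {
>                 pending.remove(sourceOffset);
>                 return true;
>             }
>             return false;
>         }
>     }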
>
> There's also a question of how we'd want to apply this across entire chains
> of transformations. You mention the possibility of many->many; does this
> mean that there would be an opt-in interface for accepting an entire batch
> of records that was produced by a prior transformation?
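>
> Purely as a strawman, such an opt-in could be a batch-oriented sibling of
> today's Transformation (this interface does not exist):
>
>     import java.util.List;
>
>     import org.apache.kafka.connect.connector.ConnectRecord;
>
>     // Hypothetical opt-in interface: consumes the entire batch emitted
>     // by the previous stage in the chain, enabling many -> many.
>     // (Configurable/AutoCloseable plumbing omitted for brevity.)
>     public interface BatchTransformation<R extends ConnectRecord<R>> {
>         List<R> applyBatch(List<R> records);
>     }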
>
> Overall my sense of this feature so far has been that there are a lot of
> edge cases that can bite our users if we're not careful about addressing
> them with clever design work, which is at least part of the reason that we
> haven't implemented it yet. I'd be interested in your thoughts about how we
> can make this feature safe and easy to use without adding footguns for
> existing connectors and deployments.
>
> [1] -
>
> https://kafka.apache.org/34/javadoc/org/apache/kafka/connect/sink/SinkTask.html#preCommit(java.util.Map)
>
> Cheers,
>
> Chris
>
> On Tue, May 16, 2023 at 6:26 AM Rabin Banerjee
> <rabin_baner...@apple.com.invalid> wrote:
>
> > Hi Team,
> >
> > We have a few use cases where nested records with arrays need to be
> > flattened / exploded into multiple rows and written to a destination
> > like Cassandra.
> >
> > Currently, the Transformation interface in Kafka Connect takes one
> > record and returns one record or none.
> > Also, the current Flatten SMT ignores arrays:
> > https://issues.apache.org/jira/browse/KAFKA-12305.
> >
> > We would like to know your thoughts on enhancing the Transformation
> > interface to produce more than one record: 1 -> many or many -> many.
> >
> > We understand an alternative is to use an upstream KStream pipeline,
> > but that has multiple challenges, like adding an extra hop, having more
> > pipelines to maintain, etc.
> >
> > Enhancing the Transformation interface would allow us to have a generic
> > SMT to handle this, such as an Explode, similar to Spark's explode
> > function:
> https://github.com/apache/spark/blob/007c4593058537251d83a4cb44efe31e394aee22/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4542
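> >
> > As a rough illustration (assuming a hypothetical 1 -> many variant of
> > apply; none of these names exist in Connect today):
> >
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     import org.apache.kafka.connect.connector.ConnectRecord;
> >     import org.apache.kafka.connect.data.Struct;
> >
> >     // Hypothetical Explode SMT: emits one record per element of a
> >     // configured array field, like Spark's explode.
> >     public class Explode<R extends ConnectRecord<R>> {
> >         private String arrayField; // would be set via configure()
> >
> >         public List<R> applyMultiple(R record) {
> >             Struct value = (Struct) record.value();
> >             List<Object> elements = value.getArray(arrayField);
> >             List<R> out = new ArrayList<>(elements.size());
> >             for (Object element : elements) {
> >                 // real code would also carry over the non-array fields
> >                 out.add(record.newRecord(record.topic(), record.kafkaPartition(),
> >                         record.keySchema(), record.key(),
> >                         null, element, record.timestamp()));
> >             }
> >             return out;
> >         }
> >     }
> >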
> > Thanks
> > Rabin
>
