Hi Karsten,

Before delving deeper into Kafka Streams, it's worth considering whether
direct aggregation in the database might be the more straightforward
solution, unless there's a compelling reason to avoid it. Aggregating at
the database level often leads to a more efficient and maintainable
system. This approach could involve creating intermediate tables that
hold the aggregated view, simplifying data management and reducing the
complexity of downstream processing.

If the data model allows, consider pre-aggregating or restructuring data
during writes. This step can significantly streamline the data before it's
ingested into Kafka, which can then be used primarily for CDC and
notifications.

However, if Kafka Streams aligns better with your infrastructure and
requirements, a few more details would help tailor a suggestion. For
instance, are you using RocksDB for the state stores, and how is that
state managed - kept on local disk and backed up via changelog topics in
Kafka?

In Kafka Streams, consider converting the communication and label
KTables to KStreams to minimize state management needs. Keep in mind,
though, that a stream-table join reflects the customer state as of
processing time, so later customer updates won't be applied to
already-emitted records.
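
For illustration, here's a minimal sketch of that idea, reusing the
topics and builder from your example; Comm, Customer, EnrichedComm and
the serde variables are placeholders, not a definitive implementation:

// Keep only the customer side as a table (one state store).
KTable<Long, Customer> customers = streamBuilder.table(customerTopic,
    Consumed.with(Serdes.Long(), customerSerde));

// Treat communication changes as plain events instead of a table, so
// no state store is materialized for them.
KStream<Long, Comm> commStream = streamBuilder.stream(commTopic,
    Consumed.with(Serdes.Long(), commSerde));

// Re-key by customer_id (triggers a repartition), then join each event
// against the customer table. The join sees the customer as of
// processing time; a later customer update is not applied retroactively.
KStream<Long, EnrichedComm> enriched = commStream
    .selectKey((commId, comm) -> comm.getCustomerId())
    .join(customers,
        (comm, customer) -> new EnrichedComm(customer, comm),
        Joined.with(Serdes.Long(), commSerde, customerSerde));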

Also, exploring the Processor API could offer more control over data flow
and state management. It's a bit more complex but can be very effective,
as discussed in this article:
<https://medium.com/@dharinshah1234/the-hidden-complexity-of-kafka-streams-dsl-solving-scaling-issues-8620fd344a6d>
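
As a rough sketch of what a custom processor could look like - the store
name, AggregateCustomer, and its methods are invented for illustration,
and the store has to be registered on the topology and connected to the
processor separately:

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Maintains one AggregateCustomer per customer id in a single store.
public class CustomerAggregator
        implements Processor<Long, Comm, Long, AggregateCustomer> {

    private KeyValueStore<Long, AggregateCustomer> store;
    private ProcessorContext<Long, AggregateCustomer> context;

    @Override
    public void init(ProcessorContext<Long, AggregateCustomer> context) {
        this.context = context;
        this.store = context.getStateStore("customer-aggregate-store");
    }

    @Override
    public void process(Record<Long, Comm> record) {
        // Assumes the record key is already the customer id
        // (re-keyed upstream).
        AggregateCustomer agg = store.get(record.key());
        if (agg == null) {
            agg = new AggregateCustomer(record.key());
        }
        agg.addCommunication(record.value());
        store.put(record.key(), agg);
        // Emit the updated aggregate downstream.
        context.forward(record.withValue(agg));
    }
}

Compared to chained DSL joins, this keeps the whole aggregate in one
store and gives you explicit control over when updates are forwarded.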

Thanks,
Dharin

On Wed, Jan 24, 2024 at 12:55 PM Karsten Stöckmann <
karsten.stoeckm...@gmail.com> wrote:
>
> Hi,
>
> we're currently in the process of evaluating Debezium and Kafka as a CDC
> system for our Postgres database in order to build an indexing solution
> (e.g. Solr or OpenSearch).
>
> Debezium captures changes per table and propagates them into a dedicated
> Kafka topic for each. The ingested tables originally feature multiple
> relationships of all kinds (1:1, 1:n, n:m) - aggregated data from those
> tables should eventually be reflected in composite documents.
> To illustrate this, assume the following heavily simplified table layout:
>
>    - customer(id, name, firstname),
>    - communication(id, customer_id, value),
>    - label(id, customer_id, value),
>    - (maybe more dependent tables related to customers 1:n or 1:1 or even
>    n:m - not considered here for brevity).
>
> All tables are streamed into Kafka topics as pointed out above. Now in
> order to aggregate an output customer with all associated communication
> and label values (i.e. aggregated as lists), what would be the most
> elegant solution leveraging Kafka Streams? Note that customers do not
> necessarily have any communication or label at all, thus non-key joins
> are out of the game as far as I understand.
>
> Our initial (naive) solution was to re-key the dependent KTables and then
> left-join them, but that involves a lot of steps and intermediate state
> stores, especially considering that the stated example is heavily
> simplified:
>
> KTable<Long, Customer> customers = streamBuilder.table(customerTopic,
> Consumed.with(<Serdes>));
> KTable<Long, Comm> comm = streamBuilder.table(commTopic,
> Consumed.with(<Serdes>));
> KTable<Long, Label> labels = streamBuilder.table(labelTopic,
> Consumed.with(<Serdes>));
> // Re-Keying
> KTable<Long, GroupedComm> groupedComm =
> comm.groupBy(...).aggregate(...);
> KTable<Long, GroupedLabel> groupedLabels =
> labels.groupBy(...).aggregate(...);
> // Join
> KTable<Long, AggregateCustomer> aggregated = customers
> .leftJoin(groupedComm, ...)
> .leftJoin(groupedLabels, ...)
> .groupBy(...)
> .aggregate(...);
>
> Are there more efficient / less naive / more elegant / simpler solutions
> to this? Co-grouping input streams did not yield expected results...
>
> Best wishes,
>
> Karsten
