Hi,

TL;DR: I'd love to be able to make log compaction more "granular" than just
per-partition-key, so I was thinking about the concept of a "composite
key", where the partitioning logic uses one part of the key while
compaction uses the whole key - is this something desirable / doable /
worth a KIP?

Longer story / use case:

I'm currently a member of a team working on a project that uses a bunch
of applications to ingest data into the system (one "entity type" per app).
Once ingested, since the entities refer to each other, they're all
published to a single topic to ensure ordering for later processing
stages. Because of the nature of the data, for a given set of related
entities there's always a single "master" / "parent" entity, whose ID we
use as the partition key. To give an example: say you have a "product"
entity which can have things like "media", "reviews", "stocks" etc.
associated with it - the product ID will be the partition key for *all*
these entities. However, with this approach we simply cannot use log
compaction: having e.g. "product", "media" and "review" events, all with
the same partition key "X", means that the compaction process will at some
point delete all but one of them, causing data loss - only a single record
with key "X" will remain (and that's absolutely correct - Kafka doesn't
"understand" what the message contains).
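To make the failure mode concrete, here's a minimal sketch (plain Python, not Kafka itself) of compaction's "retain the latest record per key" behaviour; the payload strings are purely illustrative:

```python
# Minimal simulation of log compaction's retain-latest-per-key behaviour,
# to illustrate the data loss described above.

def compact(log):
    """Keep only the most recent record for each key, preserving offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    return [(key, value) for key, (offset, value) in
            sorted(latest.items(), key=lambda kv: kv[1][0])]

# All entities for product "X" share the partition key "X"...
log = [("X", "product payload"), ("X", "media payload"), ("X", "review payload")]

# ...so compaction keeps only the last record - the "product" and "media"
# records are lost.
print(compact(log))  # [('X', 'review payload')]
```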

We were thinking about introducing something we internally called a
"composite key". The idea is to have a key that's not just a single String
K, but a pair of Strings: (K1, K2). For choosing the partition the message
is sent to, only K1 would be used; for log compaction purposes, the whole
(K1, K2) would be used instead. This way, referring to the example above,
different entities "belonging" to the same "master entity" (product) could
be published to that topic with composite keys (productId, "product"),
(productId, "media") and (productId, "review"), so they all end up in a
single partition (determined by K1, which is always productId), but they
won't be compacted together, because the K2 part differs, making the whole
composite key (K1, K2) different. Of course K2 would be optional, so for
someone who only needs the default behaviour nothing would change.
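A toy simulation of the proposed semantics (assumptions: Python's hash() stands in for Kafka's murmur2-based default partitioner, and compact() keeps the latest record per key, as compaction does):

```python
# Sketch of the composite-key idea: partition by K1 only, compact by (K1, K2).

def partition_for(k1, num_partitions):
    # Stand-in for Kafka's default partitioner - only K1 is hashed.
    return hash(k1) % num_partitions

def compact(log):
    """Keep only the most recent record for each (K1, K2) key."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    return [(k, v) for k, (o, v) in sorted(latest.items(), key=lambda kv: kv[1][0])]

records = [(("X", "product"), "product payload"),
           (("X", "media"),   "media payload"),
           (("X", "review"),  "review payload")]

# Same partition for all three records, since K1 == "X" everywhere...
partitions = {partition_for(k1, 12) for (k1, _k2), _ in records}
assert len(partitions) == 1

# ...but compaction on the full (K1, K2) key keeps one record per entity type.
print(len(compact(records)))  # 3
```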

Since I'm not a Kafka developer and don't know its internals that well, I
can't say whether this idea is technically feasible, but I'd think it is -
I'd be more worried about the complexity around backwards compatibility
and the potential performance implications of such a change.

I know that similar behaviour is achievable via the producer API that
allows explicitly specifying the partition ID (and the key), but I think
it's a bit "clunky" (for each message, generate the key that message would
normally use [productId], somehow "map" that key to a partition X, then
send the message to partition X *but* use the "compaction" key [productId
+ entity type] as the message key instead) - and it's something that could
be abstracted away from the user.
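The clunky workaround boils down to a small routing helper like the one below (names and the key format are illustrative; hash() stands in for the client's murmur2 hash). The actual send would then pass the partition explicitly, e.g. via ProducerRecord(topic, partition, key, value) in the Java client:

```python
# Sketch of the manual workaround: pick the partition from the "natural"
# key yourself, but hand a different, composite key to the producer.

def route(product_id, entity_type, num_partitions):
    # 1. Hash the natural partition key (productId) to pick a partition...
    partition = hash(product_id) % num_partitions  # stand-in for murmur2
    # 2. ...but attach a composite key so compaction sees (K1, K2).
    compaction_key = f"{product_id}:{entity_type}"
    return partition, compaction_key

p1, key1 = route("X", "product", 12)
p2, key2 = route("X", "review", 12)
assert p1 == p2    # same partition: ordering across entity types preserved
assert key1 != key2  # distinct keys: entity types don't compact each other away
```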

Thoughts?

Question to Kafka users: Is this something that anyone here would find
useful? Is anyone here dealing with a similar problem?

Question to Kafka maintainers: Is this something that you could potentially
consider a useful feature? Would it be worth a KIP? Is something like this
(technically) doable at all?

--
Kind regards,
MichaƂ Michalski
