Re: The idea of "composite key" to make log compaction more flexible - question / proposal

Jay Kreps Thu, 05 Oct 2017 07:22:48 -0700

I think you can do this now by using a custom partitioner, no?

https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/producer/Partitioner.html
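
A minimal sketch of the partition-selection logic such a partitioner might use, assuming the composite key is serialized as one string of the form "K1|K2". The separator, the class name and the method names are hypothetical and do not come from Kafka or from this thread. The hash is also only a stand-in: Kafka's default partitioner hashes the key bytes with murmur2.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; none of these names are part of the Kafka API.
final class CompositeKeyPartitioning {
    // Assumed convention: the record key is "K1|K2", e.g. "42|media".
    static final char SEPARATOR = '|';

    // Returns the K1 part of the key. A key without a separator is
    // treated as consisting of K1 only (the "K2 is optional" case).
    static String partitionPart(String compositeKey) {
        int idx = compositeKey.indexOf(SEPARATOR);
        return idx < 0 ? compositeKey : compositeKey.substring(0, idx);
    }

    // Chooses a partition from K1 alone, so every entity that shares a
    // productId lands in the same partition. The full key (K1 and K2)
    // is still what gets written to the log and used by compaction.
    static int partitionFor(String compositeKey, int numPartitions) {
        byte[] k1 = partitionPart(compositeKey).getBytes(StandardCharsets.UTF_8);
        // Stand-in hash (assumption); floorMod keeps the result non-negative.
        return Math.floorMod(Arrays.hashCode(k1), numPartitions);
    }
}
```

To use this with a producer, a class implementing the Partitioner interface linked above would presumably call partitionFor from its partition(...) method, taking the partition count from the Cluster argument, and be registered through the partitioner.class producer setting.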


-Jay

On Mon, Oct 2, 2017 at 6:29 AM Michal Michalski <michal.michal...@zalando.ie>
wrote:

> Hi,
>
> TL;DR: I'd love to be able to make log compaction more "granular" than just
> per-partition-key, so I was thinking about the concept of a "composite
> key", where partitioning logic is using one part of the key, while
> compaction uses the whole key - is this something desirable / doable /
> worth a KIP?
>
> Longer story / use case:
>
> I'm currently a member of a team working on a project that's using a bunch
> of applications to ingest data to the system (one "entity type" per app).
> Once ingested by each application, since the entities are referring to each
> other, they're all published to a single topic to ensure ordering for later
> processing stages. Because of the nature of the data, for a given set of
> entities related together, there's always a single "master" / "parent"
> entity, whose ID we're using as the partition key; to give an example:
> let's say you have a "product" entity which can have things like "media",
> "reviews", "stocks" etc. associated with it - product ID will be the
> partition key for *all* these entities. However, with this approach we
> simply cannot use log compaction because having e.g. "product", "media" and
> "review" events, all with the same partition key "X", means that the
> compaction process will at some point delete all but one of them, causing
> data loss - only a single entity with key "X" will remain (and that's
> absolutely correct - Kafka doesn't "understand" what the message contains).
>
> We were thinking about introducing something we internally called
> "composite key". The idea is to have a key that's not just a single String
> K, but a pair of Strings: (K1, K2). For specifying the partition that the
> message should be sent to, K1 would be used; however, for log compaction
> purposes, the whole (K1, K2) would be used instead. This way, referring to
> the example above, different entities "belonging" to the same "master
> entity" (product), could be published to that topic with composite keys:
> (productId, "product"), (productId, "media") and (productId, "review"), so
> they all end up in a single partition (specified by K1, which is always:
> productId), but they won't get compacted together, because the K2 part is
> different for them, making the whole "composite key" (K1, K2) different. Of
> course K2 would be optional, so for someone who only needs the default
> behaviour nothing would change.
>
> Since I'm not a Kafka developer and I don't know its internals that well, I
> can't say if this idea is technically feasible or not, but I'd think it is
> - I'd be more afraid of the complexity around backwards compatibility etc.
> and potential performance implications of such a change.
>
> I know that similar behaviour is achievable by using the producer API that
> allows explicitly specifying the partition ID (and the key), but I think
> it's a bit "clunky" (for each message, generate a key that this message
> should normally be using [productId] and somehow "map" that key into a
> partition X; then send that message to this partition X, *but* use the
> "compaction" key instead [productId, entity type] as the message key) and
> it's something that could be abstracted away from the user.
>
> Thoughts?
>
> Question to Kafka users: Is this something that anyone here would find
> useful? Is anyone here dealing with a similar problem?
>
> Question to Kafka maintainers: Is this something that you could potentially
> consider a useful feature? Would it be worth a KIP? Is something like this
> (technically) doable at all?
>
> --
> Kind regards,
> Michał Michalski
>
