I think you can do this now by using a custom partitioner, no? https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/producer/Partitioner.html
-Jay On Mon, Oct 2, 2017 at 6:29 AM Michal Michalski <michal.michal...@zalando.ie> wrote: > Hi, > > TL;DR: I'd love to be able to make log compaction more "granular" than just > per-partition-key, so I was thinking about the concept of a "composite > key", where partitioning logic is using one part of the key, while > compaction uses the whole key - is this something desirable / doable / > worth a KIP? > > Longer story / use case: > > I'm currently a member of a team working on a project that's using a bunch > of applications to ingest data to the system (one "entity type" per app). > Once ingested by each application, since the entities are referring to each > other, they're all published to a single topic to ensure ordering for later > processing stages. Because of the nature of the data, for a given set of > entities related together, there's always a single "master" / parent" > entity, which ID we're using as the partition key; to give an example: > let's say you have "product" entity which can have things like "media", > "reviews", "stocks" etc. associated with it - product ID will be the > partition key for *all* these entities. However, with this approach we > simply cannot use log compaction because having e.g. "product", "media" and > "review" events, all with the same partition key "X", means that compaction > process will at some point delete all but one of them, causing a data loss > - only a single entity with key "X" will remain (and that's absolutely > correct - Kafka doesn't "understand" what does the message contain). > > We were thinking about introducing something we internally called > "composite key". The idea is to have a key that's not just a single String > K, but a pair of Strings: (K1, K2). For specifying the partition that the > message should be sent to, K1 would be used; however, for log compaction > purposes, the whole (K1, K2) would be used instead. This way, referring to > the example above, different entities "belonging" to the same "master > entity" (product), could be published to that topic with composite keys: > (productId, "product"), (productId, "media") and (productId, "review"), so > they all end up in single partition (specified by K1, which is always: > productId), but they won't get compacted together, because the K2 part is > different for them, making the whole "composite key" (K1, K2) different. Of > course K2 would be optional, so for someone who only needs the default > behaviour nothing would change. > > Since I'm not a Kafka developer and I don't know its internals that well, I > can't say if this idea is technically feasible or not, but I'd think it is > - I'd be more afraid of the complexity around backwards compatibility etc. > and potential performance implications of such change. > > I know that similar behaviour is achievable by using the producer API that > allows explicitly specifying the partition ID (and the key), but I think > it's a bit "clunky" (for each message, generate a key that this message > should normally be using [productId] and somehow "map" that key into a > partition X; then send that message to this partition X, *but* use the > "compaction" key instead [productId, entity type] as the message key) and > it's something that could be abstracted away from the user. > > Thoughts? > > Question to Kafka users: Is this something that anyone here would find > useful? Is anyone here dealing with similar problem? > > Question to Kafka maintainers: Is this something that you could potentially > consider a useful feature? Would it be worth a KIP? Is something like this > (technically) doable at all? > > -- > Kind regards, > MichaĆ Michalski >