Hi,

TL;DR: I'd love to be able to make log compaction more "granular" than just per-partition-key, so I was thinking about the concept of a "composite key", where the partitioning logic uses one part of the key while compaction uses the whole key - is this something desirable / doable / worth a KIP?
Longer story / use case: I'm currently part of a team working on a project that uses a number of applications to ingest data into the system (one "entity type" per app). Since the entities refer to each other, once ingested they're all published to a single topic to ensure ordering for the later processing stages. Because of the nature of the data, for a given set of related entities there's always a single "master" / "parent" entity, whose ID we use as the partition key. To give an example: say you have a "product" entity which can have things like "media", "reviews", "stocks" etc. associated with it - the product ID is the partition key for *all* of these entities.

However, with this approach we simply cannot use log compaction: having e.g. "product", "media" and "review" events, all with the same partition key "X", means the compaction process will at some point delete all but one of them, causing data loss - only a single entity with key "X" will remain (which is absolutely correct behaviour - Kafka doesn't "understand" what the message contains).

We were thinking about introducing something we internally called a "composite key". The idea is to have a key that's not just a single String K, but a pair of Strings: (K1, K2). For choosing the partition the message is sent to, only K1 would be used; for log compaction purposes, however, the whole (K1, K2) would be used instead. This way, referring to the example above, different entities "belonging" to the same "master" entity (product) could be published to that topic with composite keys (productId, "product"), (productId, "media") and (productId, "review"), so they all end up in a single partition (determined by K1, which is always productId), but they won't get compacted together, because the K2 part differs between them, making the whole composite key (K1, K2) different.
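To make the idea concrete, here's a minimal, dependency-free Java sketch of the two halves of a composite key: the partition is derived from K1 (productId) alone, while the record key used for compaction is built from (K1, K2). The class and method names are hypothetical, and String.hashCode() here is only a stand-in for Kafka's real default partitioning (which hashes the serialized key bytes with murmur2) - the point is just that the same productId always lands in the same partition while the compaction keys stay distinct:

```java
// Hypothetical sketch of the "composite key" idea, with no Kafka dependency.
// K1 (productId) decides the partition; (K1, K2) is the record key that
// compaction would see. Names and hashing are illustrative assumptions only.
class CompositeKeySketch {

    // Partition from K1 only, mimicking hash-mod-numPartitions partitioning.
    // (Kafka's default partitioner actually uses murmur2 on serialized key
    // bytes; hashCode() is just a self-contained approximation.)
    static int partitionFor(String productId, int numPartitions) {
        return (productId.hashCode() & 0x7fffffff) % numPartitions;
    }

    // The full (K1, K2) key that compaction would compare.
    static String compactionKey(String productId, String entityType) {
        return productId + ":" + entityType;
    }

    public static void main(String[] args) {
        int numPartitions = 12;

        // Same productId -> same partition, regardless of entity type:
        int p1 = partitionFor("product-42", numPartitions);
        int p2 = partitionFor("product-42", numPartitions);
        System.out.println(p1 == p2); // prints "true"

        // ...but the compaction keys differ, so records survive compaction:
        System.out.println(compactionKey("product-42", "product"));
        System.out.println(compactionKey("product-42", "media"));

        // With today's producer API you'd wire this up by passing the
        // partition explicitly, e.g.:
        // new ProducerRecord<>(topic, p1, compactionKey(productId, type), value)
    }
}
```

This is essentially the "clunky" workaround available today, written out; the proposal would move this mapping into Kafka itself so the user never computes a partition by hand.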
Of course K2 would be optional, so for anyone who only needs the default behaviour nothing would change.

Since I'm not a Kafka developer and I don't know its internals that well, I can't say whether this idea is technically feasible, but I'd think it is - I'd be more afraid of the complexity around backwards compatibility and the potential performance implications of such a change. I know that similar behaviour is achievable by using the producer API that allows explicitly specifying the partition ID (and the key), but I find it a bit "clunky" (for each message, derive the key the message would normally use [productId], somehow "map" that key onto a partition X, then send the message to partition X, *but* use the "compaction" key [productId, entity type] as the message key instead) - and it's something that could be abstracted away from the user.

Thoughts?

Question to Kafka users: Is this something that anyone here would find useful? Is anyone here dealing with a similar problem?

Question to Kafka maintainers: Is this something you could potentially consider a useful feature? Would it be worth a KIP? Is something like this (technically) doable at all?

--
Kind regards,
Michał Michalski