Log compaction is well-suited for applications that use Kafka directly and need to persist some state associated with their processing. So something like offset management for consumers <http://www.slideshare.net/jjkoshy/offset-management-in-kafka> is a good fit. Another good use case is storing schemas <https://github.com/confluentinc/schema-registry/blob/master/core/src/main/java/io/confluent/kafka/schemaregistry/storage/KafkaStore.java> associated with your Kafka topics. Both of these are very specific to maintaining metadata around your stream processing.

Although log compaction can be used for more general K-V storage, it is not *always* a good fit. This is especially true if your key space is bound to grow significantly over time or has a high update rate. The other consideration is that you need some sort of cache of your key-value pairs (since otherwise lookups would require scanning the log).

So for application-level general K-V storage, you could certainly use Kafka as a persistence mechanism for recording recent updates (with traditional time-based retention), but you would probably want a more suitable K-V store separate from Kafka. I'm not sure this (i.e., traditional DB storage) is your use case, since you mention "a lot of stream processing on these messages" - it sounds more like repetitive processing over the entire key space, and for that log compaction may be more reasonable. The alternative is to use snapshots and read the more recent updates from the update stream in Kafka. The Samza folks may want to weigh in here as well.
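To make the caching point concrete, here is a rough, illustrative sketch of bootstrapping an in-memory map by replaying a compacted topic with the Java consumer client. The topic name, group id, and String serdes are placeholders, not anything specific to your setup:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CompactedTopicCache {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "kv-cache-bootstrap");       // placeholder group id
    props.put("enable.auto.commit", "false");
    props.put("auto.offset.reset", "earliest");        // replay the compacted log from the start
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    // Latest value per key, maintained by tailing the compacted topic.
    Map<String, String> cache = new ConcurrentHashMap<>();

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("profiles"));  // placeholder topic name
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(500);
        for (ConsumerRecord<String, String> record : records) {
          if (record.value() == null) {
            cache.remove(record.key());                // tombstone: key was deleted
          } else {
            cache.put(record.key(), record.value());   // last write wins
          }
        }
        // 'cache' can now serve point lookups without scanning the log.
      }
    }
  }
}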
That said, to answer your question: sure, it is feasible to use log compaction with 1B keys, especially if you have enough brokers, partitions, and log cleaner threads (a config sketch is appended at the end of this mail), but I'm not sure it is the best approach to take. We did hit various issues (bugs/feature gaps) with log compaction while using it for consumer offset management (e.g., lack of support for compressed messages, among other bugs), but most of these have since been resolved.

Hope that helps,

Joel

On Tue, Oct 6, 2015 at 8:34 PM, Feroze Daud <khic...@yahoo.com.invalid> wrote:
> hi!
> We have a use case where we want to store ~100m keys in kafka. Is there any problem with this approach?
> I have heard from some people using kafka, that kafka has a problem when doing log compaction with those many number of keys.
> Another topic might have around 10 different K/V pairs for each key in the primary topic. The primary topic's keyspace is approx of 100m keys. We would like to store this in kafka because we are doing a lot of stream processing on these messages, and want to avoid writing another process to recompute data from snapshots.
> So, in summary:
> primary topic: ~100m keys
> secondary topic: ~1B keys
> Is it feasible to use log compaction at such a scale of data?
> Thanks
> feroze.
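Config sketch referenced above (illustrative only; the topic name, partition count, and values are placeholders you would tune for your own workload and Kafka version):

  # Topic-level: enable compaction when creating the topic
  bin/kafka-topics.sh --zookeeper localhost:2181 --create \
    --topic profiles --partitions 50 --replication-factor 3 \
    --config cleanup.policy=compact \
    --config min.cleanable.dirty.ratio=0.5 \
    --config segment.ms=86400000 \
    --config delete.retention.ms=86400000

  # Broker-level (server.properties): make sure the cleaner is enabled and has
  # enough threads and dedupe buffer for a large key space
  log.cleaner.enable=true
  log.cleaner.threads=4
  log.cleaner.dedupe.buffer.size=536870912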