Just want to chime in on this question, as this does seem like a good option for avoiding a memory-hungry K/V store when some async processing is acceptable. There are cases where you want a combination of near-realtime and offline processing of the same index, and a Kafka topic is much more efficient in terms of memory: you can read the messages out much faster by increasing parallelism, versus introducing another component into your stack just to iterate over the keys and produce results.
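For illustration, here is a minimal sketch of what I mean by reading the compacted topic back out instead of querying a separate store. This uses the Java consumer API; the topic name "entity-index", the bootstrap server, the group id, and the fixed poll loop are all placeholder assumptions, not anything from this thread, and a real loader would track end offsets rather than polling a fixed number of times.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: rebuild a key -> latest-value view from a compacted topic.
// Topic name, bootstrap servers, and group id are placeholders.
public class CompactedTopicLoader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "index-loader");
        props.put("auto.offset.reset", "earliest");   // read from the start of the log
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, String> index = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("entity-index"));
            // Crude catch-up loop; later records for a key overwrite earlier ones,
            // mirroring what log compaction keeps on the broker.
            for (int i = 0; i < 100; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        index.remove(record.key());   // tombstone => key deleted
                    } else {
                        index.put(record.key(), record.value());
                    }
                }
            }
        }
        System.out.println("Loaded " + index.size() + " keys");
    }
}
```

With the topic partitioned heavily, as Jan suggests below, several such consumers can split the key space and load it in parallel.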
Considering Joel's feedback, should we avoid this option entirely, or should we shard it heavily and then be ok storing approximately 500 million to a billion messages in the topic?

Thanks

On Thu, Oct 8, 2015 at 12:46 PM, Jan Filipiak <jan.filip...@trivago.com> wrote:

> Hi,
>
> just want to pick this up again. You can always use more partitions to
> reduce the number of keys handled by a single broker and parallelize the
> compaction. So with a sufficient number of machines and the ability to
> partition, I don't see you running into problems.
>
> Jan
>
>
> On 07.10.2015 05:34, Feroze Daud wrote:
>
>> hi!
>> We have a use case where we want to store ~100m keys in kafka. Is there
>> any problem with this approach?
>> I have heard from some people using kafka that kafka has a problem when
>> doing log compaction with that many keys.
>> Another topic might have around 10 different K/V pairs for each key in
>> the primary topic. The primary topic's keyspace is approx 100m keys. We
>> would like to store this in kafka because we are doing a lot of stream
>> processing on these messages, and want to avoid writing another process to
>> recompute data from snapshots.
>> So, in summary:
>> primary topic: ~100m keys
>> secondary topic: ~1B keys
>> Is it feasible to use log compaction at such a scale of data?
>> Thanks
>> feroze.
>>
>