just my 2 cents

the best answers usually come from real-world practice :)

RocksDB (https://rocksdb.org/) is the implementation of the "state store" in 
Kafka Streams, and it is an "embedded" KV store (which is different from a 
distributed KV store). The "state store" in Kafka Streams is also backed by a 
"changelog" topic, which is where the physical KV data is durably stored.
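To make that concrete, here is a minimal DSL sketch (the topic name 
"page-views" and store name "views-per-user" are made up for illustration) 
showing how a named aggregation gets an embedded RocksDB store plus a 
changelog topic:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// Hypothetical input topic and store name.
KTable<String, Long> viewsPerUser = builder
    .<String, String>stream("page-views")
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("views-per-user"));

// "views-per-user" lives in an embedded RocksDB instance on the
// application node; every update is also written to the compacted
// changelog topic "<application.id>-views-per-user-changelog", which is
// what gets replayed to rebuild the store after the node is lost.
```

This needs a Kafka cluster and the kafka-streams dependency to actually run, 
so treat it as a sketch of the wiring, not a runnable snippet.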

The performance hit may happen if:
(1) one of the application nodes goes down (the node that runs Kafka Streams, 
since Kafka Streams is a library). The "state store" then has to be rebuilt 
from the changelog topic, and if the changelog topic is huge, the rebuild 
time could be long.
(2) the stream topology is complex, with multiple state stores and multiple 
aggregation (or "reduce") operations. The rebuild/recovery time after a 
failure could then be long.
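On point (2): using the 5/10/30-day windows from your question as an example 
(topic and store names here are hypothetical), each windowed aggregation 
materializes its own store, so a failure means three stores to restore:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KGroupedStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
KGroupedStream<String, String> grouped =
    builder.<String, String>stream("metrics").groupByKey();

// Each windowed count below materializes its own RocksDB store and its
// own changelog topic -- three stores to rebuild after a failure.
grouped.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(5)))
       .count(Materialized.as("metrics-5d"));
grouped.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(10)))
       .count(Materialized.as("metrics-10d"));
grouped.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(30)))
       .count(Materialized.as("metrics-30d"));
```

Again just a sketch against the DSL; it needs the kafka-streams dependency 
and a running cluster.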

`num.standby.replicas` should help to significantly reduce the rebuild time, 
but it comes with a storage cost, since the "state store" is replicated on a 
different node.
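Setting it is just one line in the Streams config (shown here with the plain 
property key rather than the StreamsConfig constant):

```java
import java.util.Properties;

Properties props = new Properties();
// ... your usual application.id / bootstrap.servers settings ...

// Keep one warm replica of each state store on another instance, so
// failover only has to replay the changelog tail instead of the whole
// topic. Default is 0 (no standbys).
props.put("num.standby.replicas", "1");
```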





On 2021/03/16 01:11:00, Gareth Collins <gareth.o.coll...@gmail.com> wrote: 
> Hi,
> 
> We have a requirement to calculate metrics on a huge number of keys (could
> be hundreds of millions, perhaps billions of keys - attempting caching on
> individual keys in many cases will have almost a 0% cache hit rate). Is
> Kafka Streams with RocksDB and compacting topics the right tool for a task
> like that?
> 
> As well, just from playing with Kafka Streams for a week it feels like it
> wants to create a lot of separate stores by default (if I want to calculate
> aggregates on five, ten and 30 days I will get three separate stores by
> default for this state data). Coming from a different distributed storage
> solution, I feel like I want to put them together in one store as I/O has
> always been my bottleneck (1 big read and 1 big write is better than three
> small separate reads and three small separate writes).
> 
> But am I perhaps missing something here? I don't want to avoid the DSL that
> Kafka Streams provides if I don't have to. Will the Kafka Streams RocksDB
> solution be so much faster than a distributed read that it won't be the
> bottleneck even with huge amounts of data?
> 
> Any info/opinions would be greatly appreciated.
> 
> thanks in advance,
> Gareth Collins
> 
