Good question! Here's my understanding. The Streams API has a config num.standby.replicas. If this value is set to 0 (the default), then on failure the local state has to be recreated by re-reading the relevant Kafka partition and replaying it into the state store, and as you point out this will take time proportional to the amount of data. If you set this value to something more than 0, then a "standby task" will be kept on one of the other instances. This standby won't do any processing; it will just passively replicate the state changes of the primary task. In the event of a failure the standby task can take over very quickly because it already has the full state pre-created.
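For concreteness, here's a minimal sketch of how that would be configured; num.standby.replicas is just an ordinary Streams property set alongside the usual ones (the application id and broker address below are placeholders, not anything from your setup):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    // placeholder application id and broker address, substitute your own
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    // keep one passive replica of each task's state on another instance
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
    // then construct the KafkaStreams instance with these props as usual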
So you have a choice of redundancy in either "time" (by replaying data) or "space" (by storing multiple copies). (Hopefully that's correct, I don't have the firmest grasp on how the standby tasks work.)

-Jay

On Thu, Dec 15, 2016 at 6:10 PM, Anatoly Pulyaevskiy <anatoly.pulyaevs...@gmail.com> wrote:
> Hi everyone,
>
> I've been reading a lot about new features in Kafka Streams and everything
> looks very promising. There is even an article on Kafka and Event Sourcing:
> https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
>
> There are a couple of things that I'm concerned about though. For Event
> Sourcing it is assumed that there is a way to fetch all events for a
> particular object and replay them in order to get "latest snapshot" of that
> object.
>
> It seems like (and the article says so) that StateStore in KafkaStreams can
> be used to achieve that.
>
> My first question is would it scale well for millions of objects?
> I understand that StateStore is backed by a compacted Kafka topic so in an
> event of failure KafkaStreams will recover to the latest state by reading
> all messages from that topic. But my suspicion is that for millions of
> objects this may take a while (it would need to read the whole partition
> for each object), is this a correct assumption?
>
> My second question is would it make more sense to use an external DB in
> such case or is there a "best practice" around implementing Event Sourcing
> and using Kafka's internal StateStore as EventStore?
>
> Thanks,
> Anatoly
>