Hello Anatoly,

Jay's understanding of the time-space tradeoff in state store HA is
correct. As for your second question on "best practices": the Streams
high-level DSL has a notion of a KTable in addition to its first-class
KStream objects. A KTable can be viewed as a materialized view of the
stream, optionally backed by a state store. I think this is a natural
fit for event store use cases: assuming your change events are stored
in a Kafka topic "topic1", you can then create a materialized view in
your Streams app as:


KTable table1 = topologyBuilder.table("topic1", "store1");


Or you can construct an aggregated materialized view from a stream of
events (say, in another topic "topic2") as:


KTable table2 = topologyBuilder.stream("topic2")
    .groupBy(...).aggregate(..., "store2");
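
To make this concrete, here is a minimal sketch of one possible
aggregation, counting events per key; the String/Long serdes and the
lambdas are illustrative assumptions, not prescriptive:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KTable;

// using the same topologyBuilder as above
KTable<String, Long> table2 = topologyBuilder
    .stream(Serdes.String(), Serdes.String(), "topic2")
    .groupBy((key, value) -> key, Serdes.String(), Serdes.String())
    .aggregate(
        () -> 0L,                             // initializer: empty count
        (aggKey, event, total) -> total + 1L, // aggregator: count per key
        Serdes.Long(),
        "store2");                            // queryable store name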


The backing state stores "store1" and "store2" can then be viewed as
current snapshots of the materialized views of the change events, and
users can query them just as they would query a normal storage engine:

https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
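
For example, a minimal sketch of a point lookup against "store1"
through the interactive queries API, assuming String keys and values,
that "streams" is your running KafkaStreams instance, and that "alice"
is just a placeholder key:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// obtain a read-only handle on the local state store "store1"
ReadOnlyKeyValueStore<String, String> view =
    streams.store("store1", QueryableStoreTypes.<String, String>keyValueStore());

// read the current snapshot value for one key (local partitions only)
String latest = view.get("alice");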


Guozhang

On Fri, Dec 16, 2016 at 11:27 AM, Jay Kreps <j...@confluent.io> wrote:

> Good question! Here's my understanding.
>
> The streams API has a config num.standby.replicas. If this value is set to
> 0, the default, then the local state will have to be recreated by
> re-reading the relevant Kafka partition and replaying that into the state
> store, and as you point out this will take time proportional to the amount
> of data. If you set this value to something more than 0, then a "standby
> task" will be kept on one of the other instances. This standby won't do any
> processing it will just passively replicate the state changes of the
> primary task; in the event of a failure this standby task will be able to
> take over very quickly because it already has the full state pre-created.
>
> So you have a choice of redundancy in either "time" (by replaying data) or
> "space" (by storing multiple copies).
>
> (Hopefully that's correct, I don't have the firmest grasp on how the
> standby tasks work.)
>
> -Jay
>
> On Thu, Dec 15, 2016 at 6:10 PM, Anatoly Pulyaevskiy <
> anatoly.pulyaevs...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I've been reading a lot about new features in Kafka Streams and
> > everything looks very promising. There is even an article on Kafka
> > and Event Sourcing:
> > https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
> >
> > There are a couple of things that I'm concerned about, though. For
> > Event Sourcing it is assumed that there is a way to fetch all events
> > for a particular object and replay them in order to get the "latest
> > snapshot" of that object.
> >
> > It seems (and the article says so) that the StateStore in Kafka
> > Streams can be used to achieve that.
> >
> > My first question is: would it scale well for millions of objects?
> > I understand that a StateStore is backed by a compacted Kafka topic,
> > so in the event of a failure Kafka Streams will recover to the latest
> > state by reading all messages from that topic. But my suspicion is
> > that for millions of objects this may take a while (it would need to
> > read the whole partition for each object); is this a correct
> > assumption?
> >
> > My second question is: would it make more sense to use an external
> > DB in such a case, or is there a "best practice" around implementing
> > Event Sourcing and using Kafka's internal StateStore as an
> > EventStore?
> >
> > Thanks,
> > Anatoly
> >
>



-- 
-- Guozhang
