> If I understand this correctly: assuming I have a simple aggregator
> distributed across n-docker instances each instance will _also_ need to
> support some sort of communications process for allowing access to its
> statestore (last param from KStream.groupby.aggregate).

Yes. See
http://docs.confluent.io/current/streams/developer-guide.html#your-application-and-interactive-queries .

> - The tombstoning facilities of redis or C* would lend themselves well to
> implementing a 'true' rolling aggregation

What is a 'true' rolling aggregation, and how could Redis or C* help with that in a way that Kafka can't? (Honest question.)

> I get that RocksDB has a small footprint but given the choice of
> implementing my own RPC / gossip-like process for data sharing and using a
> well tested one (ala C* or redis) I would almost always opt for the latter.
> [...]
> Just my $0.02. I would love to hear why Im missing the 'big picture'. The
> kstreams architecture seems rife with potential.

One question is, for example: can the remote/central DB of your choice (Redis, C*) handle as many qps as Kafka/Kafka Streams/RocksDB can handle? Over the network? At the same low latency? Also, what happens if the remote DB is unavailable? Do you wait and retry? Discard? Accept the fact that your app's processing latency will now go through the roof?

I wrote about some such scenarios at
https://www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events-using-kafka-streams/ .

One big advantage (for many use cases, not for all) with Kafka/Kafka Streams is that you can leverage fault-tolerant *local* state that may also be distributed across app instances. Local state is much more efficient and faster when doing stateful processing such as joins or aggregations. You don't need to worry about an external system: whether it's up and running, whether its version is still compatible with your app, whether it can scale as much as your app/Kafka Streams/Kafka/the volume of your input data.

Also, note that some users have actually opted to run hybrid setups: Some processing output is sent to a remote data store like Cassandra (e.g. via Kafka Connect), some processing output is exposed directly through interactive queries.
It's not like you're forced to pick only one approach.

> - Typical microservices would separate storing / retrieving data

I'd rather argue that for microservices you'd oftentimes prefer to *not* use a remote DB, and rather do whatever the microservice needs to do inside the microservice itself (perhaps we could relax this to "do everything in a way that your microservice is in full, exclusive control", i.e. it doesn't necessarily need to be *inside*, but arguably it would be better if it actually is). See e.g. the article
https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
that lists some of the reasoning behind this school of thinking.

Again, YMMV. Personally, I think there's no simple true/false here. The decisions depend on what you need, what your context is, etc.

Anyways, since you already have some opinions for the one side, I wanted to share some food for thought for the other side of the argument. :-)

Best,
Michael

On Fri, Mar 24, 2017 at 1:25 PM, Jon Yeargers <jon.yearg...@cedexis.com> wrote:

> If I understand this correctly: assuming I have a simple aggregator
> distributed across n-docker instances each instance will _also_ need to
> support some sort of communications process for allowing access to its
> statestore (last param from KStream.groupby.aggregate).
>
> How would one go about substituting a separated db (EG redis) for the
> statestore?
>
> Some advantages to decoupling:
> - It would seem like having a centralized process like this would alleviate
> the need to execute multiple requests for a given kv pair (IE "who has this
> data?" and subsequent requests to retrieve it).
> - it would take some pressure off of each node to maintain a large disk
> store
> - Typical microservices would separate storing / retrieving data
> - It would raise some eyebrows if a spec called for a mysql/nosql instance
> to be installed with every docker container
> - The tombstoning facilities of redis or C* would lend themselves well to
> implementing a 'true' rolling aggregation
>
> I get that RocksDB has a small footprint but given the choice of
> implementing my own RPC / gossip-like process for data sharing and using a
> well tested one (ala C* or redis) I would almost always opt for the latter.
> (Footnote: Our implementations already heavily use redis/memcached for
> deduplication of kafka messages so it would seem a small step to use the
> same to store aggregation results.)
>
> Just my $0.02. I would love to hear why Im missing the 'big picture'. The
> kstreams architecture seems rife with potential.
>
> On Thu, Mar 23, 2017 at 3:17 PM, Matthias J. Sax <matth...@confluent.io>
> wrote:
>
> > The config does not "do" anything. It's metadata that get's broadcasted
> > to other Streams instances for IQ feature.
> >
> > See this blog post for more details:
> > https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
> >
> > Happy to answer any follow up question.
> >
> >
> > -Matthias
> >
> > On 3/23/17 11:51 AM, Jon Yeargers wrote:
> > > What does this config param do?
> > >
> > > I see it referenced / used in some samples and here (
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-67%3A+Queryable+state+for+Kafka+Streams
> > > )
> > >
> > >
> >
> >
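
For readers following the "who has this data?" question in this thread, here is a minimal sketch of the idea behind querying distributed local state: each instance keeps the aggregates for the partitions it owns, and shared metadata says which instance to ask for a given key. This is plain Python and not Kafka Streams code. The names `Instance`, `Cluster`, `partition_for`, and the endpoints `host-a:7070` / `host-b:7070` are all hypothetical, and `zlib.crc32` is only a stand-in for the hash that a real partitioner would use. In an actual Kafka Streams application the library distributes the metadata, and the RPC layer between instances is up to the application.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, for illustration only


def partition_for(key: str) -> int:
    # Stand-in hash: a real Kafka partitioner uses a different hash function.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class Instance:
    """One app instance holding local state for the partitions it owns."""

    def __init__(self, endpoint, partitions):
        self.endpoint = endpoint            # hypothetical host:port for RPC
        self.partitions = set(partitions)
        self.store = defaultdict(int)       # local state: key -> running count

    def process(self, key):
        # An instance only ever sees records from its own partitions.
        assert partition_for(key) in self.partitions
        self.store[key] += 1

    def local_get(self, key):
        # Read-only lookup; returns None for keys never seen.
        return self.store.get(key)


class Cluster:
    """Holds the partition -> instance metadata that every instance would know."""

    def __init__(self, instances):
        self.owner = {p: inst for inst in instances for p in inst.partitions}

    def send(self, key):
        # Route a record to the instance that owns the key's partition.
        self.owner[partition_for(key)].process(key)

    def query(self, key):
        # "Who has this data?" is answered from metadata, then one lookup
        # is made against that instance's local store.
        inst = self.owner[partition_for(key)]
        return inst.endpoint, inst.local_get(key)


cluster = Cluster([
    Instance("host-a:7070", [0, 1]),
    Instance("host-b:7070", [2, 3]),
])
for k in ["alice", "bob", "alice"]:
    cluster.send(k)

endpoint, count = cluster.query("alice")
```

In this sketch the owner lookup is a local dictionary read, so only the final fetch would cross the network in a real deployment, and only when the key lives on a different instance.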