Yes, agreed -- the re-thinking pre-existing notions is a big part of such conversations. A bit like making the mental switch from object-oriented programming to functional programming -- and, just like in this case, neither is more "right" than the other. Personal opinion/preference/context matters a lot, hence I tried to carefully phrase my answer in a way that it doesn't come across as potentially indoctrinating. ;-)
On Fri, Mar 24, 2017 at 6:34 PM, Jon Yeargers <jon.yearg...@cedexis.com> wrote: > You make some great cases for your architecture. To be clear - Ive been > proselytizing for kafka since I joined this company last year. I think my > largest issue is rethinking some preexisting notions about streaming to > make them work in the kstream universe. > > On Fri, Mar 24, 2017 at 6:07 AM, Michael Noll <mich...@confluent.io> > wrote: > > > > If I understand this correctly: assuming I have a simple aggregator > > > distributed across n-docker instances each instance will _also_ need to > > > support some sort of communications process for allowing access to its > > > statestore (last param from KStream.groupby.aggregate). > > > > Yes. > > > > See > > http://docs.confluent.io/current/streams/developer- > > guide.html#your-application-and-interactive-queries > > . > > > > > - The tombstoning facilities of redis or C* would lend themselves well > to > > > implementing a 'true' rolling aggregation > > > > What is a 'true' rolling aggregation, and how could Redis or C* help with > > that in a way that Kafka can't? (Honest question.) > > > > > > > I get that RocksDB has a small footprint but given the choice of > > > implementing my own RPC / gossip-like process for data sharing and > using > > a > > > well tested one (ala C* or redis) I would almost always opt for the > > latter. > > > [...] > > > Just my $0.02. I would love to hear why Im missing the 'big picture'. > The > > > kstreams architecture seems rife with potential. > > > > One question is, for example: can the remote/central DB of your choice > > (Redis, C*) handle as many qps as Kafka/Kafka Streams/RocksDB can handle? > > Over the network? At the same low latency? Also, what happens if the > > remote DB is unavailable? Do you wait and retry? Discard? Accept the > > fact that your app's processing latency will now go through the roof? I > > wrote about some such scenarios at > > https://www.confluent.io/blog/distributed-real-time-joins- > > and-aggregations-on-user-activity-events-using-kafka-streams/ > > . > > > > One big advantage (for many use cases, not for) with Kafka/Kafka Streams > is > > that you can leverage fault-tolerant *local* state that may also be > > distributed across app instances. Local state is much more efficient and > > faster when doing stateful processing such as joins or aggregations. You > > don't need to worry about an external system, whether it's up and > running, > > whether its version is still compatible with your app, whether it can > scale > > as much as your app/Kafka Streams/Kafka/the volume of your input data. > > > > Also, note that some users have actually opted to run hybrid setups: > Some > > processing output is sent to a remote data store like Cassandra (e.g. via > > Kafka Connect), some processing output is exposed directly through > > interactive queries. It's not like your forced to pick only one > approach. > > > > > > > - Typical microservices would separate storing / retrieving data > > > > I'd rather argue that for microservices you'd oftentimes prefer to *not* > > use a remote DB, and rather do everything inside your microservice > whatever > > the microservice needs to do (perhaps we could relax this to "do > everything > > in a way that your microservices is in full, exclusive control", i.e. it > > doesn't necessarily need to be *inside*, but arguably it would be better > if > > it actually is). > > See e.g. the article > > https://www.confluent.io/blog/data-dichotomy-rethinking-the- > > way-we-treat-data-and-services/ > > that lists some of the reasoning behind this school of thinking. Again, > > YMMV. > > > > Personally, I think there's no simple true/false here. The decisions > > depend on what you need, what your context is, etc. Anyways, since you > > already have some opinions for the one side, I wanted to share some food > > for thought for the other side of the argument. :-) > > > > Best, > > Michael > > > > > > > > > > > > On Fri, Mar 24, 2017 at 1:25 PM, Jon Yeargers <jon.yearg...@cedexis.com> > > wrote: > > > > > If I understand this correctly: assuming I have a simple aggregator > > > distributed across n-docker instances each instance will _also_ need to > > > support some sort of communications process for allowing access to its > > > statestore (last param from KStream.groupby.aggregate). > > > > > > How would one go about substituting a separated db (EG redis) for the > > > statestore? > > > > > > Some advantages to decoupling: > > > - It would seem like having a centralized process like this would > > alleviate > > > the need to execute multiple requests for a given kv pair (IE "who has > > this > > > data?" and subsequent requests to retrieve it). > > > - it would take some pressure off of each node to maintain a large disk > > > store > > > - Typical microservices would separate storing / retrieving data > > > - It would raise some eyebrows if a spec called for a mysql/nosql > > instance > > > to be installed with every docker container > > > - The tombstoning facilities of redis or C* would lend themselves well > to > > > implementing a 'true' rolling aggregation > > > > > > I get that RocksDB has a small footprint but given the choice of > > > implementing my own RPC / gossip-like process for data sharing and > using > > a > > > well tested one (ala C* or redis) I would almost always opt for the > > latter. > > > (Footnote: Our implementations already heavily use redis/memcached for > > > deduplication of kafka messages so it would seem a small step to use > the > > > same to store aggregation results.) > > > > > > Just my $0.02. I would love to hear why Im missing the 'big picture'. > The > > > kstreams architecture seems rife with potential. > > > > > > On Thu, Mar 23, 2017 at 3:17 PM, Matthias J. Sax < > matth...@confluent.io> > > > wrote: > > > > > > > The config does not "do" anything. It's metadata that get's > broadcasted > > > > to other Streams instances for IQ feature. > > > > > > > > See this blog post for more details: > > > > https://www.confluent.io/blog/unifying-stream-processing- > > > > and-interactive-queries-in-apache-kafka/ > > > > > > > > Happy to answer any follow up question. > > > > > > > > > > > > -Matthias > > > > > > > > On 3/23/17 11:51 AM, Jon Yeargers wrote: > > > > > What does this config param do? > > > > > > > > > > I see it referenced / used in some samples and here ( > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > 67%3A+Queryable+state+for+Kafka+Streams > > > > > ) > > > > > > > > > > > > > > > > > > >