Yes, agreed -- the re-thinking pre-existing notions is a big part of such
conversations.  A bit like making the mental switch from object-oriented
programming to functional programming -- and, just like in this case,
neither is more "right" than the other.  Personal
opinion/preference/context matters a lot, hence I tried to carefully phrase
my answer in a way that it doesn't come across as potentially
indoctrinating. ;-)

On Fri, Mar 24, 2017 at 6:34 PM, Jon Yeargers <jon.yearg...@cedexis.com>
wrote:

> You make some great cases for your architecture. To be clear - Ive been
> proselytizing for kafka since I joined this company last year. I think my
> largest issue is rethinking some preexisting notions about streaming to
> make them work in the kstream universe.
>
> On Fri, Mar 24, 2017 at 6:07 AM, Michael Noll <mich...@confluent.io>
> wrote:
>
> > > If I understand this correctly: assuming I have a simple aggregator
> > > distributed across n-docker instances each instance will _also_ need to
> > > support some sort of communications process for allowing access to its
> > > statestore (last param from KStream.groupby.aggregate).
> >
> > Yes.
> >
> > See
> > http://docs.confluent.io/current/streams/developer-
> > guide.html#your-application-and-interactive-queries
> > .
> >
> > > - The tombstoning facilities of redis or C* would lend themselves well
> to
> > > implementing a 'true' rolling aggregation
> >
> > What is a 'true' rolling aggregation, and how could Redis or C* help with
> > that in a way that Kafka can't?  (Honest question.)
> >
> >
> > > I get that RocksDB has a small footprint but given the choice of
> > > implementing my own RPC / gossip-like process for data sharing and
> using
> > a
> > > well tested one (ala C* or redis) I would almost always opt for the
> > latter.
> > > [...]
> > > Just my $0.02. I would love to hear why Im missing the 'big picture'.
> The
> > > kstreams architecture seems rife with potential.
> >
> > One question is, for example:  can the remote/central DB of your choice
> > (Redis, C*) handle as many qps as Kafka/Kafka Streams/RocksDB can handle?
> > Over the network?  At the same low latency?  Also, what happens if the
> > remote DB is unavailable?  Do you wait and retry?  Discard?  Accept the
> > fact that your app's processing latency will now go through the roof?  I
> > wrote about some such scenarios at
> > https://www.confluent.io/blog/distributed-real-time-joins-
> > and-aggregations-on-user-activity-events-using-kafka-streams/
> > .
> >
> > One big advantage (for many use cases, not for) with Kafka/Kafka Streams
> is
> > that you can leverage fault-tolerant *local* state that may also be
> > distributed across app instances.  Local state is much more efficient and
> > faster when doing stateful processing such as joins or aggregations.  You
> > don't need to worry about an external system, whether it's up and
> running,
> > whether its version is still compatible with your app, whether it can
> scale
> > as much as your app/Kafka Streams/Kafka/the volume of your input data.
> >
> > Also, note that some users have actually opted to run hybrid setups:
> Some
> > processing output is sent to a remote data store like Cassandra (e.g. via
> > Kafka Connect), some processing output is exposed directly through
> > interactive queries.  It's not like your forced to pick only one
> approach.
> >
> >
> > > - Typical microservices would separate storing / retrieving data
> >
> > I'd rather argue that for microservices you'd oftentimes prefer to *not*
> > use a remote DB, and rather do everything inside your microservice
> whatever
> > the microservice needs to do (perhaps we could relax this to "do
> everything
> > in a way that your microservices is in full, exclusive control", i.e. it
> > doesn't necessarily need to be *inside*, but arguably it would be better
> if
> > it actually is).
> > See e.g. the article
> > https://www.confluent.io/blog/data-dichotomy-rethinking-the-
> > way-we-treat-data-and-services/
> > that lists some of the reasoning behind this school of thinking.  Again,
> > YMMV.
> >
> > Personally, I think there's no simple true/false here.  The decisions
> > depend on what you need, what your context is, etc.  Anyways, since you
> > already have some opinions for the one side, I wanted to share some food
> > for thought for the other side of the argument. :-)
> >
> > Best,
> > Michael
> >
> >
> >
> >
> >
> > On Fri, Mar 24, 2017 at 1:25 PM, Jon Yeargers <jon.yearg...@cedexis.com>
> > wrote:
> >
> > > If I understand this correctly: assuming I have a simple aggregator
> > > distributed across n-docker instances each instance will _also_ need to
> > > support some sort of communications process for allowing access to its
> > > statestore (last param from KStream.groupby.aggregate).
> > >
> > > How would one go about substituting a separated db (EG redis) for the
> > > statestore?
> > >
> > > Some advantages to decoupling:
> > > - It would seem like having a centralized process like this would
> > alleviate
> > > the need to execute multiple requests for a given kv pair (IE "who has
> > this
> > > data?" and subsequent requests to retrieve it).
> > > - it would take some pressure off of each node to maintain a large disk
> > > store
> > > - Typical microservices would separate storing / retrieving data
> > > - It would raise some eyebrows if a spec called for a mysql/nosql
> > instance
> > > to be installed with every docker container
> > > - The tombstoning facilities of redis or C* would lend themselves well
> to
> > > implementing a 'true' rolling aggregation
> > >
> > > I get that RocksDB has a small footprint but given the choice of
> > > implementing my own RPC / gossip-like process for data sharing and
> using
> > a
> > > well tested one (ala C* or redis) I would almost always opt for the
> > latter.
> > > (Footnote: Our implementations already heavily use redis/memcached for
> > > deduplication of kafka messages so it would seem a small step to use
> the
> > > same to store aggregation results.)
> > >
> > > Just my $0.02. I would love to hear why Im missing the 'big picture'.
> The
> > > kstreams architecture seems rife with potential.
> > >
> > > On Thu, Mar 23, 2017 at 3:17 PM, Matthias J. Sax <
> matth...@confluent.io>
> > > wrote:
> > >
> > > > The config does not "do" anything. It's metadata that get's
> broadcasted
> > > > to other Streams instances for IQ feature.
> > > >
> > > > See this blog post for more details:
> > > > https://www.confluent.io/blog/unifying-stream-processing-
> > > > and-interactive-queries-in-apache-kafka/
> > > >
> > > > Happy to answer any follow up question.
> > > >
> > > >
> > > > -Matthias
> > > >
> > > > On 3/23/17 11:51 AM, Jon Yeargers wrote:
> > > > > What does this config param do?
> > > > >
> > > > > I see it referenced / used in some samples and here (
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 67%3A+Queryable+state+for+Kafka+Streams
> > > > > )
> > > > >
> > > >
> > > >
> > >
> >
>

Reply via email to