Please create a JIRA with your thoughts. I'd be happy to help out with something like that.
On Tue, May 17, 2016 at 4:57 PM, Radoslaw Gruchalski <ra...@gruchalski.com> wrote:

> Not as far as I'm aware. I'd be happy to contribute if there is a desire
> to have such a feature. From experience with other projects, I know that
> without the initial pitch / discussion, it can be difficult to get such a
> feature in. I can create a JIRA in the morning; no electricity again
> tonight :-/

On Tue, May 17, 2016 at 4:53 PM, Christian Posta <christian.po...@gmail.com> wrote:

> +1 to your solution of log.cleanup.policy. Other brokers (e.g. ActiveMQ)
> have a feature like that.
> Is there a JIRA for this?

On Tue, May 17, 2016 at 4:48 PM, Radoslaw Gruchalski wrote:

> I have described a cold storage solution for Kafka:
> https://medium.com/@rad_g/the-case-for-kafka-cold-storage-32929d0a57b2#.kf0jf8cwv
> Also described it here a couple of times. The potential solution seems
> rather straightforward.

From: Luke Steensen
Sent: Tuesday, May 17, 2016 11:22 PM
Subject: Re: Kafka for event sourcing architecture

> It's harder in Kafka because the unit of replication is an entire
> partition, not a single key/value pair. Partitions are large and
> constantly growing, whereas key/value pairs are typically much smaller
> and don't change in size. There would theoretically be no difference if
> you had one partition per key, but that's not practical. Instead, you end
> up trying to pick a number of partitions big enough that they'll each be
> a reasonable size for the foreseeable future, but not so big that the
> cluster overhead is untenable. Even then, the clock is ticking towards
> the day your biggest partition approaches the limit of storage available
> on a single machine.
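The storage arithmetic behind that warning is simple to make concrete. The sketch below is purely illustrative (function name and throughput numbers are invented, and compression and index overhead are ignored):

```python
def partition_bytes(bytes_per_day, retention_days):
    """Approximate on-disk size of one partition once the
    retention window is full (ignores compression/overhead)."""
    return bytes_per_day * retention_days

# Suppose one partition receives 5 GB/day of writes:
gb_per_day = 5 * 10**9
assert partition_bytes(gb_per_day, 14) == 70 * 10**9    # 14-day retention: 70 GB
# With unbounded retention, size simply tracks elapsed time:
assert partition_bytes(gb_per_day, 365) == 1825 * 10**9  # ~1.8 TB after a year
```

At some point that single partition must still fit on one broker's disks, which is the limit being described.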
> It's frustrating because, as you say, there would be enormous benefits to
> being able to access all data through the same system. Unfortunately, it
> seems too far away from Kafka's original use case to be practical.

On Tue, May 17, 2016 at 12:32 PM, Daniel Schierbeck <da...@zendesk.com.invalid> wrote:

> I'm not sure why Kafka, at least in theory, cannot be used for infinite
> retention – any replicated database system would need to have a new node
> ingest all the data from a failed node from its replicas. Surely this is
> no different in S3 itself. Why is this harder to do in Kafka than in
> other systems? The benefit of having just a single message log system
> would be rather big.

On Tue, May 17, 2016 at 4:44 AM, Tom Crayford wrote:

> Hi Oli,
>
> Inline.
>
> On Tuesday, 17 May 2016, Olivier Lalonde wrote:
>
> > Hi all,
> >
> > I am considering adopting an "event sourcing" architecture for a system
> > I am developing, and Kafka seems like a good choice of store for events.
> >
> > For those who aren't aware, this architecture style consists of storing
> > all state changes of the system as an ordered log of events and building
> > derivative views as needed for easier querying (using a SQL database,
> > for example). Those views must be completely derived from the event log
> > alone, so that the log effectively becomes a "single source of truth".
> >
> > I was wondering if anyone else is using Kafka for that purpose and, more
> > specifically:
> >
> > 1) Can Kafka store messages permanently?
>
> No. Whilst you can tweak config and such to get a very long retention
> period, this doesn't work well with Kafka at all.
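For reference, the retention knobs being discussed are along these lines; values are illustrative only, and the topic name is made up (topic-level `retention.ms` and broker-level `log.retention.ms`, where -1 removes the time limit):

```properties
# server.properties — broker-wide default; -1 = no time-based deletion
log.retention.ms=-1

# or per topic, at creation time:
#   bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic events \
#     --partitions 8 --replication-factor 3 --config retention.ms=-1
```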
> Keeping data around forever has severe impacts on the operability of your
> cluster. For example, if a machine fails, a replacement would have to
> catch up with vast quantities of data from its replicas. Currently we
> (Heroku Kafka) restrict our customers to a maximum of 14 days of
> retention, because of all the operational headaches of more retention
> than that. Of course, on your own cluster you *can* set it as high as you
> like; this is just anecdotal experience from a team that runs thousands
> of clusters - infinite retention is an operational disaster waiting to
> happen.
>
> Whilst Kafka does have a replay mechanism, that should mostly be thought
> of as a mechanism for handling other system failures. E.g. if the
> database you store indexed views in is down, Kafka's replay and retention
> mechanisms mean you're not losing data whilst restoring the availability
> of that database.
>
> What we typically suggest customers do when they ask about this use case
> is to use Kafka as a messaging system, but use e.g. S3 as the long-term
> store. Kafka can help with batching writes up to S3 (see e.g. Pinterest's
> Secor project), and act as a very high throughput, durable, replicated
> messaging layer for communication. In this paradigm, when you want to
> replay, you do so out of S3 until you've consumed the last offset there,
> then start replaying out of and catching up with the small amount of
> remaining data in Kafka. Of course, the replay logic there has to be
> hand-rolled, as Kafka and its clients have no knowledge of external
> stores.
>
> Another potential thing to look at is Kafka's compacted topic mechanism.
> With compacted topics, Kafka keeps the latest element for a given key,
> making it act a little more like a database table. Note that you still
> have to consume by offset here - there's no "get the value for key Y"
> operation. However, this assumes that your keyspace is still tractably
> small, and that you're ok with keeping only the latest value. Compaction
> completely overrides time-based retention, so you have to "delete" keys
> or have a bounded keyspace if you want to retain operational sanity with
> Kafka. I'd recommend reading the docs on compacted topics; they cover the
> use cases quite well.
>
> > 2) Let's say I throw away my derived view and want to re-build it from
> > scratch, is it possible to consume messages from a topic from its very
> > first message and, once it has caught up, listen for new messages like
> > it would normally do?
>
> That's entirely possible: you can catch up from the first retained
> message and then continue from there very easily. However, see above
> about infinite retention.
>
> > 2) Does it support transactions? Let's say I want to push 3 messages
> > atomically but the producer process crashes after sending only 2
> > messages, is it possible to "rollback" the first 2 messages (e.g. "all
> > or nothing" semantics)?
>
> No. Kafka at the moment only supports "at least once" semantics, and
> there are no cross-broker transactions of any kind. Implementing such a
> thing would likely have huge negative impacts on the current performance
> characteristics of Kafka, which would be an issue for many users.
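One workaround sometimes used in place of transactions (it is not a Kafka feature, and all names below are invented): since a single record is appended atomically, related events can be packed into one envelope record and unpacked by the consumer, giving all-or-nothing visibility for that group. A minimal sketch:

```python
import json

def pack_envelope(events):
    """Pack related events into one payload, so they are appended
    atomically as a single Kafka record (all-or-nothing)."""
    return json.dumps({"events": events}).encode("utf-8")

def unpack_envelope(payload):
    """Consumer side: recover the individual events from one record."""
    return json.loads(payload.decode("utf-8"))["events"]

# The producer would send pack_envelope([...]) as a single record; a crash
# before the send means none of the three events becomes visible.
batch = [{"type": "created", "id": 1},
         {"type": "updated", "id": 1},
         {"type": "archived", "id": 1}]
payload = pack_envelope(batch)
assert unpack_envelope(payload) == batch
```

This only helps when the events can share one partition and fit within the broker's message size limit.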
> > 3) Does it support request/response style semantics or can they be
> > simulated? My system's primary interface with the outside world is an
> > HTTP API, so it would be nice if I could publish an event and wait for
> > all the internal services which need to process the event to be "done"
> > processing before returning a response.
>
> In theory that's possible - the producer can return the offset of the
> message produced, and you could check the latest offset of each consumer
> in your web request handler.
>
> However, doing so is not going to work that well, unless you're ok with
> your web requests taking on the order of seconds to tens of seconds to
> fulfill. Kafka can do low-latency messaging reasonably well, but
> coordinating the offsets of many consumers would likely have a huge
> latency impact. Writing the code for it and getting it handling failure
> correctly would likely be a lot of work (there's nothing in any of the
> client libraries like this, because it is not a desirable or supported
> use case).
>
> Instead, I'd like to query *why* you need those semantics. What's the
> issue with just producing a message, telling the user HTTP 200, and
> consuming it later?
>
> > PS: I'm a Node.js/Go developer, so when possible please avoid
> > Java-centric terminology.
>
> Please note that the Node and Go clients are notably less mature than the
> JVM clients, and that running Kafka in production means knowing enough
> about the JVM and Zookeeper to handle that.
>
> Thanks!
> Tom Crayford
> Heroku Kafka

> > Thanks!
> > - Oli
> >
> > --
> > Olivier Lalonde
> > http://www.syskall.com  <-- connect with me!

--
*Christian Posta*
twitter: @christianposta
http://www.christianposta.com/blog
http://fabric8.io
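The offset-checking approach Tom describes for request/response (compare the offset returned by the produce call with each downstream consumer group's committed offset) reduces to a comparison like the one below. This is a broker-free sketch with invented names, and it elides the polling loop and timeout a real request handler would need:

```python
def all_consumers_caught_up(produced_offset, committed_offsets):
    """True once every consumer group has committed past the offset
    returned by the produce call for this request's event.

    A committed offset is the *next* offset a group will read, so a
    group has processed `produced_offset` once committed > produced.
    """
    return all(off > produced_offset for off in committed_offsets.values())

# The produce call returned offset 41; two downstream consumer groups.
committed = {"indexer": 42, "mailer": 40}
assert not all_consumers_caught_up(41, committed)  # mailer not done yet

committed["mailer"] = 42                           # mailer commits past 41
assert all_consumers_caught_up(41, committed)
```

Repeating this check until it succeeds (or a deadline passes) is where the seconds-to-tens-of-seconds of request latency Tom warns about would come from.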