Please create a JIRA with your thoughts. I'd be happy to help out with something like that.
On Tue, May 17, 2016 at 4:57 PM, Radoslaw Gruchalski <ra...@gruchalski.com> wrote:

> Not as far as I'm aware. I'd be happy to contribute if there is a desire
> to have such a feature. From experience with other projects, I know that
> without the initial pitch / discussion, it can be difficult to get such a
> feature in. I can create a JIRA in the morning; no electricity again
> tonight :-/

On Tue, May 17, 2016 at 4:53 PM, Christian Posta <christian.po...@gmail.com> wrote:

> +1 to your solution of log.cleanup.policy. Other brokers (e.g. ActiveMQ)
> have a feature like that.
> Is there a JIRA for this?

On Tue, May 17, 2016 at 4:48 PM, Radoslaw Gruchalski wrote:

> I have described a cold storage solution for Kafka:
> https://medium.com/@rad_g/the-case-for-kafka-cold-storage-32929d0a57b2#.kf0jf8cwv
> Also described it here a couple of times. The potential solution seems
> rather straightforward.

From: Luke Steensen
Sent: Tuesday, May 17, 2016 11:22 PM
Subject: Re: Kafka for event sourcing architecture

> It's harder in Kafka because the unit of replication is an entire
> partition, not a single key/value pair. Partitions are large and
> constantly growing, whereas key/value pairs are typically much smaller
> and don't change in size. There would theoretically be no difference if
> you had one partition per key, but that's not practical. Instead, you end
> up trying to pick a number of partitions big enough that they'll each be
> a reasonable size for the foreseeable future, but not so big that the
> cluster overhead is untenable. Even then, the clock is ticking towards
> the day your biggest partition approaches the limit of storage available
> on a single machine.
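The storage arithmetic behind that warning is simple to make concrete. The sketch below is purely illustrative (function name and throughput numbers are invented, and compression and index overhead are ignored):

```python
def partition_bytes(bytes_per_day, retention_days):
    """Approximate on-disk size of one partition once the
    retention window is full (ignores compression/overhead)."""
    return bytes_per_day * retention_days

# Suppose one partition receives 5 GB/day of writes:
gb_per_day = 5 * 10**9
assert partition_bytes(gb_per_day, 14) == 70 * 10**9    # 14-day retention: 70 GB
# With unbounded retention, size simply tracks elapsed time:
assert partition_bytes(gb_per_day, 365) == 1825 * 10**9  # ~1.8 TB after a year
```

At some point that single partition must still fit on one broker's disks, which is the limit being described.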
> It's frustrating because, as you say, there would be enormous benefits to
> being able to access all data through the same system. Unfortunately, it
> seems too far away from Kafka's original use case to be practical.

On Tue, May 17, 2016 at 12:32 PM, Daniel Schierbeck <da...@zendesk.com.invalid> wrote:

> I'm not sure why Kafka, at least in theory, cannot be used for infinite
> retention – any replicated database system would need to have a new node
> ingest all the data from a failed node from its replicas. Surely this is
> no different in S3 itself. Why is this harder to do in Kafka than in
> other systems? The benefit of having just a single message log system
> would be rather big.

On Tue, May 17, 2016 at 4:44 AM, Tom Crayford wrote:

> Hi Oli,
>
> Inline.
>
> On Tuesday, 17 May 2016, Olivier Lalonde wrote:
>
> > Hi all,
> >
> > I am considering adopting an "event sourcing" architecture for a system
> > I am developing, and Kafka seems like a good choice of store for events.
> >
> > For those who aren't aware, this architecture style consists of storing
> > all state changes of the system as an ordered log of events and building
> > derivative views as needed for easier querying (using a SQL database,
> > for example). Those views must be completely derived from the event log
> > alone, so that the log effectively becomes a "single source of truth".
> >
> > I was wondering if anyone else is using Kafka for that purpose and, more
> > specifically:
> >
> > 1) Can Kafka store messages permanently?
>
> No. Whilst you can tweak config and such to get a very long retention
> period, this doesn't work well with Kafka at all.
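For reference, the retention knobs being discussed are along these lines; values are illustrative only, and the topic name is made up (topic-level `retention.ms` and broker-level `log.retention.ms`, where -1 removes the time limit):

```properties
# server.properties — broker-wide default; -1 = no time-based deletion
log.retention.ms=-1

# or per topic, at creation time:
#   bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic events \
#     --partitions 8 --replication-factor 3 --config retention.ms=-1
```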
> Keeping data around forever has severe impacts on the operability of your
> cluster. For example, if a machine fails, a replacement would have to
> catch up with vast quantities of data from its replicas. Currently we
> (Heroku Kafka) restrict our customers to a maximum of 14 days of
> retention, because of all the operational headaches of more retention
> than that. Of course, on your own cluster you *can* set it as high as you
> like; this is just anecdotal experience from a team that runs thousands
> of clusters - infinite retention is an operational disaster waiting to
> happen.
>
> Whilst Kafka does have a replay mechanism, that should mostly be thought
> of as a mechanism for handling other system failures. E.g. if the
> database you store indexed views in is down, Kafka's replay and retention
> mechanisms mean you're not losing data whilst restoring the availability
> of that database.
>
> What we typically suggest customers do when they ask about this use case
> is to use Kafka as a messaging system, but use e.g. S3 as the long-term
> store. Kafka can help with batching writes up to S3 (see e.g. Pinterest's
> Secor project), and act as a very high throughput, durable, replicated
> messaging layer for communication. In this paradigm, when you want to
> replay, you do so out of S3 until you've consumed the last offset there,
> then start replaying out of and catching up with the small amount of
> remaining data in Kafka. Of course, the replay logic there has to be
> hand-rolled, as Kafka and its clients have no knowledge of external
> stores.
>
> Another potential thing to look at is Kafka's compacted topic mechanism.
> With compacted topics, Kafka keeps the latest element for a given key,
> making it act a little more like a database table. Note that you still
> have to consume by offset here - there's no "get the value for key Y"
> operation. However, this assumes that your keyspace is still tractably
> small, and that you're ok with keeping only the latest value. Compaction
> completely overrides time-based retention, so you have to "delete" keys
> or have a bounded keyspace if you want to retain operational sanity with
> Kafka. I'd recommend reading the docs on compacted topics; they cover the
> use cases quite well.
>
> > 2) Let's say I throw away my derived view and want to re-build it from
> > scratch, is it possible to consume messages from a topic from its very
> > first message and, once it has caught up, listen for new messages like
> > it would normally do?
>
> That's entirely possible: you can catch up from the first retained
> message and then continue from there very easily. However, see above
> about infinite retention.
>
> > 2) Does it support transactions? Let's say I want to push 3 messages
> > atomically but the producer process crashes after sending only 2
> > messages, is it possible to "rollback" the first 2 messages (e.g. "all
> > or nothing" semantics)?
>
> No. Kafka at the moment only supports "at least once" semantics, and
> there are no cross-broker transactions of any kind. Implementing such a
> thing would likely have huge negative impacts on the current performance
> characteristics of Kafka, which would be an issue for many users.
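One workaround sometimes used in place of transactions (it is not a Kafka feature, and all names below are invented): since a single record is appended atomically, related events can be packed into one envelope record and unpacked by the consumer, giving all-or-nothing visibility for that group. A minimal sketch:

```python
import json

def pack_envelope(events):
    """Pack related events into one payload, so they are appended
    atomically as a single Kafka record (all-or-nothing)."""
    return json.dumps({"events": events}).encode("utf-8")

def unpack_envelope(payload):
    """Consumer side: recover the individual events from one record."""
    return json.loads(payload.decode("utf-8"))["events"]

# The producer would send pack_envelope([...]) as a single record; a crash
# before the send means none of the three events becomes visible.
batch = [{"type": "created", "id": 1},
         {"type": "updated", "id": 1},
         {"type": "archived", "id": 1}]
payload = pack_envelope(batch)
assert unpack_envelope(payload) == batch
```

This only helps when the events can share one partition and fit within the broker's message size limit.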
> > 3) Does it support request/response style semantics or can they be
> > simulated? My system's primary interface with the outside world is an
> > HTTP API, so it would be nice if I could publish an event and wait for
> > all the internal services which need to process the event to be "done"
> > processing before returning a response.
>
> In theory that's possible - the producer can return the offset of the
> message produced, and you could check the latest offset of each consumer
> in your web request handler.
>
> However, doing so is not going to work that well, unless you're ok with
> your web requests taking on the order of seconds to tens of seconds to
> fulfill. Kafka can do low-latency messaging reasonably well, but
> coordinating the offsets of many consumers would likely have a huge
> latency impact. Writing the code for it and getting it handling failure
> correctly would likely be a lot of work (there's nothing in any of the
> client libraries like this, because it is not a desirable or supported
> use case).
>
> Instead, I'd like to query *why* you need those semantics. What's the
> issue with just producing a message, telling the user HTTP 200, and
> consuming it later?
>
> > PS: I'm a Node.js/Go developer, so when possible please avoid
> > Java-centric terminology.
>
> Please note that the Node and Go clients are notably less mature than the
> JVM clients, and that running Kafka in production means knowing enough
> about the JVM and Zookeeper to handle that.
>
> Thanks!
> Tom Crayford
> Heroku Kafka

> > Thanks!
> > - Oli
> >
> > --
> > Olivier Lalonde
> > http://www.syskall.com  <-- connect with me!

--
*Christian Posta*
twitter: @christianposta
http://www.christianposta.com/blog
http://fabric8.io
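The offset-checking approach Tom describes for request/response (compare the offset returned by the produce call with each downstream consumer group's committed offset) reduces to a comparison like the one below. This is a broker-free sketch with invented names, and it elides the polling loop and timeout a real request handler would need:

```python
def all_consumers_caught_up(produced_offset, committed_offsets):
    """True once every consumer group has committed past the offset
    returned by the produce call for this request's event.

    A committed offset is the *next* offset a group will read, so a
    group has processed `produced_offset` once committed > produced.
    """
    return all(off > produced_offset for off in committed_offsets.values())

# The produce call returned offset 41; two downstream consumer groups.
committed = {"indexer": 42, "mailer": 40}
assert not all_consumers_caught_up(41, committed)  # mailer not done yet

committed["mailer"] = 42                           # mailer commits past 41
assert all_consumers_caught_up(41, committed)
```

Repeating this check until it succeeds (or a deadline passes) is where the seconds-to-tens-of-seconds of request latency Tom warns about would come from.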