You might find what you want by looking at how Kafka is used by Samza: http://samza.apache.org/
On Mon, Mar 14, 2016 at 10:34 AM Daniel Schierbeck <da...@zendesk.com.invalid> wrote:

> Partitions being limited by disk size is no different from e.g. a SQL
> store. This would not be used for extremely high throughput. If,
> eventually, there were a good case for not requiring that an entire
> partition be stored on a single machine, it would be possible to use the
> log segments for distribution.
>
> On Mon, Mar 14, 2016 at 9:29 AM Giidox <a...@marmelandia.com> wrote:
>
> > I would like to read an answer to this question as well. I am planning
> > a similar architecture. Dealing with a secondary data store for old
> > messages would indeed make things complicated.
> >
> > Clark Haskins wrote that partition size is limited by machine capacity
> > (I assume disk space):
> > https://mail-archives.apache.org/mod_mbox/kafka-users/201504.mbox/%3ce7b3c4a4-bb72-43f2-8848-9e09d0dcb...@kafka.guru%3E
> > So in theory one could grow a single partition to terabyte scale. But
> > don’t take my word for it, as I have not tried it.
> >
> > Cheers, Giidox
> >
> > > On 09 Mar 2016, at 15:10, Daniel Schierbeck <da...@zendesk.com.INVALID> wrote:
> > >
> > > I'm considering an architecture where Kafka acts as the primary
> > > datastore, with infinite retention of messages. The messages in this
> > > case will be domain events that must not be lost. Different
> > > downstream consumers would ingest the events and build up various
> > > views on them, e.g. aggregated stats, indexes by various properties,
> > > full-text search, etc.
> > > The important bit is that I'd like to avoid having a separate
> > > datastore for long-term archival of events, since:
> > >
> > > 1) I want to make it easy to spin up new materialized views based on
> > > past events, and only having to deal with Kafka is simpler.
> > > 2) Instead of having some sort of two-phase import process where I
> > > need to first import historical data and then switch over to the
> > > Kafka topics, I'd rather just start from offset 0 in the Kafka
> > > topics.
> > > 3) I'd like to be able to use standard tooling where possible, and
> > > most tools for ingesting events into e.g. Spark Streaming would be
> > > difficult to use unless all the data was in Kafka.
> > >
> > > I'd like to know if anyone here has tried this use case. Based on
> > > the presentations by Jay Kreps and Martin Kleppmann I would expect
> > > that someone has actually implemented some of the ideas they've been
> > > pushing. I'd also like to know what sort of problems Kafka would
> > > pose for long-term storage – would I need special storage nodes, or
> > > would replication be sufficient to ensure durability?
> > >
> > > Daniel Schierbeck
> > > Senior Staff Engineer, Zendesk
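The replay-from-offset-0 idea in the thread above can be sketched without a broker. In this minimal Python sketch a plain list stands in for a single Kafka partition (offsets are list indices), and the event types and field names are made up for illustration; in a real deployment you would consume the topic from the earliest offset instead:

```python
# Sketch: rebuilding a materialized view by replaying an event log from
# offset 0. A plain list stands in for one Kafka partition; each event's
# offset is simply its index in the list.

from collections import defaultdict

# The "partition": an append-only log of domain events (hypothetical schema).
event_log = [
    {"type": "ticket_created", "ticket_id": 1},
    {"type": "ticket_created", "ticket_id": 2},
    {"type": "ticket_closed",  "ticket_id": 1},
]

def build_view(log, from_offset=0):
    """Replay every event from the given offset and build aggregated stats.

    Spinning up a new materialized view is just calling this again with
    from_offset=0 -- no separate archival store is consulted.
    """
    stats = defaultdict(int)
    for offset in range(from_offset, len(log)):
        stats[log[offset]["type"]] += 1
    return dict(stats)

view = build_view(event_log)
print(view)  # aggregated counts per event type, derived purely from the log
```

Against an actual cluster the equivalent step is a consumer starting from the earliest offset (`auto.offset.reset=earliest` in the 0.9+ consumer), with deletion effectively disabled on the topic (`retention.ms=-1` and `retention.bytes=-1` turn off time- and size-based retention); durability would then come from the topic's replication factor rather than from special storage nodes.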