Hi Alex & Andrew,

There was a discussion with some pointers on this on this mailing list a while back, titled "mapping events to topics". I suggest taking a look at that thread:
http://search-hadoop.com/m/uyzND1vJsUuYtGD91/mapping+events+to+topics&subj=mapping+events+to+topics
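If it helps, here is a rough sketch of one common alternative to topic-per-entity: a single shared topic with the entity ID as the message key, so every event for a given entity lands on the same partition, in order. This is just an illustration, not something lifted from that thread; the topic name, entity ID, and broker address below are all placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EntityKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // One shared topic for all datasets; the entity ID is the message
            // key. Kafka hashes the key to choose a partition, so all events
            // for "dataset-42" stay ordered on the same partition.
            String entityId = "dataset-42"; // hypothetical entity
            String event = "{\"op\":\"append\",\"rows\":100}";
            producer.send(new ProducerRecord<>("entity-events", entityId, event));
        }
    }
}

A consumer interested in a single entity can then filter on the key, and you sidestep the per-topic filesystem and ZooKeeper limits mentioned below.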
If you still have questions, don't hesitate to ask.

Thanks,
Grant

On Sat, Oct 31, 2015 at 3:19 AM, Andrew Stevenson <asteven...@outlook.com> wrote:
> I too would be interested in any responses to this question.
>
> I'm using Kafka for event notification and, once it's secured, I will put the
> real payload in it and take advantage of the durable commit log. I want to let
> users describe a DAG in OrientDB and have the Kafka client processor load
> and execute it. Each processor would then attach its lineage and
> provenance back to OrientDB's graph store.
>
> This way I can let users replay stress scenarios, calculate VaR, etc., with
> one source of replayable truth. Compliance and regulatory authorities like
> this.
>
> Regards
>
> Andrew
> ________________________________
> From: Alex Buchanan <bucha...@gmail.com>
> Sent: 31/10/2015 05:30
> To: users@kafka.apache.org
> Subject: Topic per entity
>
> Hey Kafka community.
>
> I'm researching a possible architecture for a distributed data processing
> system. In this system, there's a close relationship between a specific
> dataset and the processing code. The user might upload a few datasets and
> write code to run analysis on that data. In other words, the analysis code
> frequently pulls data from a specific entity.
>
> Kafka is attractive for lots of reasons:
> - I'll need messaging anyway.
> - I want a model of immutable data (mutable state and potential job
>   failure don't mix).
> - Cross-language clients.
> - The change-stream concept could have some nice uses (such as updating
>   visualizations without rebuilding).
> - Samza's model of state management is a simple way to think about external
>   data without worrying too much about network-based RPC.
> - As a source-of-truth data store, it's really simple: no mutability, no
>   complex queries, etc. Just a log. To me, that helps prevent abuse and
>   mistakes.
> - It fits well with the concept of pipes, frequently found in data analysis.
>
> But most of the Kafka examples are about processing a large stream of a
> specific _type_, not so much about processing specific entities. And I
> understand there are limits on the number of topics (file/node limits on the
> filesystem and in ZooKeeper) and it's discouraged to model topics based on
> characteristics of the data. In this system, it feels more natural to have a
> topic per entity so the processing code can connect directly to the data it
> wants.
>
> So I need a little guidance from smart people. Am I lost in the rabbit
> hole? Maybe I'm trying to force Kafka into territory it's not suited
> for. Have I been reading too many (awesome) articles about the role of the
> log and streaming in distributed computing? Or am I on the right track, and
> do I just need to put in some work to jump the hurdles (such as topic
> storage and coordination)?
>
> It sounds like Cassandra might be another good option, but I don't know
> much about it yet.
>
> Thanks guys!

--
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke