I too would be interested in any responses to this question. I'm using Kafka for event notification, and once it's secured I'll put the real payload in it and take advantage of the durable commit log. I want to let users describe a DAG in OrientDB and have a Kafka client processor load and execute it. Each processor would then attach its lineage and provenance back to the OrientDB graph store.
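Roughly, each processor would be a plain Kafka consumer loop, something like the sketch below. The topic name, the process() body, and the writeLineage() helper are all illustrative stand-ins (the helper would do the actual OrientDB write), not working code:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DagProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dag-processor");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Execute the DAG node this processor is responsible for.
                    String output = process(record.value());
                    // Record provenance: which input (topic/partition/offset)
                    // produced which output.
                    writeLineage(record.topic(), record.partition(), record.offset(), output);
                }
            }
        }
    }

    private static String process(String input) {
        return input; // placeholder for the real DAG node logic
    }

    // Hypothetical helper: would create a lineage vertex/edge in the
    // OrientDB graph store linking input offset to output.
    private static void writeLineage(String topic, int partition, long offset, String output) {
    }
}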
This way I can let users replay stress scenarios, calculate VaR, etc. with one source of replayable truth. Compliance and regulatory authorities like this.

Regards,
Andrew

________________________________
From: Alex Buchanan <bucha...@gmail.com>
Sent: 31/10/2015 05:30
To: users@kafka.apache.org
Subject: Topic per entity

Hey Kafka community.

I'm researching a possible architecture for a distributed data processing system. In this system, there's a close relationship between a specific dataset and the processing code. The user might upload a few datasets and write code to run analysis on that data. In other words, the analysis code frequently pulls data from a specific entity.

Kafka is attractive for lots of reasons:
- I'll need messaging anyway
- I want a model for immutability of data (mutable state and potential job failure don't mix)
- cross-language clients
- the change stream concept could have some nice uses (such as updating visualizations without rebuilding)
- Samza's model of state management is a simple way to think about external data without worrying too much about network-based RPC
- as a source-of-truth data store, it's really simple. No mutability, complex queries, etc. Just a log. To me, that helps prevent abuse and mistakes.
- it fits well with the concept of pipes, frequently found in data analysis

But most of the Kafka examples are about processing a large stream of a specific _type_, not so much about processing specific entities. And I understand there are limits on the number of topics (file/node limits on the filesystem and in ZooKeeper), and it's discouraged to model topics based on characteristics of the data. In this system, it feels more natural to have a topic per entity so the processing code can connect directly to the data it wants.

So I need a little guidance from smart people. Am I lost in the rabbit hole? Maybe I'm trying to force Kafka into territory it's not suited for. Have I been reading too many (awesome) articles about the role of the log and streaming in distributed computing? Or am I on the right track and I just need to put in some work to jump the hurdles (such as topic storage and coordination)?

It sounds like Cassandra might be another good option, but I don't know much about it yet.

Thanks guys!
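For what it's worth, the standard alternative to a topic per entity is a single topic with records keyed by entity ID: Kafka's default partitioner hashes the key, so each entity's records land on one partition and keep their order, without creating thousands of topics. A minimal sketch (the topic and entity names here are made up):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EntityKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records for one entity share a key, so they hash to the
            // same partition and stay ordered -- no topic per entity needed.
            producer.send(new ProducerRecord<>("datasets", "entity-42", "row 1"));
            producer.send(new ProducerRecord<>("datasets", "entity-42", "row 2"));
            producer.send(new ProducerRecord<>("datasets", "entity-7", "row 1"));
        }
    }
}

Consumers then read the topic and filter on the key they care about, and log compaction can keep the latest record per key if the topic doubles as a store.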