Hey Kafka community. I'm researching a possible architecture for a distributed data processing system. In this system there's a close relationship between a specific dataset and the processing code: a user might upload a few datasets and write code to run analysis on that data. In other words, the analysis code frequently pulls data from a specific entity.
Kafka is attractive for lots of reasons:

- I'll need messaging anyway.
- I want a model for immutability of data (mutable state and potential job failure don't mix).
- Cross-language clients.
- The change stream concept could have some nice uses, such as updating visualizations without rebuilding them.
- Samza's model of state management is a simple way to think about external data without worrying too much about network-based RPC.
- As a source-of-truth data store, it's really simple: no mutability, no complex queries, just a log. To me, that helps prevent abuse and mistakes.
- It fits well with the concept of pipes, which comes up frequently in data analysis.

But most of the Kafka examples are about processing a large stream of a specific _type_, not so much about processing specific entities. I also understand there are limits on the number of topics (file/node limits on the filesystem and in ZooKeeper) and that modeling topics around characteristics of the data is discouraged. In this system, though, it feels more natural to have a topic per entity so the processing code can connect directly to the data it wants (rough sketch of what I'm picturing at the end of this mail).

So I need a little guidance from smart people. Am I too far down the rabbit hole? Maybe I'm trying to force Kafka into territory it's not suited for. Have I been reading too many (awesome) articles about the role of the log and streaming in distributed computing? Or am I on the right track, and I just need to put in some work to jump the hurdles (such as topic storage and coordination)? It sounds like Cassandra might be another good option, but I don't know much about it yet.

Thanks guys!
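To make the topic-per-entity idea concrete, here's roughly what I'm picturing, sketched with the plain Java kafka-clients consumer API. The dataset-<id> topic naming and the analyze() hook are placeholders I made up for illustration, not a settled design:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EntityAnalysisJob {
    public static void main(String[] args) {
        // Placeholder convention: one topic per uploaded dataset, e.g. "dataset-42".
        String datasetId = args[0];
        String entityTopic = "dataset-" + datasetId;

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analysis-" + datasetId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The analysis job attaches directly to "its" dataset's log
            // and replays/follows it from the consumer group's offset.
            consumer.subscribe(Collections.singletonList(entityTopic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    analyze(record.value());
                }
            }
        }
    }

    private static void analyze(String row) {
        // Stand-in for the user-supplied analysis code.
        System.out.println(row);
    }
}

The idea being: each uploaded dataset gets its own immutable log, and a user's analysis job simply subscribes to the one topic it cares about, which is exactly where I run into the "don't create a topic per data characteristic" advice.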