I too would be interested in any responses to this question.

I'm using Kafka for event notification; once it's secured, I'll put the real
payload in the messages and take advantage of the durable commit log. I want
to let users describe a DAG in OrientDB and have the Kafka client processor
load and execute it. Each processor would then attach its lineage and
provenance back to the OrientDB graph store.
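
A rough sketch of that processor loop, assuming the new Java consumer API;
DagStore is a hypothetical stand-in for whatever OrientDB graph calls we
end up using, and the group/topic names are made up:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Hypothetical stand-in for the OrientDB-backed DAG store; real code
    // would use the OrientDB graph API to load the DAG and write lineage.
    class DagStore {
        static Runnable loadDag(String eventKey) { return () -> {}; }
        static void recordProvenance(String eventKey, long offset) {}
    }

    public class DagProcessor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "dag-processors"); // hypothetical group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("events")); // hypothetical topic

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // Load and run the user-defined DAG for this event,
                    // then write lineage/provenance back to the graph.
                    DagStore.loadDag(record.key()).run();
                    DagStore.recordProvenance(record.key(), record.offset());
                }
            }
        }
    }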

This way I can let users replay stress scenarios, calculate VaR, etc.,
against one source of replayable truth. Compliance and regulatory
authorities like this.
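
For the replay side, something like this (0.9 consumer API; again the topic
name is made up) would rewind to the start of the retained log and re-drive
a scenario from the same source of truth:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ScenarioReplay {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("enable.auto.commit", "false"); // don't move any group's offsets
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            TopicPartition tp = new TopicPartition("events", 0); // hypothetical topic
            consumer.assign(Arrays.asList(tp));
            consumer.seekToBeginning(tp); // rewind to the start of the log

            ConsumerRecords<String, String> records;
            do {
                records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // feed each historical event back through the stress/VaR run
                    System.out.println(record.offset() + ": " + record.value());
                }
            } while (!records.isEmpty());
            consumer.close();
        }
    }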

Regards

Andrew
________________________________
From: Alex Buchanan <bucha...@gmail.com>
Sent: 31/10/2015 05:30
To: users@kafka.apache.org
Subject: Topic per entity

Hey Kafka community.

I'm researching a possible architecture for a distributed data processing
system. In this system, there's a close relationship between a specific
dataset and the processing code. A user might upload a few datasets and
write code to run analysis on that data. In other words, the analysis code
frequently pulls data from one specific entity.

Kafka is attractive for lots of reasons:
- I'll need messaging anyway
- I want a model for immutability of data (mutable state and potential job
failure don't mix)
- cross-language clients
- the change stream concept could have some nice uses (such as updating
visualizations without rebuilding)
- Samza's model of state management is a simple way to think about external
data without worrying too much about network-based RPC
- as a source-of-truth data store, it's really simple. No mutability, no
complex queries, etc. Just a log. To me, that helps prevent abuse and
mistakes.
- it fits well with the concept of pipes, frequently found in data analysis

But most of the Kafka examples are about processing a large stream of a
specific _type_, not so much about processing specific entities. And I
understand there are limits on the number of topics (file/node limits on the
filesystem and in ZooKeeper), and that modeling topics on characteristics of
the data is discouraged. In this system, though, it feels more natural to
have a topic per entity so the processing code can connect directly to the
data it wants.
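
To make that concrete, here's roughly what I'm picturing: one topic per
uploaded dataset (the naming scheme below is made up):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DatasetWriter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // One topic per uploaded dataset, e.g. "dataset-42"; an analysis
            // job then subscribes directly to the entity it cares about.
            String datasetId = "42";
            producer.send(new ProducerRecord<>("dataset-" + datasetId,
                    "row-key", "row-value"));
            producer.close();
        }
    }

The worry, of course, is whether this falls over once users have created
thousands of datasets, and therefore thousands of topics.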

So I need a little guidance from smart people. Am I lost down a rabbit
hole? Maybe I'm trying to force Kafka into territory it's not suited for.
Have I been reading too many (awesome) articles about the role of the log
and streaming in distributed computing? Or am I on the right track and just
need to put in some work to jump the hurdles (such as topic storage and
coordination)?

It sounds like Cassandra might be another good option, but I don't know
much about it yet.

Thanks guys!
