Hi Experts,

In batch computing, there are products like Azkaban or Airflow to manage batch job dependencies. With such a dependency management tool, we can build a large-scale system out of many small jobs. In stream processing, it is not practical to put all the logic into one job, because the job would become too complicated and its state too large. I want to build a large-scale realtime system consisting of many Kafka sources and many streaming jobs, and the first problem I run into is how to build the dependencies and connections between the streaming jobs.

The only method I can think of is a self-implemented retract Kafka sink, so that each streaming job is connected to the next one by a Kafka topic. But because each job may fail and retry, the messages in a topic may, for example, look like this (a small runnable sketch of what I mean is in the P.S. below):

{"retract": "false", "id": "1", "amount": 100}
{"retract": "false", "id": "2", "amount": 200}
{"retract": "true", "id": "1", "amount": 100}
{"retract": "true", "id": "2", "amount": 200}
{"retract": "false", "id": "1", "amount": 100}
{"retract": "false", "id": "2", "amount": 200}

If the topic is "topic_1", the SQL in the downstream job may look like this:

select id, latest(amount)
from topic_1
where retract = 'false'
group by id

But this also builds up a big state, because every distinct id is kept by the GROUP BY. I wonder whether using Kafka to connect streaming jobs is a workable approach, and how to build a large-scale realtime system that consists of many streaming jobs?

Thanks a lot.
Best,
Henry
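
P.S. To make the retract semantics concrete, here is a minimal, engine-agnostic sketch in plain Python (no real Kafka consumer, just the sample messages from topic_1 above, and `latest` is only an illustrative in-memory dict, not an API of any engine). It shows how a downstream job would apply the retract flag and keep the latest amount per id; the dict stands in for the keyed state that I am worried will keep growing:

import json

# Sample records as they would appear in topic_1 after a failure and retry:
# insert, insert, retract, retract, then the two inserts replayed.
messages = [
    '{"retract": "false", "id": "1", "amount": 100}',
    '{"retract": "false", "id": "2", "amount": 200}',
    '{"retract": "true", "id": "1", "amount": 100}',
    '{"retract": "true", "id": "2", "amount": 200}',
    '{"retract": "false", "id": "1", "amount": 100}',
    '{"retract": "false", "id": "2", "amount": 200}',
]

latest = {}  # id -> latest amount; one entry per distinct id, so it grows without bound
for raw in messages:
    record = json.loads(raw)
    if record["retract"] == "true":
        # a retract message cancels the previously emitted value for this id
        latest.pop(record["id"], None)
    else:
        latest[record["id"]] = record["amount"]

print(latest)  # {'1': 100, '2': 200} once the retried job has replayed its output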