Hi Experts,
        In batch computing, there are products like Azkaban or Airflow to 
manage batch job dependencies. By using such a dependency management tool, we can 
build a large-scale system consisting of many small jobs.
        In stream processing, it is not practical to put all dependencies in 
one job, because that would make the job too complicated and its state too 
large. I want to build a large-scale real-time system consisting of many Kafka 
sources and many streaming jobs, and the first question I face is how to build 
the dependencies and connections between the streaming jobs.
        The only method I can think of is a self-implemented retracting Kafka 
sink, with each pair of streaming jobs connected by a Kafka topic. Because each 
job may fail and retry, the messages in a Kafka topic may look like this:
        { "retract":"false", "id":"1", "amount":100 }
        { "retract":"false", "id":"2", "amount":200 }
        { "retract":"true", "id":"1", "amount":100 }
        { "retract":"true", "id":"2", "amount":200 }
        { "retract":"false", "id":"1", "amount":100 }
        { "retract":"false", "id":"2", "amount":200 }
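        As a rough sketch of how such a topic could be exposed to a downstream 
SQL job (this assumes Flink SQL with its Kafka/JSON connector; the connector 
options and the proc_time column are only my assumptions, nothing that exists 
yet):
                create table topic_1 (
                        retract string,
                        id string,
                        amount int,
                        -- processing-time attribute, assumed here so that a
                        -- deduplication query could order the rows per id
                        proc_time as proctime()
                ) with (
                        'connector' = 'kafka',
                        'topic' = 'topic_1',
                        'properties.bootstrap.servers' = 'localhost:9092',
                        'format' = 'json',
                        'scan.startup.mode' = 'earliest-offset'
                );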
        If the topic is "topic_1", the SQL in the downstream job may look like 
this:
                select 
                        id, last_value(amount) 
                from topic_1
                where retract = 'false'
                group by id
        
        But this also creates a large state, because every id has to be kept in 
the group-by state.
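        A deduplication query (again only a sketch, assuming Flink SQL and the 
proc_time attribute from the table sketch above) would keep just the newest row 
per id instead of a full aggregate, but, like the group by, its state still 
grows with the number of distinct ids:
                select id, amount
                from (
                        select id, amount,
                                row_number() over (partition by id order by proc_time desc) as rn
                        from topic_1
                        where retract = 'false'
                )
                where rn = 1;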
        I wonder whether using Kafka to connect streaming jobs is a workable 
approach, and how to build a large-scale real-time system consisting of many 
streaming jobs?
        
        Thanks a lot.

Best
Henry
