Hi Henry,

Apache Kafka or other message queues like Apache Pulsar or AWS Kinesis are in general the most common way to connect multiple streaming jobs. The dependencies between streaming jobs are, in my experience, of a different nature though. For batch jobs it makes sense to schedule one after the other, or to model more complicated relationships between them. Streaming jobs all process data continuously, so the "coordination" happens on a different level.
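As a minimal sketch of what I mean (assuming Flink 1.8 with the Kafka 0.11 connector; topic names, broker address and the trivial filter are just placeholders), a job in such a chain simply consumes the output topic of the upstream job and produces its own output topic, which in turn is the input of the next job:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;

public class ChainedJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "downstream-job");

        env
            // the upstream job's output topic is this job's source
            .addSource(new FlinkKafkaConsumer011<>("topic_upstream", new SimpleStringSchema(), props))
            // job-specific processing goes here; a trivial filter as a stand-in
            .filter(value -> !value.isEmpty())
            // this job's output topic is in turn the source of the next job
            .addSink(new FlinkKafkaProducer011<>("localhost:9092", "topic_downstream", new SimpleStringSchema()));

        env.execute("chained streaming job sketch");
    }
}

Note that there is no scheduler involved: each job is deployed and recovers independently, and the topics are the only contract between them.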
To avoid duplication, you can use the Kafka exactly-once sink, but this comes with a latency penalty (transactions are only committed on checkpoint completion); see the P.S. at the bottom of this mail for a rough sketch. Generally, I would advise to always attach meaningful timestamps to your records, so that you can use watermarking [1] to trade off latency against completeness. These timestamps can also be used to identify late records (resulting from catching up after a recovery), which downstream jobs should ignore. There are also users who assign a unique ID to every message going through their system and only use idempotent (set) operations within Flink, because messages are sometimes already duplicated before they reach the stream processor. For downstream jobs whose upstream job might duplicate records, this could be a viable, yet limiting, approach as well.

Hope this helps, and let me know what you think.

Cheers,

Konstantin

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html#event-time-and-watermarks

On Thu, May 30, 2019 at 11:39 AM 徐涛 <happydexu...@gmail.com> wrote:

> Hi Experts,
> In batch computing, there are products like Azkaban or Airflow to manage batch job dependencies. By using such a dependency management tool, we can build a large-scale system consisting of small jobs.
> In stream processing, it is not practical to put all dependencies into one job, because that would make the job too complicated and the state too large. I want to build a large-scale realtime system consisting of many Kafka sources and many streaming jobs, but the first thing I have to figure out is how to build the dependencies and connections between the streaming jobs.
> The only method I can think of is a self-implemented retract Kafka sink, with each streaming job connected to the next by a Kafka topic. But because each job may fail and retry, the messages in a Kafka topic may look like this:
>
> { "retract": "false", "id": "1", "amount": 100 }
> { "retract": "false", "id": "2", "amount": 200 }
> { "retract": "true",  "id": "1", "amount": 100 }
> { "retract": "true",  "id": "2", "amount": 200 }
> { "retract": "false", "id": "1", "amount": 100 }
> { "retract": "false", "id": "2", "amount": 200 }
>
> If the topic is "topic_1", the SQL in the downstream job may look like this:
>
> select
>     id, latest(amount)
> from topic_1
> where retract = 'false'
> group by id
>
> But this also creates big state, because the records are grouped by id.
> I wonder whether using Kafka to connect streaming jobs is a viable approach, and how to build a large-scale realtime system consisting of many streaming jobs?
>
> Thanks a lot.
>
> Best
> Henry
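P.S. Here is a rough sketch of the exactly-once Kafka sink setup I mentioned above, just to illustrate the idea. It assumes Flink 1.8 with the Kafka 0.11 connector (FlinkKafkaProducer011); the topic name, broker address and checkpoint interval are placeholders.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class ExactlyOnceSinkSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Transactions are only committed when a checkpoint completes, so the
        // checkpoint interval is effectively the latency penalty mentioned above.
        env.enableCheckpointing(60_000);

        // stand-in for the actual job output
        DataStream<String> results = env.fromElements(
                "{\"id\":\"1\",\"amount\":100}",
                "{\"id\":\"2\",\"amount\":200}");

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        // must cover checkpoint interval plus expected recovery time, otherwise
        // Kafka aborts the transaction before Flink can commit it
        props.setProperty("transaction.timeout.ms", "900000");

        results.addSink(new FlinkKafkaProducer011<>(
                "topic_1",
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                props,
                FlinkKafkaProducer011.Semantic.EXACTLY_ONCE)); // transactional writes

        env.execute("exactly-once Kafka sink sketch");
    }
}

The downstream consumers then also need to read with isolation.level=read_committed (passed via the consumer properties), otherwise they will still see records from transactions that were aborted after a failure.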