Hi Knauf,

The solution I can think of to coordinate different streaming jobs is as follows. Suppose there are two streaming jobs, Job_1 and Job_2:

Job_1: reads data from the original Kafka topic (TOPIC_ORIG, for example) and sinks the result to another Kafka topic (TOPIC_JOB_1_SINK, for example). A few things to mention:
① I implement a retract Kafka sink.
② I do not use the Kafka exactly-once sink.
③ Every record in TOPIC_JOB_1_SINK has one unique key.
④ Records with the same key are sent to the same Kafka partition.

Job_2: reads data from TOPIC_JOB_1_SINK, first groups by the unique key and keeps only the latest value, then continues with the logic of Job_2, and finally sinks the data to the final sink (ES, HBase, MySQL, for example). I group by the unique key first because Job_1 may fail and retry, so some dirty (duplicated) data may end up in TOPIC_JOB_1_SINK; a rough sketch of this dedup step is below.
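To make the dedup step of Job_2 concrete, here is a rough sketch of what I have in mind. The Record POJO, its field names, and the operator class name below are only placeholders for illustration, not the exact code I run:

    import java.util.Objects;

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class DedupLatestByKey {

        /** Placeholder for one message in TOPIC_JOB_1_SINK. */
        public static class Record {
            public String id;        // the unique key (point ③)
            public boolean retract;  // flag written by my retract Kafka sink
            public long amount;

            @Override
            public boolean equals(Object o) {
                if (!(o instanceof Record)) {
                    return false;
                }
                Record r = (Record) o;
                return retract == r.retract
                        && amount == r.amount
                        && Objects.equals(id, r.id);
            }

            @Override
            public int hashCode() {
                return Objects.hash(id, retract, amount);
            }
        }

        /** Keeps the latest value per unique key and drops duplicates from Job_1 retries. */
        public static class LatestByKeyFunction
                extends KeyedProcessFunction<String, Record, Record> {

            // Latest record seen for the current unique key.
            private transient ValueState<Record> latest;

            @Override
            public void open(Configuration parameters) {
                latest = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("latest-record", Record.class));
            }

            @Override
            public void processElement(Record in, Context ctx, Collector<Record> out)
                    throws Exception {
                Record previous = latest.value();
                // Duplicates caused by Job_1 failing and retrying are simply ignored;
                // any genuinely new value (including retractions) is forwarded to the
                // rest of the Job_2 logic.
                if (previous == null || !previous.equals(in)) {
                    latest.update(in);
                    out.collect(in);
                }
            }
        }
    }

It would be wired in roughly as stream.keyBy(r -> r.id).process(new DedupLatestByKey.LatestByKeyFunction()), so all records with the same unique key (point ④) are handled by the same keyed state.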
So from the overview:

 Job_1                                               Job_2
 -------------------------------------------------        ------------------------------------------------------------------------------------------
 | TOPIC_ORIG -> Logic_Job_1 -> TOPIC_JOB_1_SINK  |  ——>   | TOPIC_JOB_1_SINK -> GROUP_BY_UNIQUE_KEY_GET_LATEST -> Logic_Job_2 -> FINAL_JOB_2_SINK  |
 -------------------------------------------------        ------------------------------------------------------------------------------------------

Would you please help review this solution? If there are better solutions, kindly let me know. Thank you.

Best
Henry

> On 3 Jun 2019, at 16:01, Konstantin Knauf <konstan...@ververica.com> wrote:
>
> Hi Henry,
>
> Apache Kafka or other message queues like Apache Pulsar or AWS Kinesis are in general the most common way to connect multiple streaming jobs. The dependencies between streaming jobs are, in my experience, of a different nature though. For batch jobs, it makes sense to schedule one after the other or to have more complicated relationships. Streaming jobs are all processing data continuously, so the "coordination" happens on a different level.
>
> To avoid duplication, you can use the Kafka exactly-once sink, but this comes with a latency penalty (transactions are only committed on checkpoint completion).
>
> Generally, I would advise to always attach meaningful timestamps to your records, so that you can use watermarking [1] to trade off between latency and completeness. These could also be used to identify late records (resulting from catch-up after recovery), which should be ignored by downstream jobs.
>
> There are other users who assign a unique ID to every message going through their system and only use idempotent operations (set operations) within Flink, because messages are sometimes already duplicated before reaching the stream processor. For downstream jobs, where an upstream job might duplicate records, this could be a viable, yet limiting, approach as well.
>
> Hope this helps and let me know what you think.
>
> Cheers,
>
> Konstantin
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html#event-time-and-watermarks
>
> On Thu, May 30, 2019 at 11:39 AM 徐涛 <happydexu...@gmail.com> wrote:
> Hi Experts,
> In batch computing, there are products like Azkaban or Airflow to manage batch job dependencies. By using a dependency management tool, we can build a large-scale system consisting of small jobs.
> In stream processing, it is not practical to put all dependencies in one job, because it would make the job too complicated and the state too large. I want to build a large-scale realtime system consisting of many Kafka sources and many streaming jobs, but the first thing I can think of is how to build the dependencies and connections between the streaming jobs.
> The only method I can think of is using a self-implemented retract Kafka sink, with each streaming job connected by a Kafka topic.
> But because each job may fail and retry, the messages in the Kafka topic may, for example, look like this:
> { "retract": "false", "id": "1", "amount": 100 }
> { "retract": "false", "id": "2", "amount": 200 }
> { "retract": "true",  "id": "1", "amount": 100 }
> { "retract": "true",  "id": "2", "amount": 200 }
> { "retract": "false", "id": "1", "amount": 100 }
> { "retract": "false", "id": "2", "amount": 200 }
> If the topic is "topic_1", the SQL in the downstream job may look like this:
> select
>     id, latest(amount)
> from topic_1
> where retract = "false"
> group by id
>
> But it will also create a big state, because every id is being grouped.
> I wonder whether using Kafka to connect streaming jobs is a viable approach, and how to build a large-scale realtime system consisting of many streaming jobs?
>
> Thanks a lot.
>
> Best
> Henry
>
>
> --
> Konstantin Knauf | Solutions Architect
> +49 160 91394525
>
> Follow us @VervericaData
> --
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time