Hi Knauf,

The solution I can think of to coordinate different streaming jobs is as follows. Suppose there are two streaming jobs, Job_1 and Job_2:

Job_1: reads data from the original Kafka topic (TOPIC_ORIG, for example) and sinks the result to another Kafka topic (TOPIC_JOB_1_SINK, for example). A few things to mention:
① I implement a retract Kafka sink.
② I do not use the Kafka exactly-once sink.
③ Every record in TOPIC_JOB_1_SINK has one unique key.
④ Records with the same key are sent to the same Kafka partition.

Job_2: reads data from TOPIC_JOB_1_SINK, first groups by the unique key and keeps only the latest value, then continues with the logic of Job_2, and finally sinks the data to the final sink (ES, HBase, MySQL, for example). I group by the unique key first because Job_1 may fail and retry, so some dirty (duplicated) data may end up in TOPIC_JOB_1_SINK; a rough sketch of this dedup step is below.
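To make the dedup step of Job_2 concrete, here is a rough sketch of what I have in mind. The Record POJO, its field names, and the operator class name below are only placeholders for illustration, not the exact code I run:

    import java.util.Objects;

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class DedupLatestByKey {

        /** Placeholder for one message in TOPIC_JOB_1_SINK. */
        public static class Record {
            public String id;        // the unique key (point ③)
            public boolean retract;  // flag written by my retract Kafka sink
            public long amount;

            @Override
            public boolean equals(Object o) {
                if (!(o instanceof Record)) {
                    return false;
                }
                Record r = (Record) o;
                return retract == r.retract
                        && amount == r.amount
                        && Objects.equals(id, r.id);
            }

            @Override
            public int hashCode() {
                return Objects.hash(id, retract, amount);
            }
        }

        /** Keeps the latest value per unique key and drops duplicates from Job_1 retries. */
        public static class LatestByKeyFunction
                extends KeyedProcessFunction<String, Record, Record> {

            // Latest record seen for the current unique key.
            private transient ValueState<Record> latest;

            @Override
            public void open(Configuration parameters) {
                latest = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("latest-record", Record.class));
            }

            @Override
            public void processElement(Record in, Context ctx, Collector<Record> out)
                    throws Exception {
                Record previous = latest.value();
                // Duplicates caused by Job_1 failing and retrying are simply ignored;
                // any genuinely new value (including retractions) is forwarded to the
                // rest of the Job_2 logic.
                if (previous == null || !previous.equals(in)) {
                    latest.update(in);
                    out.collect(in);
                }
            }
        }
    }

It would be wired in roughly as stream.keyBy(r -> r.id).process(new DedupLatestByKey.LatestByKeyFunction()), so all records with the same unique key (point ④) are handled by the same keyed state.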
So from the overview:

 Job_1                                               Job_2
 -------------------------------------------------        ------------------------------------------------------------------------------------------
 | TOPIC_ORIG -> Logic_Job_1 -> TOPIC_JOB_1_SINK  |  ——>   | TOPIC_JOB_1_SINK -> GROUP_BY_UNIQUE_KEY_GET_LATEST -> Logic_Job_2 -> FINAL_JOB_2_SINK  |
 -------------------------------------------------        ------------------------------------------------------------------------------------------

Would you please help review this solution? If there are better solutions, kindly let me know. Thank you.

Best
Henry

> On 3 Jun 2019, at 16:01, Konstantin Knauf <konstan...@ververica.com> wrote:
>
> Hi Henry,
>
> Apache Kafka or other message queues like Apache Pulsar or AWS Kinesis are in general the most common way to connect multiple streaming jobs. The dependencies between streaming jobs are, in my experience, of a different nature though. For batch jobs, it makes sense to schedule one after the other or to have more complicated relationships. Streaming jobs are all processing data continuously, so the "coordination" happens on a different level.
>
> To avoid duplication, you can use the Kafka exactly-once sink, but this comes with a latency penalty (transactions are only committed on checkpoint completion).
>
> Generally, I would advise to always attach meaningful timestamps to your records, so that you can use watermarking [1] to trade off between latency and completeness. These could also be used to identify late records (resulting from catch-up after recovery), which should be ignored by downstream jobs.
>
> There are other users who assign a unique ID to every message going through their system and only use idempotent operations (set operations) within Flink, because messages are sometimes already duplicated before reaching the stream processor. For downstream jobs, where an upstream job might duplicate records, this could be a viable, yet limiting, approach as well.
>
> Hope this helps and let me know what you think.
>
> Cheers,
>
> Konstantin
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html#event-time-and-watermarks
>
> On Thu, May 30, 2019 at 11:39 AM 徐涛 <happydexu...@gmail.com> wrote:
> Hi Experts,
> In batch computing, there are products like Azkaban or Airflow to manage batch job dependencies. By using a dependency management tool, we can build a large-scale system consisting of small jobs.
> In stream processing, it is not practical to put all dependencies in one job, because it would make the job too complicated and the state too large. I want to build a large-scale realtime system consisting of many Kafka sources and many streaming jobs, but the first thing I can think of is how to build the dependencies and connections between the streaming jobs.
> The only method I can think of is using a self-implemented retract Kafka sink, with each streaming job connected by a Kafka topic.
> But because each job may fail and retry, the messages in the Kafka topic may, for example, look like this:
> { "retract": "false", "id": "1", "amount": 100 }
> { "retract": "false", "id": "2", "amount": 200 }
> { "retract": "true",  "id": "1", "amount": 100 }
> { "retract": "true",  "id": "2", "amount": 200 }
> { "retract": "false", "id": "1", "amount": 100 }
> { "retract": "false", "id": "2", "amount": 200 }
> If the topic is "topic_1", the SQL in the downstream job may look like this:
> select
>     id, latest(amount)
> from topic_1
> where retract = "false"
> group by id
>
> But it will also create a big state, because every id is being grouped.
> I wonder whether using Kafka to connect streaming jobs is a viable approach, and how to build a large-scale realtime system consisting of many streaming jobs?
>
> Thanks a lot.
>
> Best
> Henry
>
>
> --
> Konstantin Knauf | Solutions Architect
> +49 160 91394525
>
> Follow us @VervericaData
> --
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time