Re: [DISCUSS]FLIP-150: Introduce Hybrid Source

Aljoscha Krettek Fri, 08 Jan 2021 04:07:20 -0800

Hi Nicholas,

Thanks for starting the discussion!

I think we might be able to simplify this a bit and re-use existing functionality.

There is already `Source.restoreEnumerator()` and `SplitEnumerator.snapshotState()`. This seems to be roughly what the Hybrid Source needs. When the initial source finishes, we can take a snapshot (which should include the data that the follow-up sources need for initialization). Then we need a function that maps the enumerator checkpoint type of the initial source to that of the new source, and we are good to go. We wouldn't need to introduce any additional interfaces for sources to implement, which would fragment the ecosystem into sources that can be used in a Hybrid Source and sources that cannot.
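
As a rough sketch of that switch-over step (not actual Flink code): `SimpleEnumerator` and `SimpleSource` below are simplified, hypothetical stand-ins for `SplitEnumerator` and `Source`, reduced to the two methods mentioned above. The checkpoint classes, the `switchSource` helper, and the timestamp-based conversion are likewise assumptions made only for illustration.

```java
import java.util.function.Function;

public class HybridSwitchSketch {

    // Hypothetical stand-in for SplitEnumerator: only the snapshot method.
    interface SimpleEnumerator<C> {
        C snapshotState();
    }

    // Hypothetical stand-in for Source: only the restore method.
    interface SimpleSource<C> {
        SimpleEnumerator<C> restoreEnumerator(C checkpoint);
    }

    // Assumed checkpoint of a bounded, file-like source.
    static final class FileCheckpoint {
        final long lastTimestampMillis;

        FileCheckpoint(long lastTimestampMillis) {
            this.lastTimestampMillis = lastTimestampMillis;
        }
    }

    // Assumed checkpoint of an unbounded, log-like source.
    static final class StreamCheckpoint {
        final long startTimestampMillis;

        StreamCheckpoint(long startTimestampMillis) {
            this.startTimestampMillis = startTimestampMillis;
        }
    }

    // Snapshot the finished enumerator, map the checkpoint to the type the
    // next source understands, and restore the next enumerator from it.
    static <A, B> SimpleEnumerator<B> switchSource(
            SimpleEnumerator<A> finished,
            Function<A, B> checkpointConverter,
            SimpleSource<B> next) {
        A finalSnapshot = finished.snapshotState();
        B initialState = checkpointConverter.apply(finalSnapshot);
        return next.restoreEnumerator(initialState);
    }

    public static void main(String[] args) {
        // The initial source finished after reading records up to t = 1000.
        SimpleEnumerator<FileCheckpoint> fileEnumerator = () -> new FileCheckpoint(1_000L);

        // The follow-up source simply resumes from whatever checkpoint it is given.
        SimpleSource<StreamCheckpoint> streamSource = checkpoint -> () -> checkpoint;

        SimpleEnumerator<StreamCheckpoint> streamEnumerator = switchSource(
                fileEnumerator,
                fc -> new StreamCheckpoint(fc.lastTimestampMillis + 1),
                streamSource);

        // Prints 1001: the stream starts right after the last file record.
        System.out.println(streamEnumerator.snapshotState().startTimestampMillis);
    }
}
```

The only piece specific to a given pair of sources is the converter function; neither source has to implement anything beyond what it already provides.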


What do you think?

Best,
Aljoscha

On 2020/11/03 02:34, Nicholas Jiang wrote:

Hi devs,

I'd like to start a new FLIP to introduce the Hybrid Source. The hybrid
source is a source that contains a list of concrete sources. The hybrid
source reads from each contained source in the defined order. It switches
from source A to the next source B when source A finishes.

In practice, many Flink jobs need to read data from multiple sources in
sequential order. Change Data Capture (CDC) and machine learning feature
backfill are two concrete scenarios of this consumption pattern. Users may
have to either run two different Flink jobs or have some hacks in the
SourceFunction to address such use cases.

To support the above scenarios smoothly, Flink jobs need to first read
historical data from HDFS and then switch to Kafka for real-time records. The
hybrid source has several benefits from the user's perspective:

- Switching among multiple sources is easy, based on the switchable source
implementations of different connectors.
- Automatic switching is supported for user-defined switchable sources
that make up a hybrid source.
- There is a complete and effective mechanism to support smooth source
migration between historical and real-time data.

Therefore, in this discussion, we propose to introduce a “Hybrid Source” API
built on top of the new Source API (FLIP-27) to help users to smoothly
switch sources. For more detail, please refer to the FLIP design doc[1].

I'm looking forward to your feedback.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-150%3A+Introduce+Hybrid+Source

Best,
Nicholas Jiang



