Re: [DISCUSS]FLIP-150: Introduce Hybrid Source

Arvid Heise Mon, 07 Jun 2021 08:07:21 -0700

Sorry for joining the party so late, but it's such an interesting FLIP with
a huge impact that I wanted to add my 2 cents. [1]
I'm mirroring some basic question from the PR review to this thread because
it's about the name:


We could rename the thing to ConcatenatedSource(s), SourceSequence, or
similar.
Hybrid has the connotation of 2 for me (maybe because I'm a non-native) and
does not carry the concatentation concept as well (hybrid sounds to me more
like the source would constantly switch back and forth).

Could we take a few minutes to think if this is the most intuitive name for
new users? I'm especially hoping that natives might give some ideas (or
declare that Hybrid is perfect).

[1] https://github.com/apache/flink/pull/15924#pullrequestreview-677376664

On Sun, Jun 6, 2021 at 7:47 PM Steven Wu <[email protected]> wrote:

> > Converter function relies on the specific enumerator capabilities to set
> the new start position (e.g.
> fileSourceEnumerator.getEndTimestamp() and
> kafkaSourceEnumerator.setTimestampOffsetsInitializer(..)
>
> I guess the premise is that a converter is for a specific tuple of
> (upstream source, downstream source) . We don't have to define generic
> EndtStateT and SwitchableEnumerator interfaces. That should work.
>
> The benefit of defining EndtStateT and SwitchableEnumerator interfaces is
> probably promoting uniformity across sources that support hybrid/switchable
> source.
>
> On Sun, Jun 6, 2021 at 10:22 AM Thomas Weise <[email protected]> wrote:
>
> > Hi Steven,
> >
> > Thank you for the thorough review of the PR and for bringing this back
> > to the mailing list.
> >
> > All,
> >
> > I updated the FLIP-150 page to highlight aspects in which the PR
> > deviates from the original proposal [1]. The goal would be to update
> > the FLIP soon and bring it to a vote, as previously suggested offline
> > by Nicholas.
> >
> > A few minor issues in the PR are outstanding and I'm working on test
> > coverage for the recovery behavior, which should be completed soon.
> >
> > The dynamic position transfer needs to be concluded before we can move
> > forward however.
> >
> > There have been various ideas, including the special
> > "SwitchableEnumerator" interface, using enumerator checkpoint state or
> > an enumerator interface extension to extract the end state.
> >
> > One goal in the FLIP is to "Reuse the existing Source connectors built
> > with FLIP-27 without any change." and I think it is important to honor
> > that goal given that fixed start positions do not require interface
> > changes.
> >
> > Based on the feedback the following might be a good solution for
> > runtime position transfer:
> >
> > * User supplies the optional converter function (not applicable for
> > fixed positions).
> > * Instead of relying on the enumerator checkpoint state [2], the
> > converter function will be supplied with the current and next
> > enumerator (source.createEnumerator).
> > * Converter function relies on the specific enumerator capabilities to
> > set the new start position (e.g.
> > fileSourceEnumerator.getEndTimestamp() and
> > kafkaSourceEnumerator.setTimestampOffsetsInitializer(..)
> > * HybridSourceSplitEnumerator starts new underlying enumerator
> >
> > With this approach, there is no need to augment FLIP-27 interfaces and
> > custom source capabilities are easier to integrate. Removing the
> > mandate to rely on enumerator checkpoint state also avoids potential
> > upgrade/compatibility issues.
> >
> > Thoughts?
> >
> > Thanks,
> > Thomas
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-150%3A+Introduce+Hybrid+Source#FLIP150:IntroduceHybridSource-Prototypeimplementation
> > [2]
> >
> https://github.com/apache/flink/pull/15924/files#diff-e07478b3cad9810925ec784b61ec0026396839cc5b27bd6d337a1dea05e999eaR281
> >
> >
> > On Tue, Jun 1, 2021 at 3:10 PM Steven Wu <[email protected]> wrote:
> > >
> > > discussed the PR with Thosmas offline. Thomas, please correct me if I
> > > missed anything.
> > >
> > > Right now, the PR differs from the FLIP-150 doc regarding the
> converter.
> > > * Current PR uses the enumerator checkpoint state type as the input for
> > the
> > > converter
> > > * FLIP-150 defines a new EndStateT interface.
> > > It seems that the FLIP-150 approach of EndStateT is more flexible, as
> > > transition EndStateT doesn't have to be included in the upstream source
> > > checkpoint state.
> > >
> > > Let's look at two use cases:
> > > 1) static cutover time at 5 pm. File source reads all data btw 9 am - 5
> > pm,
> > > then Kafka source starts with initial position of 5 pm. In this case,
> > there
> > > is no need for converter or EndStateT since the starting time for Kafka
> > > source is known and fixed.
> > > 2) dynamic cutover time at 1 hour before now. This is useful when the
> > > bootstrap of historic data takes a long time (like days or weeks) and
> we
> > > don't know the exact time of cutover when a job is launched. Instead,
> we
> > > are instructing the file source to stop when it gets close to live
> data.
> > In
> > > this case, hybrid source construction will specify a relative time (now
> > - 1
> > > hour), the EndStateT (of file source) will be resolved to an absolute
> > time
> > > for cutover. We probably don't need to include EndStateT (end
> timestamp)
> > as
> > > the file source checkpoint state. Hence, the separate EndStateT is
> > probably
> > > more desirable.
> > >
> > > We also discussed the converter for the Kafka source. Kafka source
> > supports
> > > different OffsetsInitializer impls (including
> > TimestampOffsetsInitializer).
> > > To support the dynamic cutover time (use case #2 above), we can plug
> in a
> > > SupplierTimestampOffsetInitializer, where the starting timestamp is not
> > set
> > > during source/job construction. Rather it is a supplier model where the
> > > starting timestamp value is set to the resolved absolute timestamp
> during
> > > switch.
> > >
> > > Thanks,
> > > Steven
> > >
> > >
> > >
> > > On Thu, May 20, 2021 at 8:59 PM Thomas Weise <[email protected]> wrote:
> > >
> > > > Hi Nicholas,
> > > >
> > > > Thanks for taking a look at the PR!
> > > >
> > > > 1. Regarding switching mechanism:
> > > >
> > > > There has been previous discussion in this thread regarding the pros
> > > > and cons of how the switching can be exposed to the user.
> > > >
> > > > With fixed start positions, no special switching interface to
> transfer
> > > > information between enumerators is required. Sources are configured
> as
> > > > they would be when used standalone and just plugged into
> HybridSource.
> > > > I expect that to be a common use case. You can find an example for
> > > > this in the ITCase:
> > > >
> > > >
> > > >
> >
> https://github.com/apache/flink/pull/15924/files#diff-fe1407d135d7b7b3a72aeb4471ab53ccd2665a58ff0129a83db1ec19cea06f1bR101
> > > >
> > > > For dynamic start position, the checkpoint state is used to transfer
> > > > information from old to new enumerator. An example for that can be
> > > > found here:
> > > >
> > > >
> > > >
> >
> https://github.com/apache/flink/pull/15924/files#diff-fe1407d135d7b7b3a72aeb4471ab53ccd2665a58ff0129a83db1ec19cea06f1bR112-R136
> > > >
> > > > That may look verbose, but the code to convert from one state to
> > > > another can be factored out into a utility and the function becomes a
> > > > one-liner.
> > > >
> > > > For common sources like files and Kafka we can potentially (later)
> > > > implement the conversion logic as part of the respective connector's
> > > > checkpoint and split classes.
> > > >
> > > > I hope that with the PR up for review, we can soon reach a conclusion
> > > > on how we want to expose this to the user.
> > > >
> > > > Following is an example for Files -> Files -> Kafka that I'm using
> for
> > > > e2e testing. It exercises both ways of setting the start position.
> > > >
> > > > https://gist.github.com/tweise/3139d66461e87986f6eddc70ff06ef9a
> > > >
> > > >
> > > > 2. Regarding the events used to implement the actual switch between
> > > > enumerator and readers: I updated the PR with javadoc to clarify the
> > > > intent. Please let me know if that helps or let's continue to discuss
> > > > those details on the PR?
> > > >
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Mon, May 17, 2021 at 1:03 AM Nicholas Jiang <[email protected]>
> > > > wrote:
> > > > >
> > > > > Hi Thomas,
> > > > >
> > > > >    Sorry for later reply for your POC. I have reviewed the based
> > abstract
> > > > > implementation of your pull request:
> > > > > https://github.com/apache/flink/pull/15924. IMO, for the switching
> > > > > mechanism, this level of abstraction is not concise enough, which
> > doesn't
> > > > > make connector contribution easier. In theory, it is necessary to
> > > > introduce
> > > > > a set of interfaces to support the switching mechanism. The
> > > > SwitchableSource
> > > > > and SwitchableSplitEnumerator interfaces are needed for connector
> > > > > expansibility.
> > > > >    In other words, the whole switching process of above mentioned
> PR
> > is
> > > > > different from that mentioned in FLIP-150. In the above
> > implementation,
> > > > the
> > > > > source reading switching is executed after receving the
> > > > SwitchSourceEvent,
> > > > > which could be before the sending SourceReaderFinishEvent. This
> > timeline
> > > > of
> > > > > source reading switching could be discussed here.
> > > > >    @Stephan @Becket, if you are available, please help to review
> the
> > > > > abstract implementation, and compare with the interfaces mentioned
> in
> > > > > FLIP-150.
> > > > >
> > > > > Thanks,
> > > > > Nicholas Jiang
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sent from:
> > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> > > >
> >
>

Re: [DISCUSS]FLIP-150: Introduce Hybrid Source

Reply via email to