Hi Devin,

The complexity here is orchestrating multiple functions that notionally form the same connector. Stopping a batch connector, for example, would be equivalent to stopping two functions, with the corresponding complexity of dealing with a failure in between the two calls. The same goes for the other lifecycle calls.
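To make the trade-off concrete, here is a minimal sketch of the two-function composition discussed further down in the thread: a one-instance "discovery" function that publishes tasks to an intermediate topic, and "execution" instances that consume and process those tasks. This is plain Java with the topic modeled as an in-memory queue; the class and method names are hypothetical and not part of any Pulsar API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: in Pulsar the "task topic" would be the discovery
// function's output topic; here it is just an in-memory queue.
public class TwoPhaseBatchSourceSketch {

    // Phase 1: a one-instance "discovery" function enumerates the work
    // (e.g. files that appeared since the last run), one task per item.
    static List<String> discoverTasks() {
        return List.of("task-file-a", "task-file-b", "task-file-c");
    }

    // Phase 2: an "execution" function consumes one task and processes the
    // records it points at; here it just tags the task as processed.
    static String executeTask(String task) {
        return "processed:" + task;
    }

    public static void main(String[] args) {
        // Discovery publishes its tasks to the intermediate topic.
        Queue<String> taskTopic = new ArrayDeque<>(discoverTasks());

        // Execution instances drain the topic independently of discovery.
        List<String> results = new ArrayList<>();
        while (!taskTopic.isEmpty()) {
            results.add(executeTask(taskTopic.poll()));
        }
        System.out.println(results);
        // prints [processed:task-file-a, processed:task-file-b, processed:task-file-c]
    }
}
```

Note what the sketch deliberately elides: the two functions must be submitted, stopped, and updated together as one connector, and a failure between those two operations leaves the pair in an inconsistent state. That atomicity gap is exactly the management complexity described above.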
On Thu, May 21, 2020 at 7:41 PM Devin Bost <devin.b...@gmail.com> wrote:

> I apologize for not fully understanding the context here, but is the
> concern about using the existing function architecture the complexity of
> needing two sequential operations in a function flow to be synchronous
> with respect to transactions, such as to avoid race conditions and
> issues with parallelism that could result from them being
> transactionally independent of one another?
>
> Perhaps there's a larger use case here that could be represented with a
> pattern.
>
> --
> Devin G. Bost
>
> On Wed, May 20, 2020, 7:19 PM Sijie Guo <guosi...@gmail.com> wrote:
>
> > Hi Jerry,
> >
> > I understand the concerns. I think it falls into a broader discussion
> > of function composition.
> >
> > I am fine with the current proposal. But I wish that we don't
> > introduce a lot of specialized code in the runtime just to handle this
> > use case. It would be better if we could reuse the existing function
> > framework, because it will be easier to implement such
> > multiple-function functionality with a more general function
> > composition approach.
> >
> > - Sijie
> >
> > On Wed, May 20, 2020 at 3:54 PM Jerry Peng
> > <jerry.boyang.p...@gmail.com> wrote:
> >
> > > Hi Sijie,
> > >
> > > We have considered a two-stage function as a way to implement a
> > > "batch" source; however, because there are two independent
> > > functions, it adds complexity to management, especially when there
> > > are failures. The two functions would need to be submitted and
> > > registered in an atomic fashion, which cannot be guaranteed in the
> > > current framework. Moreover, all CRUD operations on the "batch"
> > > source would also need to be atomic across the two functions. The
> > > other approach is to implement logic that runs two separate instance
> > > types for a single function. One type of function instance runs the
> > > "discovery" phase, and the other type reads the tasks produced by
> > > the discovery phase and executes them. That approach, though, is
> > > very similar to the proposed one.
> > >
> > > I think the most important thing right now is to make sure the
> > > interface for the batch source is appropriate, since it is hard to
> > > change in the future. How the execution works in the backend can
> > > always be modified/optimized later.
> > >
> > > On Wed, May 20, 2020 at 3:28 PM Sijie Guo <guosi...@gmail.com>
> > > wrote:
> > >
> > > > Hi Sanjeev,
> > > >
> > > > Just a couple of thoughts here. It seems to me that the
> > > > BatchSource API is a bit complicated, and that it could be
> > > > achieved using the existing functions framework:
> > > >
> > > > - BatchSourceTrigger: can be implemented as a one-instance
> > > > function that discovers the batch source tasks and publishes the
> > > > discovered tasks to its output topic.
> > > > - BatchSource: can be implemented as a function that receives the
> > > > batch source tasks and executes each source task.
> > > >
> > > > So it seems this can be achieved with the existing framework by
> > > > combining two functions. That would be a much clearer approach
> > > > and would keep the function & connector APIs relatively simple
> > > > and consistent. Thoughts?
> > > >
> > > > - Sijie
> > > >
> > > > On Wed, May 20, 2020 at 8:33 AM Sanjeev Kulkarni
> > > > <sanjee...@gmail.com> wrote:
> > > >
> > > > > Pinging the community about this. Would love feedback on this.
> > > > > Thanks!
> > > > >
> > > > > On Wed, May 13, 2020 at 10:34 PM Sanjeev Kulkarni
> > > > > <sanjee...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > The current interfaces for sources in Pulsar IO are geared
> > > > > > towards streaming sources, where data is available on a
> > > > > > continuous basis. There is a whole class of data sources
> > > > > > where data is not available in a continuous/streaming
> > > > > > fashion, but rather arrives periodically or in spurts. These
> > > > > > "batch sources" share a set of common characteristics that
> > > > > > might warrant framework-level support in Pulsar IO.
> > > > > >
> > > > > > Jerry and I have jotted down the ideas around this in PIP-65.
> > > > > > Please review it and let us know what you think.
> > > > > >
> > > > > > https://github.com/apache/pulsar/wiki/PIP-65:-Adapting-Pulsar-IO-Sources-to-support-Batch-Sources
> > > > > >
> > > > > > Thanks!