Hi Sijie,

We considered a two-stage function as a way to implement a "batch"
source, but because there are two independent functions, it adds
complexity to management, especially when there are failures. The two
functions would need to be submitted and registered atomically, which
cannot be guaranteed in the current framework. Moreover, all CRUD
operations for the "batch" source would also need to be atomic across
the two functions. Another approach is to implement logic that runs two
separate instance types for a single function: one instance type runs
the "discovery" phase, while the other reads the tasks produced by the
discovery phase and executes them. That approach, though, is very
similar to the proposed one.

I think the most important thing right now is to make sure the
interface for the batch source is appropriate, since it's hard to
change in the future. How the execution works in the backend can always
be modified or optimized later.
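For concreteness, here is a minimal sketch of the kind of two-phase
interface under discussion: a "discover" phase emits serialized tasks,
and a separate "read" phase consumes them. The names, signatures, and
the toy file-list source below are illustrative only, not the actual
PIP-65 API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, simplified shape of a batch source interface.
interface SimpleBatchSource<T> {
    void open(Map<String, Object> config) throws Exception;
    // Discovery phase: emit each discovered task to the consumer.
    void discover(Consumer<byte[]> taskEater) throws Exception;
    // Execution phase: prepare a task, then read records until null.
    void prepare(byte[] task) throws Exception;
    T readNext() throws Exception;
    void close() throws Exception;
}

// Toy implementation: "discovers" two file names as tasks, then emits
// one record (the file name) per prepared task.
class FileListBatchSource implements SimpleBatchSource<String> {
    private final Deque<String> pending = new ArrayDeque<>();

    public void open(Map<String, Object> config) {}

    public void discover(Consumer<byte[]> taskEater) {
        for (String task : List.of("data-2020-05-01.csv", "data-2020-05-02.csv")) {
            taskEater.accept(task.getBytes());
        }
    }

    public void prepare(byte[] task) {
        pending.add(new String(task));
    }

    public String readNext() {
        return pending.poll(); // null signals the current task is exhausted
    }

    public void close() {}
}
```

The point of a single interface like this is that the framework, not
the user, decides how to map the two phases onto instances, which keeps
the atomicity concerns internal to the runtime.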

On Wed, May 20, 2020 at 3:28 PM Sijie Guo <guosi...@gmail.com> wrote:

> Hi Sanjeev,
>
> Just a couple of thoughts here. It seems to me that the BatchSource API is
> a bit complicated and it can be achieved by using existing functions
> framework.
>
> - BatchSourceTrigger: can be implemented using a one-instance function.
> It is used for discovering the batch source tasks; the discovered tasks
> are published to its output topic.
> - BatchSource: can be implemented using a function that receives the
> batch source tasks and executes them.
>
> So it seems that this can be achieved using the existing framework by
> combining two functions together. It seems that we can achieve this
> with a much clearer approach and keep the function & connector API
> relatively simple and consistent. Thoughts?
>
> - Sijie
>
> On Wed, May 20, 2020 at 8:33 AM Sanjeev Kulkarni <sanjee...@gmail.com>
> wrote:
>
> > Pinging the community about this. Would love feedback on this.
> > Thanks!
> >
> > On Wed, May 13, 2020 at 10:34 PM Sanjeev Kulkarni <sanjee...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > The current interfaces for sources in Pulsar IO are geared towards
> > > streaming sources where data is available on a continuous basis. There
> > > exist a whole bunch of data sources where data is not available on a
> > > continuous/streaming fashion, but rather arrives periodically/in
> spurts.
> > > These set of 'Batch Sources' have a set of common characteristics that
> > > might warrant framework level support in Pulsar IO.
> > >
> > > Jerry and I have jotted down the ideas around this in PIP-65.
> Please
> > > review it and let us know what you think.
> > >
> > >
> > >
> >
> https://github.com/apache/pulsar/wiki/PIP-65:-Adapting-Pulsar-IO-Sources-to-support-Batch-Sources
> > >
> > > Thanks!
> > >
> >
>