Hi Devin,

The complexity here is orchestrating multiple functions that notionally form the same connector. Stopping a batch connector, for example, would be equivalent to stopping two functions, with the corresponding complexity of dealing with a failure in between the two calls. The same goes for the other lifecycle calls.
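To make the trade-off concrete, here is a minimal sketch of the two-function composition discussed further down in the thread: a one-instance "discovery" function that publishes tasks to an intermediate topic, and "execution" instances that consume and process those tasks. This is plain Java with the topic modeled as an in-memory queue; the class and method names are hypothetical and not part of any Pulsar API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: in Pulsar the "task topic" would be the discovery
// function's output topic; here it is just an in-memory queue.
public class TwoPhaseBatchSourceSketch {

    // Phase 1: a one-instance "discovery" function enumerates the work
    // (e.g. files that appeared since the last run), one task per item.
    static List<String> discoverTasks() {
        return List.of("task-file-a", "task-file-b", "task-file-c");
    }

    // Phase 2: an "execution" function consumes one task and processes the
    // records it points at; here it just tags the task as processed.
    static String executeTask(String task) {
        return "processed:" + task;
    }

    public static void main(String[] args) {
        // Discovery publishes its tasks to the intermediate topic.
        Queue<String> taskTopic = new ArrayDeque<>(discoverTasks());

        // Execution instances drain the topic independently of discovery.
        List<String> results = new ArrayList<>();
        while (!taskTopic.isEmpty()) {
            results.add(executeTask(taskTopic.poll()));
        }
        System.out.println(results);
        // prints [processed:task-file-a, processed:task-file-b, processed:task-file-c]
    }
}
```

Note what the sketch deliberately elides: the two functions must be submitted, stopped, and updated together as one connector, and a failure between those two operations leaves the pair in an inconsistent state. That atomicity gap is exactly the management complexity described above.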
On Thu, May 21, 2020 at 7:41 PM Devin Bost <devin.b...@gmail.com> wrote:

> I apologize for not fully understanding the context here, but is the
> concern about using the existing function architecture the complexity of
> needing two sequential operations in a function flow to be synchronous
> with respect to transactions, such as to avoid race conditions and
> issues with parallelism that could result from them being
> transactionally independent of one another?
>
> Perhaps there's a larger use case here that could be represented with a
> pattern.
>
> --
> Devin G. Bost
>
> On Wed, May 20, 2020, 7:19 PM Sijie Guo <guosi...@gmail.com> wrote:
>
> > Hi Jerry,
> >
> > I understand the concerns. I think it falls into a broader discussion
> > of function composition.
> >
> > I am fine with the current proposal. But I wish that we don't
> > introduce a lot of specialized code in the runtime just to handle this
> > use case. It would be better if we could reuse the existing function
> > framework, because it will be easier to implement such
> > multiple-function functionality with a more general function
> > composition approach.
> >
> > - Sijie
> >
> > On Wed, May 20, 2020 at 3:54 PM Jerry Peng
> > <jerry.boyang.p...@gmail.com> wrote:
> >
> > > Hi Sijie,
> > >
> > > We have considered a two-stage function as a way to implement a
> > > "batch" source; however, because there are two independent
> > > functions, it adds complexity to management, especially when there
> > > are failures. The two functions would need to be submitted and
> > > registered in an atomic fashion, which cannot be guaranteed in the
> > > current framework. Moreover, all CRUD operations on the "batch"
> > > source would also need to be atomic across the two functions. The
> > > other approach is to implement logic that runs two separate instance
> > > types for a single function. One type of function instance runs the
> > > "discovery" phase, and the other type reads the tasks produced by
> > > the discovery phase and executes them. That approach, though, is
> > > very similar to the proposed one.
> > >
> > > I think the most important thing right now is to make sure the
> > > interface for the batch source is appropriate, since it is hard to
> > > change in the future. How the execution works in the backend can
> > > always be modified/optimized later.
> > >
> > > On Wed, May 20, 2020 at 3:28 PM Sijie Guo <guosi...@gmail.com>
> > > wrote:
> > >
> > > > Hi Sanjeev,
> > > >
> > > > Just a couple of thoughts here. It seems to me that the
> > > > BatchSource API is a bit complicated, and that it could be
> > > > achieved using the existing functions framework:
> > > >
> > > > - BatchSourceTrigger: can be implemented as a one-instance
> > > > function that discovers the batch source tasks and publishes the
> > > > discovered tasks to its output topic.
> > > > - BatchSource: can be implemented as a function that receives the
> > > > batch source tasks and executes each source task.
> > > >
> > > > So it seems this can be achieved with the existing framework by
> > > > combining two functions. That would be a much clearer approach
> > > > and would keep the function & connector APIs relatively simple
> > > > and consistent. Thoughts?
> > > >
> > > > - Sijie
> > > >
> > > > On Wed, May 20, 2020 at 8:33 AM Sanjeev Kulkarni
> > > > <sanjee...@gmail.com> wrote:
> > > >
> > > > > Pinging the community about this. Would love feedback on this.
> > > > > Thanks!
> > > > >
> > > > > On Wed, May 13, 2020 at 10:34 PM Sanjeev Kulkarni
> > > > > <sanjee...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > The current interfaces for sources in Pulsar IO are geared
> > > > > > towards streaming sources, where data is available on a
> > > > > > continuous basis. There is a whole class of data sources
> > > > > > where data is not available in a continuous/streaming
> > > > > > fashion, but rather arrives periodically or in spurts. These
> > > > > > "batch sources" share a set of common characteristics that
> > > > > > might warrant framework-level support in Pulsar IO.
> > > > > >
> > > > > > Jerry and I have jotted down the ideas around this in PIP-65.
> > > > > > Please review it and let us know what you think.
> > > > > >
> > > > > > https://github.com/apache/pulsar/wiki/PIP-65:-Adapting-Pulsar-IO-Sources-to-support-Batch-Sources
> > > > > >
> > > > > > Thanks!