Eric,

My point is that it sounds like you get handed an original document,
which is a JSON document containing up to a million elements.  You
would implement a record reader for your original doc structure, and
then you can use any of our current writers, etc.  But the important
part is avoiding creating splits unless/until it's truly necessary.
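
To make that concrete, here's a rough sketch of the kind of record
reader I mean.  It assumes the source document is a top-level JSON
array of objects; the class name and the schema handling are just
illustrative, not something we ship as-is:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.nifi.serialization.MalformedRecordException;
import org.apache.nifi.serialization.RecordReader;
import org.apache.nifi.serialization.record.MapRecord;
import org.apache.nifi.serialization.record.Record;
import org.apache.nifi.serialization.record.RecordSchema;

public class LargeJsonDocRecordReader implements RecordReader {

    private final JsonParser parser;
    private final ObjectMapper mapper = new ObjectMapper();
    private final RecordSchema schema;

    public LargeJsonDocRecordReader(final InputStream in,
            final RecordSchema schema) throws IOException {
        this.schema = schema;
        this.parser = new JsonFactory().createParser(in);
        if (parser.nextToken() != JsonToken.START_ARRAY) {
            throw new IOException("Expected a top-level JSON array");
        }
    }

    @Override
    @SuppressWarnings("unchecked")
    public Record nextRecord(final boolean coerceTypes,
            final boolean dropUnknownFields)
            throws IOException, MalformedRecordException {
        if (parser.nextToken() != JsonToken.START_OBJECT) {
            return null; // end of the array means end of the record stream
        }
        // Reads only the current element, never the whole million-element
        // document, so memory stays flat.
        final Map<String, Object> values = mapper.readValue(parser, Map.class);
        return new MapRecord(schema, values);
    }

    @Override
    public RecordSchema getSchema() {
        return schema;
    }

    @Override
    public void close() throws IOException {
        parser.close();
    }
}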

Anyway, if you go the route you're thinking of, I think you'll need
one session for reading (a single session for the entire source file)
and a different session for all the splits you'll create.  But I might
be overcomplicating that.  MarkP could give better input.
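
If you do go that route, here's a very rough sketch of the shape,
using AbstractSessionFactoryProcessor, which is how a processor gets
more than one session per trigger.  The relationship names are made
up for illustration:

import java.util.HashSet;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractSessionFactoryProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.ProcessSessionFactory;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class SplitAsYouGoProcessor extends AbstractSessionFactoryProcessor {

    static final Relationship REL_SPLITS = new Relationship.Builder()
            .name("splits").build();
    static final Relationship REL_ORIGINAL = new Relationship.Builder()
            .name("original").build();

    @Override
    public Set<Relationship> getRelationships() {
        final Set<Relationship> rels = new HashSet<>();
        rels.add(REL_SPLITS);
        rels.add(REL_ORIGINAL);
        return rels;
    }

    @Override
    public void onTrigger(final ProcessContext context,
            final ProcessSessionFactory factory) throws ProcessException {
        final ProcessSession readSession = factory.createSession();
        final FlowFile original = readSession.get();
        if (original == null) {
            return;
        }

        // A second session just for the splits, so each one can be
        // committed (and released downstream) as soon as it is created:
        final ProcessSession splitSession = factory.createSession();
        //   FlowFile split = splitSession.create();
        //   split = splitSession.write(split, out -> out.write(recordBytes));
        //   splitSession.transfer(split, REL_SPLITS);
        //   splitSession.commit();
        // ...repeated per record parsed out of `original`.

        // The read session stays open for the whole source file and is
        // committed only once, at the end.
        readSession.transfer(original, REL_ORIGINAL);
        readSession.commit();
    }
}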

Thanks

On Fri, Mar 5, 2021 at 2:18 PM Eric Secules <[email protected]> wrote:
>
> Hi Joe,
>
> For my use case, partial results are okay.
> The files may contain up to a million records, but we have about a day
> to process them.  We will consider record-based processing; it might be
> a longer task to convert our flows to consume records instead of single
> files.  Will I need multiple sessions to handle all this?
>
> Thanks,
> Eric
>
> On Fri, Mar 5, 2021 at 12:30 PM Joe Witt <[email protected]> wrote:
>
> > Eric
> >
> > The ProcessSession follows a unit-of-work pattern.  You can do a lot
> > of things, but until you commit the session it won't actually commit
> > the change(s).  So if you want the behavior you describe, call commit
> > after each transfer.  This is done automatically for you in most
> > cases, but you can call it yourself to control the boundary.  Just
> > remember you risk partial results then.  Consider that you're reading
> > an input file which contains, let's say, 100 records.  On record 51
> > there is a processing issue.  What happens then?  I'd also note that
> > this pattern generally results in poor performance.  Can you not use
> > the record readers/writers to accomplish this, so you avoid turning
> > the file into a bunch of tiny flowfiles?
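> >
> > Untested sketch of what I mean, in case it helps.  The relationship
> > name and the line-based record parsing are placeholders, and it
> > buffers the records in memory just to keep the example short, which
> > you would not want for a huge file:
> >
> > import java.io.BufferedReader;
> > import java.io.IOException;
> > import java.io.InputStream;
> > import java.io.InputStreamReader;
> > import java.nio.charset.StandardCharsets;
> > import java.util.ArrayList;
> > import java.util.Collections;
> > import java.util.List;
> > import java.util.Set;
> >
> > import org.apache.nifi.flowfile.FlowFile;
> > import org.apache.nifi.processor.AbstractProcessor;
> > import org.apache.nifi.processor.ProcessContext;
> > import org.apache.nifi.processor.ProcessSession;
> > import org.apache.nifi.processor.Relationship;
> > import org.apache.nifi.processor.exception.ProcessException;
> >
> > public class EmitPerRecordProcessor extends AbstractProcessor {
> >
> >     static final Relationship REL_SPLITS = new Relationship.Builder()
> >             .name("splits").build();
> >
> >     @Override
> >     public Set<Relationship> getRelationships() {
> >         return Collections.singleton(REL_SPLITS);
> >     }
> >
> >     @Override
> >     public void onTrigger(final ProcessContext context,
> >             final ProcessSession session) throws ProcessException {
> >         final FlowFile original = session.get();
> >         if (original == null) {
> >             return;
> >         }
> >
> >         final List<byte[]> records = new ArrayList<>();
> >         session.read(original, in -> records.addAll(parseRecords(in)));
> >         // The source has to be accounted for before the first commit.
> >         session.remove(original);
> >
> >         for (final byte[] record : records) {
> >             FlowFile split = session.create();
> >             split = session.write(split, out -> out.write(record));
> >             session.transfer(split, REL_SPLITS);
> >             // Each split becomes visible downstream immediately, but a
> >             // failure on record 51 leaves the first 50 committed.
> >             session.commit();
> >         }
> >     }
> >
> >     // Placeholder parser: treats each line of the file as one record.
> >     private List<byte[]> parseRecords(final InputStream in)
> >             throws IOException {
> >         final List<byte[]> result = new ArrayList<>();
> >         final BufferedReader reader = new BufferedReader(
> >                 new InputStreamReader(in, StandardCharsets.UTF_8));
> >         String line;
> >         while ((line = reader.readLine()) != null) {
> >             result.add(line.getBytes(StandardCharsets.UTF_8));
> >         }
> >         return result;
> >     }
> > }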
> >
> > Thanks
> >
> > On Fri, Mar 5, 2021 at 1:19 PM Eric Secules <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I am trying to write a processor that parses an input file and emits
> > > one JSON flowfile for each record in the input file.  Currently we're
> > > calling session.transfer() as soon as we encounter a fragment we want
> > > to emit, but it's not sending the new flowfiles to the next processor
> > > as it processes the input flowfile.  Instead it's holding everything
> > > until the input is fully processed and releasing it all at once.  Is
> > > there some way I can write the processor to emit flowfiles as soon as
> > > possible rather than waiting for everything to succeed?
> > >
> > > Thanks,
> > > Eric
> >
