Re: Seeking Feedback - New ListS3 Processor Contribution??

Joe Witt Tue, 17 Sep 2019 13:01:32 -0700

Hello

This is certainly an important feature space and a commonly used
processor/pattern (ListS3) and in that spirit it would be good to
collaborate, contribute, etc.

We've seen similar asks come up for other List* processors as well.  The
general idea is that some source data will trigger the desire to scan a
specified thing going forward for updates/listings.

The purpose of the List* processors is 'once told to look at a thing' it
should generate a listing of what is there AND it should continue to
look at the thing to see what changes/shows up later.  Arguably we should
have called that Watch*.

If your use case is 'once told to look at a thing' it should generate a
listing of what is there at that time and not worry about later updates
happening because you can always redo a listing again later.  The
difference there is subtle but important.  And for this case/desire the
name 'List*' makes a lot of sense and may be why we've seen folks want it
to work this way rather than how it does.

For the way the List* processors are generally designed the desire to have
flowfile specific indicators of which paths to look at doesn't really align
to how it is designed.  The problem is hard because there is an inherently
growing list of things to monitor and often a growing list of contents to
monitor.  This makes the approach and understanding of what is happening
really difficult.

You mention wanting to alter how state is managed and your reasoning makes
sense.  You need additional qualifiers now beyond just the name of the
thing such as the bucket in which the thing exists.  Any state handling
changes for ListS3 should honor the old model where if the bucket isn't
specified it would come from the bucket of the ListS3 processor itself.
This isn't a deal breaker but it is important to consider.
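
A minimal sketch of that backward-compatible state handling, in Python for
brevity rather than as NiFi code. The state layout ("currentTimestamp" plus
"key-N" entries), the "stateVersion" marker, and the ":" delimiter are all
assumptions made for illustration and are not taken from the ListS3 source.

```python
from typing import Dict, Tuple

DELIMITER = ":"                # hypothetical; ":" is not valid in an S3 bucket name
VERSION_KEY = "stateVersion"   # hypothetical marker for the bucket-qualified layout
KEY_PREFIX = "key-"            # assumed name prefix of the per-object state entries


def upgrade_state(state: Dict[str, str], default_bucket: str) -> Dict[str, str]:
    """Qualify old-style entries with the processor's own bucket."""
    if state.get(VERSION_KEY) == "2":
        return dict(state)  # already bucket-qualified, nothing to do
    upgraded = {VERSION_KEY: "2"}
    for name, value in state.items():
        if name.startswith(KEY_PREFIX):
            # Old model: the value is only the object name, so the bucket
            # comes from the processor configuration.
            upgraded[name] = default_bucket + DELIMITER + value
        else:
            upgraded[name] = value
    return upgraded


def split_entry(entry: str) -> Tuple[str, str]:
    """Split a qualified entry at the first delimiter into (bucket, object key)."""
    bucket, _, object_key = entry.partition(DELIMITER)
    return bucket, object_key
```

Splitting on the first delimiter only keeps object names that themselves
contain ":" intact.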

Two alternatives are worth exploring in general:
1) Using SQS as a notifier of when new items show up in S3 is a good
model.  It would be something like our GetSQS feeding into FetchS3, etc.
This may be a better/more scalable approach in general.
2) The common case is there is a discrete set of buckets that will be
listed.  You could have a flow which has a ListS3 for each such bucket.
Ahead of these ListS3 processors there could then be a RouteOnAttribute
processor which looks at a specific bucket name in a FlowFile attribute and
routes to the proper ListS3 processor.
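
To make alternative 1 concrete, here is a rough Python sketch of the step
between receiving a message and fetching the object: pulling the bucket and
object key out of an S3 event notification body. It assumes the message body
is the standard S3 event notification JSON delivered directly to the queue;
the function name is hypothetical and this is not how any NiFi processor is
implemented.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def objects_from_notification(body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for objects created according to the message."""
    message = json.loads(body)
    found = []
    for record in message.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # skip deletions and other event types
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        found.append((bucket, key))
    return found
```

Each resulting pair maps naturally onto the bucket and object key that a
fetch step needs, so no listing state has to be kept at all.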

There may be details I'm overlooking here or I may have misunderstood
portions of your note.  But the above concerns in general are important to
keep in mind for the List/Fetch processors and their intended usage and
design.

If you started from scratch and didn't worry about ListS3 you could
generate a WatchS3 that takes multiple buckets specified on input or allows
them to arrive dynamically in the form of flow file attributes.  You'd want
to document the maximum number of buckets it will monitor at once, maximum
amount of state/storage it can have to track the various listings in
various buckets, and so on.
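
As a rough illustration of those limits, the following Python sketch tracks
listings for several buckets and refuses to grow past configured bounds. The
class name, the two limits, and the use of a last-modified timestamp per key
are assumptions for the example, not an existing NiFi design.

```python
from typing import Dict, List


class BucketWatcher:
    """Remembers what was listed per bucket and reports new or changed keys."""

    def __init__(self, max_buckets: int, max_keys_per_bucket: int) -> None:
        self.max_buckets = max_buckets
        self.max_keys_per_bucket = max_keys_per_bucket
        self._seen: Dict[str, Dict[str, int]] = {}

    def observe(self, bucket: str, listing: Dict[str, int]) -> List[str]:
        """Record a listing (key -> last-modified) and return the changed keys."""
        if bucket not in self._seen and len(self._seen) >= self.max_buckets:
            raise ValueError("too many buckets being watched")
        seen = self._seen.get(bucket, {})
        changed = sorted(k for k, ts in listing.items() if seen.get(k) != ts)
        merged = {**seen, **listing}
        if len(merged) > self.max_keys_per_bucket:
            raise ValueError("state limit exceeded for bucket " + bucket)
        self._seen[bucket] = merged
        return changed
```

Failing loudly when a limit is hit is only one option; a real processor might
instead route the offending request to a failure relationship.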

I would avoid the prefix property concept.  In general you want to avoid
binding a property value to the lifetime of a running processor.  There are
proper lifecycle hooks to handle those changes/cases.

Thanks
Joe

On Tue, Sep 17, 2019 at 3:06 PM Aram Openden <[email protected]> wrote:

> X-Posted to NiFi Users Mailing List
> <http://apache-nifi-users-list.2361937.n4.nabble.com/>.
>
> The team I work in is doing a good deal of work with NiFi S3 Processors
> amongst others and writing some of our own custom processors. Our team has
> a similar use-case requirement for a variation on the ListS3 Processor as
> Martijn Dekkers in this post here
> <
> http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-tp5777p5850.html
> >.
>
>
> For context, the reader may wish to refer to this entire thread from the
> beginning
> <
> http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-td5777.html#a5850
> >
> .
>
> In our case we would like the processor to allow for incoming FlowFiles and
> be able to change the S3 bucket it "listens to" by making the s3.bucket
> attribute modifiable using the NiFi expression language while continuing to
> maintain the internal state of the Processor. We would simultaneously
> restrict the prefix property from being updated, making it a fixed value for the
> entire lifetime of the Processor's running.  In other words, we want a
> WatchMultipleS3Buckets Processor that maintains state for multiple Buckets.
>
> To make this work requires a change in the state management behavior of the
> processor. The currentKeys field is a Set that holds the collection of
> unique "keys" (filenames) that correspond to the StateMap file entries
> that it is tracking. Each key is the S3 Object's associated "
> *filename*".
>
> In practice this means that our new processor would modify the state store
> logic. Currently, the value for each entry in the StateMap is simply the
> filename of the S3 object. Our suggested change in the StateMap's HashMap
> would have this value now be the *bucketName + some delimiter +
> filename* of the S3 Object.
>
> Our team is working on our variation of this *WatchMultipleS3Buckets*. We
> would like to offer to contribute back this effort as follows. Since there
> will be a great deal of common code between the current ListS3 Processor
> and our newly proposed WatchMultipleS3Buckets Processor, I propose a
> refactoring to create a new Abstract class: *AbstractS3WatchProcessor* with
> the existing ListS3 and the newly created WatchMultipleS3Buckets as
> subclasses of this new AbstractS3WatchProcessor.
>
> Is this additional Processor & modification something the community would
> be interested in? We are asking because we want to know if this is a
> direction that the community would like to go in with the existing ListS3
> processor. We will be happy to do the work to contribute this Processor
> variation back to the project, but would prefer not to put the extra work
> in to contributing *if that is not the desired direction* by the NiFi
> Maintainers.
>
> If yes, the new Processor contribution is desired, should I simply go ahead
> and add a new item to the NiFi project JIRA here
> <https://issues.apache.org/jira/projects/NIFI> and then follow section 6
> <
> https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-providingCodeOrDocumentationContributionProvidingcodeordocumentationcontributions
> >
> (Providing
> code or documentation contributions) of the Contributor Guide?
>
> Thanks
>
> --aramcodez
>
