Re: Seeking Feedback - New ListS3 Processor Contribution??

Aram Openden Wed, 09 Oct 2019 15:57:48 -0700

Joe,

FYI: I have created this new JIRA feature story
<https://issues.apache.org/jira/browse/NIFI-6762> to track my suggested
contribution of this new Processor. I mostly took your advice, and wrote a
new Processor that would "generate a WatchS3 that takes multiple buckets
specified on input or allows them to arrive dynamically in the form of flow
file attributes". It is a new Processor class that is somewhat based on,
but substantially modifies the inner logic of, ListS3 for a WatchS3
(Multiple buckets) function.
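
To illustrate the state change, here is a minimal Java sketch of the bucket-qualified state value described in the original proposal quoted below (bucketName + delimiter + filename). The class name BucketQualifiedKeys, the '|' delimiter, and both method names are assumptions made up for this example; none of them come from NiFi or from the actual processor code. The fallback for entries without a delimiter reflects Joe's point that old-style state should resolve to the processor's own bucket.

```java
/** Hypothetical helper (not part of NiFi): builds and parses bucket-qualified state values. */
final class BucketQualifiedKeys {

    // Assumed delimiter. S3 bucket names cannot contain '|', so in a
    // bucket-qualified value the first '|' always ends the bucket name.
    static final char DELIMITER = '|';

    private BucketQualifiedKeys() { }

    /** Joins bucket and object key into a single state value. */
    public static String compose(String bucket, String objectKey) {
        return bucket + DELIMITER + objectKey;
    }

    /**
     * Splits a state value into {bucket, objectKey}.
     * A value with no delimiter is treated as an old-style entry and is
     * attributed to defaultBucket (the bucket configured on the processor).
     * Caveat: an old-style object key that itself contains '|' would be
     * misread, so a real migration would need an explicit format marker.
     */
    public static String[] parse(String stateValue, String defaultBucket) {
        int idx = stateValue.indexOf(DELIMITER);
        if (idx < 0) {
            return new String[] { defaultBucket, stateValue };
        }
        return new String[] { stateValue.substring(0, idx), stateValue.substring(idx + 1) };
    }

    public static void main(String[] args) {
        String value = compose("bucket-a", "data/file1.csv");
        String[] parts = parse(value, "configured-bucket");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Splitting on the first delimiter rather than the last keeps object keys that contain '|' intact for new-style entries, since only the bucket name is guaranteed to be free of that character.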


Since I have most of this work done already, at least enough to make it
useful for watching multiple buckets, I am offering for the team to assign
this Story (NIFI-6762) to me if you folks decide you are interested in the
Processor.

One piece of help I could use, if you or someone on the team gets the
chance... I am (oddly) struggling with my clean env local maven build of
the full NiFi project. This was before even trying to change any code (or
even cut a branch) on the master branch of the NiFi project.

Please see my message to the dev mailing list with subject *Maven Build
Error - nifi-properties-loader sub-project test failures*. If you have any
suggestions about how I can get past or work around this issue please let
me know.

Thanks.

Aram S. Openden
Senior Software Engineer
Cleveland, OH
[email protected]
330-283-3030 (mobile)


On Tue, Sep 17, 2019 at 4:01 PM Joe Witt <[email protected]> wrote:

> Hello
>
> This is certainly an important feature space and a commonly used
> processor/pattern (ListS3) and in that spirit it would be good to
> collaborate, contribute, etc..
>
> We've seen similar asks come up for other List* processors as well.  The
> general idea is that some source data will trigger the desire to scan a
> specified thing going forward for updates/listings.
>
> The purpose of the List* processors is 'once told to look at a thing' it
> should generate a listing of what is there AND it should continue to
> look at the thing to see what changes/shows up later.  Arguably we should
> have called that Watch*.
>
> If your use case is 'once told to look at a thing' it should generate a
> listing of what is there at that time and not worry about later updates
> happening because you can always redo a listing again later.  The
> difference there is subtle but important.  And for this case/desire the
> name 'List*' makes a lot of sense and may be why we've seen folks want it
> to work this way rather than how it does.
>
> For the way the List* processors are generally designed the desire to have
> flowfile specific indicators of which paths to look at doesn't really align
> to how it is designed.  The problem is hard because there is an inherently
> growing list of things to monitor and often a growing list of contents to
> monitor.  This makes the approach and understanding of what is happening
> really difficult.
>
> You mention wanting to alter how state is managed and your reasoning makes
> sense.  You need additional qualifiers now beyond just the name of the
> thing such as the bucket in which the thing exists.  Any state handling
> changes for ListS3 should honor the old model where if the bucket isn't
> specified it would come from the bucket of the ListS3 processor itself.
> This isn't a deal breaker but it is important to consider.
>
> Two alternatives in general worth exploring
> 1) Using SQS as a notifier of when new items show up in S3 is a good
> model.  It would be something like our GetSQS feeding into FetchS3, etc..
> This may be a better/more scalable approach in general.
> 2) The common case is there is a discrete set of buckets that will be
> listed.  You could have a flow which has a ListS3 for each such bucket.
> Ahead of these ListS3 processors then could be a RouteOnAttribute processor
> which looks at a specific bucket name in a ff attribute and routes to the
> proper ListS3 processor.
>
> There may be details I'm overlooking here or I may have misunderstood
> portions of your note.  But the above concerns in general are important to
> keep in mind for the List/Fetch processors and their intended usage and
> design.
>
> If you started from scratch and didn't worry about ListS3 you could
> generate a WatchS3 that takes multiple buckets specified on input or allows
> them to arrive dynamically in the form of flow file attributes.  You'd want
> to document the maximum number of buckets it will monitor at once, maximum
> amount of state/storage it can have to track the various listings in
> various buckets, and so on.
>
> I would avoid the prefix property concept.  In general you want to avoid
> binding a property value to the life of a processor property.  There are
> proper lifecycle hooks to handle those changes/cases.
>
> Thanks
> Joe
>
> On Tue, Sep 17, 2019 at 3:06 PM Aram Openden <[email protected]>
> wrote:
>
> > X-Posted to NiFi Users Mailing List
> > <http://apache-nifi-users-list.2361937.n4.nabble.com/>.
> >
> > The team I work in is doing a good deal of work with NiFi S3 Processors
> > amongst others and writing some of our own custom processors. Our team has
> > a similar use-case requirement for a variation on the ListS3 Processor as
> > Martijn Dekkers in this post here
> > <http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-tp5777p5850.html>.
> >
> >
> > For context, the reader may wish to refer to this entire thread from the
> > beginning
> > <http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-td5777.html#a5850>.
> >
> > In our case we would like the processor to allow for incoming FlowFiles and
> > be able to change the S3 bucket it "listens to" by making the s3.bucket
> > attribute modifiable using the NiFi expression language while continuing to
> > maintain the internal state of the Processor. We would simultaneously
> > prevent the prefix property from being updated, making it a fixed value for
> > as long as the Processor is running.  In other words, we want a
> > WatchMultipleS3Buckets Processor that maintains state for multiple Buckets.
> >
> > To make this work requires a change in the state management behavior of the
> > processor. The currentKeys field is a Set that holds the collection of
> > unique "keys" (filenames) that correspond to each of the StateMap's file
> > entries that it is tracking. Each key is the S3 Object's associated
> > "*filename*".
> >
> > In practice this means that our new processor would modify the state store
> > logic. Currently, the value for each entry in the StateMap is simply the
> > filename of the S3 object. Our suggested change in the StateMap's HashMap
> > would have this value now be the *bucketName + some delimiter +
> > filename* of the S3 Object.
> >
> > Our team is working on our variation of this *WatchMultipleS3Buckets*. We
> > would like to offer to contribute back this effort as follows. Since there
> > will be a great deal of common code between the current ListS3 Processor
> > and our newly proposed WatchMultipleS3Buckets Processor, I propose a
> > refactoring to create a new Abstract class: *AbstractS3WatchProcessor* with
> > the existing ListS3 and the newly created WatchMultipleS3Buckets as
> > subclasses of this new AbstractS3WatchProcessor.
> >
> > Is this additional Processor & modification something the community would
> > be interested in? We are asking because we want to know if this is a
> > direction that the community would like to go in with the existing ListS3
> > processor. We will be happy to do the work to contribute this Processor
> > variation back to the project, but would prefer not to put the extra work
> > in to contributing *if that is not the desired direction* by the NiFi
> > Maintainers.
> >
> > If yes, the new Processor contribution is desired, should I simply go ahead
> > and add a new item to the NiFi project JIRA here
> > <https://issues.apache.org/jira/projects/NIFI> and then follow section 6
> > <https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-providingCodeOrDocumentationContributionProvidingcodeordocumentationcontributions>
> > (Providing code or documentation contributions) of the Contributor Guide?
> >
> > Thanks
> >
> > --aramcodez
> >
>
