Joe, FYI: I have created this new JIRA feature story <https://issues.apache.org/jira/browse/NIFI-6762> to track my suggested contribution of this new Processor. I mostly took your advice, and wrote a new Processor that would "generate a WatchS3 that takes multiple buckets specified on input or allows them to arrive dynamically in the form of flow file attributes". It is a new Processor class that is somewhat based on, but substantially modifies the inner logic of, ListS3 for a WatchS3 (Multiple buckets) function.
Since I have most of this work done already, at least enough to make it useful for watching multiple buckets, I am offering for the team to assign this Story (NIFI-6762) to me if you folks decide you are interested in the Processor.

One piece of help I could use, if you or someone on the team gets the chance: I am (oddly) struggling with a clean-environment local Maven build of the full NiFi project. This was before even trying to change any code (or even cut a branch) on the master branch of the NiFi project. Please see my message to the dev mailing list with the subject *Maven Build Error - nifi-properties-loader sub-project test failures*. If you have any suggestions about how I can get past or work around this issue, please let me know.

Thanks.

Aram S. Openden
Senior Software Engineer
Cleveland, OH
[email protected]
330-283-3030 (mobile)

On Tue, Sep 17, 2019 at 4:01 PM Joe Witt <[email protected]> wrote:

> Hello
>
> This is certainly an important feature space and a commonly used
> processor/pattern (ListS3) and in that spirit it would be good to
> collaborate, contribute, etc.
>
> We've seen similar asks come up for other List* processors as well. The
> general idea is that some source data will trigger the desire to scan a
> specified thing going forward for updates/listings.
>
> The purpose of the List* processors is 'once told to look at a thing' it
> should generate a listing of what is there AND it should continue to
> look at the thing to see what changes/shows up later. Arguably we should
> have called that Watch*.
>
> If your use case is 'once told to look at a thing' it should generate a
> listing of what is there at that time and not worry about later updates
> happening because you can always redo a listing again later. The
> difference there is subtle but important. And for this case/desire the
> name 'List*' makes a lot of sense and may be why we've seen folks want it
> to work this way rather than how it does.
>
> For the way the List* processors are generally designed, the desire to have
> flowfile-specific indicators of which paths to look at doesn't really align
> to how it is designed. The problem is hard because there is an inherently
> growing list of things to monitor and often a growing list of contents to
> monitor. This makes the approach and understanding of what is happening
> really difficult.
>
> You mention wanting to alter how state is managed and your reasoning makes
> sense. You need additional qualifiers now beyond just the name of the
> thing, such as the bucket in which the thing exists. Any state handling
> changes for ListS3 should honor the old model where if the bucket isn't
> specified it would come from the bucket of the ListS3 processor itself.
> This isn't a deal breaker but it is important to consider.
>
> Two alternatives in general worth exploring:
> 1) Using SQS as a notifier of when new items show up in S3 is a good
> model. It would be something like our GetSQS feeding into FetchS3, etc.
> This may be a better/more scalable approach in general.
> 2) The common case is there is a discrete set of buckets that will be
> listed. You could have a flow which has a ListS3 for each such bucket.
> Ahead of these ListS3 processors there could be a RouteOnAttribute processor
> which looks at a specific bucket name in a ff attribute and routes to the
> proper ListS3 processor.
>
> There may be details I'm overlooking here or I may have misunderstood
> portions of your note. But the above concerns in general are important to
> keep in mind for the List/Fetch processors and their intended usage and
> design.
>
> If you started from scratch and didn't worry about ListS3 you could
> generate a WatchS3 that takes multiple buckets specified on input or allows
> them to arrive dynamically in the form of flow file attributes.
> You'd want to document the maximum number of buckets it will monitor at
> once, maximum amount of state/storage it can have to track the various
> listings in various buckets, and so on.
>
> I would avoid the prefix property concept. In general you want to avoid
> binding a property value to the life of a processor property. There are
> proper lifecycle hooks to handle those changes/cases.
>
> Thanks
> Joe
>
> On Tue, Sep 17, 2019 at 3:06 PM Aram Openden <[email protected]>
> wrote:
>
> > X-Posted to NiFi Users Mailing List
> > <http://apache-nifi-users-list.2361937.n4.nabble.com/>.
> >
> > The team I work in is doing a good deal of work with NiFi S3 Processors
> > amongst others and writing some of our own custom processors. Our team
> > has a similar use-case requirement for a variation on the ListS3
> > Processor as Martijn Dekkers in this post here
> > <http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-tp5777p5850.html>.
> >
> > For context, the reader may wish to refer to this entire thread from the
> > beginning
> > <http://apache-nifi-users-list.2361937.n4.nabble.com/Listing-S3-td5777.html#a5850>.
> >
> > In our case we would like the processor to allow for incoming FlowFiles
> > and be able to change the S3 bucket it "listens to" by making the
> > s3.bucket attribute modifiable using the NiFi expression language while
> > continuing to maintain the internal state of the Processor. We would
> > simultaneously restrict the prefix property from being updated, making
> > it a fixed value for the entire lifetime of the Processor's running. In
> > other words, we want a WatchMultipleS3Buckets Processor that maintains
> > state for multiple Buckets.
> >
> > To make this work requires a change in the state management behavior of
> > the processor. The currentKeys field is a Set that holds the collection
> > of unique "keys" (filenames) that correspond to each of the StateMap's
> > file entries that it is tracking.
> > Each key is the S3 Object's associated "*filename*".
> >
> > In practice this means that our new processor would modify the state
> > store logic. Currently, the value for each entry in the StateMap is
> > simply the filename of the S3 object. Our suggested change in the
> > StateMap's HashMap would have this value now be *bucketName + some
> > delimiter + filename* of the S3 Object.
> >
> > Our team is working on our variation of this *WatchMultipleS3Buckets*.
> > We would like to offer to contribute back this effort as follows. Since
> > there will be a great deal of common code between the current ListS3
> > Processor and our newly proposed WatchMultipleS3Buckets Processor, I
> > propose a refactoring to create a new Abstract class:
> > *AbstractS3WatchProcessor*, with the existing ListS3 and the newly
> > created WatchMultipleS3Buckets as subclasses of this new
> > AbstractS3WatchProcessor.
> >
> > Is this additional Processor & modification something the community
> > would be interested in? We are asking because we want to know if this
> > is a direction that the community would like to go in with the existing
> > ListS3 processor. We will be happy to do the work to contribute this
> > Processor variation back to the project, but would prefer not to put the
> > extra work in to contributing *if that is not the desired direction* by
> > the NiFi Maintainers.
> >
> > If yes, the new Processor contribution is desired, should I simply go
> > ahead and add a new item to the NiFi project JIRA here
> > <https://issues.apache.org/jira/projects/NIFI> and then follow section 6
> > <https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-providingCodeOrDocumentationContributionProvidingcodeordocumentationcontributions>
> > (Providing code or documentation contributions) of the Contributor
> > Guide?
> >
> > Thanks
> >
> > --aramcodez
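
For anyone following the thread, here is a rough sketch of the state value format proposed in the original message (bucketName + some delimiter + filename), combined with Joe's point that entries without a bucket should fall back to the processor's own bucket. It is only an illustration under stated assumptions, not the code from NIFI-6762: the class name `S3StateKey`, the `|` delimiter, and the `parse`/`encode` method names are all hypothetical, and it assumes bucket names never contain the delimiter.

```java
import java.util.Objects;

/**
 * Hypothetical helper (not part of NiFi): encodes a state value as
 * bucketName + delimiter + filename, and decodes it again.
 */
public final class S3StateKey {
    // Assumed delimiter; chosen here on the assumption that it never
    // appears in a bucket name.
    static final char DELIMITER = '|';

    private final String bucket;
    private final String objectKey;

    public S3StateKey(String bucket, String objectKey) {
        this.bucket = Objects.requireNonNull(bucket, "bucket");
        this.objectKey = Objects.requireNonNull(objectKey, "objectKey");
        if (bucket.indexOf(DELIMITER) >= 0) {
            throw new IllegalArgumentException("bucket must not contain '" + DELIMITER + "'");
        }
    }

    public String bucket() {
        return bucket;
    }

    public String objectKey() {
        return objectKey;
    }

    /** Builds the value to store, e.g. "bucket-a|logs/app.log". */
    public String encode() {
        return bucket + DELIMITER + objectKey;
    }

    /**
     * Decodes a stored value. Splits on the FIRST delimiter only, so the
     * object key itself may contain the delimiter. A value with no delimiter
     * is treated as an old-style entry (filename only) and is assigned to
     * defaultBucket, i.e. the bucket configured on the processor itself.
     * Caveat: an old-style filename that happens to contain the delimiter
     * would be misread as a new-style entry.
     */
    public static S3StateKey parse(String stored, String defaultBucket) {
        int idx = stored.indexOf(DELIMITER);
        if (idx < 0) {
            return new S3StateKey(defaultBucket, stored);
        }
        return new S3StateKey(stored.substring(0, idx), stored.substring(idx + 1));
    }
}
```

With this shape, the same filename in two different buckets produces two distinct state values, which is the property the multi-bucket processor needs; whether a single delimiter character is safe against existing state is something the real implementation would have to decide.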
