Hello NiFi developers! I'm new to NiFi and decided to create a *DetectDuplicateRecord* processor. Mike Thomsen also created an implementation at about the same time. It was suggested we open this up for discussion with the community to identify use cases.
Below are the two implementations, each with their respective properties.

- https://issues.apache.org/jira/browse/NIFI-6014
  - *Record Reader*
  - *Record Writer*
  - *Cache Service*
  - *Lookup Record Path:* The record path operation used to generate the lookup key for each record.
  - *Cache Value Strategy:* Determines what is written to the cache for each record: either a literal value or the result of a record path operation.
  - *Cache Value:* The value written to the cache under the record's key if it does not already exist.
  - *Don't Send Empty Record Sets:* Same as "Include Zero Record FlowFiles" below.

- https://issues.apache.org/jira/browse/NIFI-6047
  - *Record Reader*
  - *Record Writer*
  - *Include Zero Record FlowFiles*
  - *Cache The Entry Identifier:* Similar to DetectDuplicate.
  - *Distributed Cache Service:* Similar to DetectDuplicate.
  - *Age Off Duration:* Similar to DetectDuplicate.
  - *Record Hashing Algorithm:* The algorithm used to hash the combined result of the RecordPath values before writing to the cache.
  - *Filter Type:* The filter used to decide whether a record has been seen before, based on the RecordPath criteria defined by user-defined properties. Current options are *HashSet* and *BloomFilter*.
  - *Filter Capacity Hint:* An estimate of the total number of unique records to be processed.
  - *BloomFilter Probability:* The desired false-positive probability when using the BloomFilter filter type.
  - *<User Defined Properties>:* The name of each property is a record path. All record paths are resolved against each record to determine that record's unique value; the property's value is ignored. My initial thought, though, was to have the value expose field variables the way UpdateRecord does (i.e. ${field.value}). A sketch of the key derivation I have in mind follows at the end of this mail.

There are many ways duplicate records could be detected. Offering the user the ability to:

- *Specify the cache identifier* means users can share the same identifier across DetectDuplicateRecord processors in different process groups. Conversely, specifying a unique name based on, say, the file name isolates the uniqueness check to just the daily load of that specific file.
- *Set a cache expiration* lets users keep entries for, say, 24 hours, so uniqueness state only carries from one day's load to the next. This is useful when you are doing a daily file load and only want to process the new or changed records (a sketch of the age-off check follows below).
- *Select a filter type* lets you optimize for memory usage. I need to process multi-GB files, and keeping a hash of every record in an in-memory HashSet gets expensive. A BloomFilter is an acceptable alternative, especially when you are doing database operations downstream and can tolerate some false positives; it still greatly reduces the number of attempted duplicate inserts/updates (see the filter sketch below).
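To make the *<User Defined Properties>* and *Record Hashing Algorithm* combination concrete, here is roughly how I picture deriving a record's cache key. This is only a sketch against NiFi's RecordPath API, not the code attached to either JIRA; the RecordKeys class, the method name, and the joining scheme are all just illustrative.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.nifi.record.path.RecordPath;
import org.apache.nifi.serialization.record.Record;

class RecordKeys {

    // Resolve every user-defined RecordPath against the record, join the
    // selected field values, then hash the joined string with the
    // configured algorithm (e.g. "SHA-256").
    static String cacheKeyFor(Record record, List<RecordPath> paths, String algorithm)
            throws NoSuchAlgorithmException {
        final String joined = paths.stream()
                .map(path -> path.evaluate(record).getSelectedFields()
                        .map(fieldValue -> String.valueOf(fieldValue.getValue()))
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("|"));

        final MessageDigest digest = MessageDigest.getInstance(algorithm);
        final byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));

        final StringBuilder hex = new StringBuilder(hash.length * 2);
        for (final byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

So with user-defined properties /firstName and /lastName, two records that agree on both fields produce the same key and the second is flagged as a duplicate, while records differing in either field produce distinct keys.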
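For the three properties marked "Similar to DetectDuplicate", the age-off behavior I have in mind looks something like the sketch below. It codes only against the DistributedMapCacheClient interface; the AgedCacheCheck class and the bare-timestamp cache value are invented for illustration, and the real DetectDuplicate processor's bookkeeping may differ.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.nifi.distributed.cache.client.Deserializer;
import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
import org.apache.nifi.distributed.cache.client.Serializer;

class AgedCacheCheck {

    private static final Serializer<String> STRING_SERIALIZER =
            (value, out) -> out.write(value.getBytes(StandardCharsets.UTF_8));
    private static final Serializer<Long> LONG_SERIALIZER =
            (value, out) -> out.write(Long.toString(value).getBytes(StandardCharsets.UTF_8));
    private static final Deserializer<Long> LONG_DESERIALIZER =
            bytes -> (bytes == null || bytes.length == 0)
                    ? null : Long.parseLong(new String(bytes, StandardCharsets.UTF_8));

    private final DistributedMapCacheClient cache;
    private final long ageOffMillis;

    AgedCacheCheck(DistributedMapCacheClient cache, long ageOffMillis) {
        this.cache = cache;
        this.ageOffMillis = ageOffMillis;
    }

    // A key counts as a duplicate only if it was first seen within the
    // age-off window; older entries are refreshed and treated as new.
    boolean isDuplicate(String key) throws IOException {
        final long now = System.currentTimeMillis();
        final Long firstSeen = cache.getAndPutIfAbsent(
                key, now, STRING_SERIALIZER, LONG_SERIALIZER, LONG_DESERIALIZER);
        if (firstSeen == null) {
            return false; // brand new entry; its timestamp was just stored
        }
        if (now - firstSeen > ageOffMillis) {
            cache.put(key, now, STRING_SERIALIZER, LONG_SERIALIZER); // refresh stale entry
            return false; // aged off: treat as unseen
        }
        return true;
    }
}

With an Age Off Duration of around 24 hours, entries written during yesterday's load are still in the cache when today's file arrives, so unchanged records are filtered out and only new or changed records pass through.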
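Lastly, to illustrate the *Filter Type*, *Filter Capacity Hint*, and *BloomFilter Probability* trio, here is a minimal sketch of the two strategies. I'm assuming Guava's BloomFilter purely for illustration (the actual implementation may or may not use it), and the DuplicateFilter interface and both class names are made up for this example.

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// HashSet: exact answers, but every key stays on the heap.
// BloomFilter: a fixed, much smaller bit array, at the cost of occasional
// false positives (unique records wrongly flagged as duplicates).
interface DuplicateFilter {
    /** Returns true the first time a key is seen, false otherwise. */
    boolean add(String key);
}

class ExactFilter implements DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    public boolean add(String key) {
        return seen.add(key);
    }
}

class ProbabilisticFilter implements DuplicateFilter {
    private final BloomFilter<String> filter;

    ProbabilisticFilter(long capacityHint, double falsePositiveProbability) {
        this.filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                capacityHint, falsePositiveProbability);
    }

    public boolean add(String key) {
        // put() returns true only if the bits changed, i.e. the key was
        // definitely not seen before; false may be a false positive.
        return filter.put(key);
    }
}

The memory argument in rough numbers: at a 1% false-positive probability a Bloom filter needs on the order of 10 bits per expected entry, so a Filter Capacity Hint of 100 million records costs roughly 120 MB, where a HashSet would have to hold all 100 million key strings on the heap.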
Here's to hoping this finds you all warm and well. I love this software!

Adam