Hello NiFi developers! I'm new to NiFi and decided to create a *DetectDuplicateRecord* processor. Mike Thomsen also created an implementation at about the same time. It was suggested we open this up for discussion with the community to identify use cases.
Below are the two implementations, each with their respective properties.

- https://issues.apache.org/jira/browse/NIFI-6014
  - *Record Reader*
  - *Record Writer*
  - *Cache Service*
  - *Lookup Record Path:* The record path operation used to generate the lookup key for each record.
  - *Cache Value Strategy:* Determines what is written to the cache for each record: either a literal value or the result of a record path operation.
  - *Cache Value:* The value written to the cache under the record's key if it does not already exist.
  - *Don't Send Empty Record Sets:* Same as "Include Zero Record FlowFiles" below.

- https://issues.apache.org/jira/browse/NIFI-6047
  - *Record Reader*
  - *Record Writer*
  - *Include Zero Record FlowFiles*
  - *Cache The Entry Identifier:* Similar to DetectDuplicate.
  - *Distributed Cache Service:* Similar to DetectDuplicate.
  - *Age Off Duration:* Similar to DetectDuplicate.
  - *Record Hashing Algorithm:* The algorithm used to hash the combined result of the RecordPath values before writing to the cache.
  - *Filter Type:* The filter used to decide whether a record has been seen before, based on the RecordPath criteria defined by user-defined properties. Current options are *HashSet* and *BloomFilter*.
  - *Filter Capacity Hint:* An estimate of the total number of unique records to be processed.
  - *BloomFilter Probability:* The desired false-positive probability when using the BloomFilter filter type.
  - *<User Defined Properties>:* The name of each property is a record path. All record paths are resolved against each record to determine that record's unique value; the property's value is ignored. My initial thought, though, was to have the value expose field variables the way UpdateRecord does (i.e. ${field.value}). A sketch of the key derivation I have in mind follows at the end of this mail.

There are many ways duplicate records could be detected. Offering the user the ability to:

- *Specify the cache identifier* means users can share the same identifier across DetectDuplicateRecord processors in different process groups. Conversely, specifying a unique name based on, say, the file name isolates the uniqueness check to just the daily load of that specific file.
- *Set a cache expiration* lets users keep entries for, say, 24 hours, so uniqueness state only carries from one day's load to the next. This is useful when you are doing a daily file load and only want to process the new or changed records (a sketch of the age-off check follows below).
- *Select a filter type* lets you optimize for memory usage. I need to process multi-GB files, and keeping a hash of every record in an in-memory HashSet gets expensive. A BloomFilter is an acceptable alternative, especially when you are doing database operations downstream and can tolerate some false positives; it still greatly reduces the number of attempted duplicate inserts/updates (see the filter sketch below).
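To make the *<User Defined Properties>* and *Record Hashing Algorithm* combination concrete, here is roughly how I picture deriving a record's cache key. This is only a sketch against NiFi's RecordPath API, not the code attached to either JIRA; the RecordKeys class, the method name, and the joining scheme are all just illustrative.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.nifi.record.path.RecordPath;
import org.apache.nifi.serialization.record.Record;

class RecordKeys {

    // Resolve every user-defined RecordPath against the record, join the
    // selected field values, then hash the joined string with the
    // configured algorithm (e.g. "SHA-256").
    static String cacheKeyFor(Record record, List<RecordPath> paths, String algorithm)
            throws NoSuchAlgorithmException {
        final String joined = paths.stream()
                .map(path -> path.evaluate(record).getSelectedFields()
                        .map(fieldValue -> String.valueOf(fieldValue.getValue()))
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("|"));

        final MessageDigest digest = MessageDigest.getInstance(algorithm);
        final byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));

        final StringBuilder hex = new StringBuilder(hash.length * 2);
        for (final byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}

So with user-defined properties /firstName and /lastName, two records that agree on both fields produce the same key and the second is flagged as a duplicate, while records differing in either field produce distinct keys.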
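For the three properties marked "Similar to DetectDuplicate", the age-off behavior I have in mind looks something like the sketch below. It codes only against the DistributedMapCacheClient interface; the AgedCacheCheck class and the bare-timestamp cache value are invented for illustration, and the real DetectDuplicate processor's bookkeeping may differ.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.nifi.distributed.cache.client.Deserializer;
import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
import org.apache.nifi.distributed.cache.client.Serializer;

class AgedCacheCheck {

    private static final Serializer<String> STRING_SERIALIZER =
            (value, out) -> out.write(value.getBytes(StandardCharsets.UTF_8));
    private static final Serializer<Long> LONG_SERIALIZER =
            (value, out) -> out.write(Long.toString(value).getBytes(StandardCharsets.UTF_8));
    private static final Deserializer<Long> LONG_DESERIALIZER =
            bytes -> (bytes == null || bytes.length == 0)
                    ? null : Long.parseLong(new String(bytes, StandardCharsets.UTF_8));

    private final DistributedMapCacheClient cache;
    private final long ageOffMillis;

    AgedCacheCheck(DistributedMapCacheClient cache, long ageOffMillis) {
        this.cache = cache;
        this.ageOffMillis = ageOffMillis;
    }

    // A key counts as a duplicate only if it was first seen within the
    // age-off window; older entries are refreshed and treated as new.
    boolean isDuplicate(String key) throws IOException {
        final long now = System.currentTimeMillis();
        final Long firstSeen = cache.getAndPutIfAbsent(
                key, now, STRING_SERIALIZER, LONG_SERIALIZER, LONG_DESERIALIZER);
        if (firstSeen == null) {
            return false; // brand new entry; its timestamp was just stored
        }
        if (now - firstSeen > ageOffMillis) {
            cache.put(key, now, STRING_SERIALIZER, LONG_SERIALIZER); // refresh stale entry
            return false; // aged off: treat as unseen
        }
        return true;
    }
}

With an Age Off Duration of around 24 hours, entries written during yesterday's load are still in the cache when today's file arrives, so unchanged records are filtered out and only new or changed records pass through.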
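Lastly, to illustrate the *Filter Type*, *Filter Capacity Hint*, and *BloomFilter Probability* trio, here is a minimal sketch of the two strategies. I'm assuming Guava's BloomFilter purely for illustration (the actual implementation may or may not use it), and the DuplicateFilter interface and both class names are made up for this example.

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// HashSet: exact answers, but every key stays on the heap.
// BloomFilter: a fixed, much smaller bit array, at the cost of occasional
// false positives (unique records wrongly flagged as duplicates).
interface DuplicateFilter {
    /** Returns true the first time a key is seen, false otherwise. */
    boolean add(String key);
}

class ExactFilter implements DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    public boolean add(String key) {
        return seen.add(key);
    }
}

class ProbabilisticFilter implements DuplicateFilter {
    private final BloomFilter<String> filter;

    ProbabilisticFilter(long capacityHint, double falsePositiveProbability) {
        this.filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                capacityHint, falsePositiveProbability);
    }

    public boolean add(String key) {
        // put() returns true only if the bits changed, i.e. the key was
        // definitely not seen before; false may be a false positive.
        return filter.put(key);
    }
}

The memory argument in rough numbers: at a 1% false-positive probability a Bloom filter needs on the order of 10 bits per expected entry, so a Filter Capacity Hint of 100 million records costs roughly 120 MB, where a HashSet would have to hold all 100 million key strings on the heap.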
Here's to hoping this finds you all warm and well. I love this software!

Adam