GitHub user joemeszaros commented on the pull request:

    https://github.com/apache/nifi/pull/92#issuecomment-142654373
  
    I have several tracking event files containing user interactions (e.g. 
user.x liked item.y) in the following format:
    
    |UserId  | Action | ItemId |
    | ------------- | ------------- | ------------- |
    | user.x | like  | item.y |
    | user.xx | like  | item.z |
    | ... | ... | ... |
    
    I need to enrich these event files, e.g. with the title of the associated 
item, taken from a separate item file that contains the item metadata:
    
    |ItemId  | Title |
    | ------------- | ------------- |
    | item.y | Title for item.y  |
    | item.z | Title for item.z  |
    | ... | ... |
    
    and the enriched event file should look like this:
    
    |UserId  | Action | ItemId | Title |
    | ------------- | ------------- | ------------- | ------------- |
    | user.x | like  | item.y | Title for item.y|
    | user.xx | like  | item.z | Title for item.z|
    
    My idea was to cache the item file in a distributed cache, because that is 
typical controller service functionality, and to use the same cache to extend 
the event files one by one, looking up the title by ItemId. That way I need to 
read the item file only once. I created a workflow that grabs the item file and 
creates a flow file for each item (each line), with the ItemId added as a custom 
flow file attribute, and then puts those flow files into the distributed cache 
using the PutDistributedMapCache processor. The cache key is the custom ItemId 
attribute, and the metadata is the cache value. During event file enrichment I 
use this item catalogue cache to look up an ItemId and get, e.g., the title.
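    
    As a rough illustration of the lookup described above (not the actual NiFi 
flow), here is a minimal Python sketch. It assumes comma-separated files and 
uses a plain dict as a stand-in for the distributed map cache; the function 
names and the sample data layout are hypothetical.

```python
# Illustrative sketch only: a plain dict stands in for the distributed map
# cache, and CSV is an assumed file format (the real flow may use JSON).
import csv
import io

ITEMS = """ItemId,Title
item.y,Title for item.y
item.z,Title for item.z
"""

EVENTS = """UserId,Action,ItemId
user.x,like,item.y
user.xx,like,item.z
"""


def build_item_cache(item_csv):
    """Read the item file once; key = ItemId, value = the item's metadata row."""
    return {row["ItemId"]: row for row in csv.DictReader(io.StringIO(item_csv))}


def enrich_events(event_csv, cache):
    """Yield each event with Title looked up by ItemId (None if not cached)."""
    for event in csv.DictReader(io.StringIO(event_csv)):
        item = cache.get(event["ItemId"])
        yield {**event, "Title": item["Title"] if item else None}


cache = build_item_cache(ITEMS)
enriched = list(enrich_events(EVENTS, cache))
for row in enriched:
    print(row)
```

    The item data is parsed only once, when the cache is built; every event 
file can then be enriched against the same cache with one key lookup per event.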
    
    (My actual workflow is not quite this simple, because I also use JSON 
conversion and additional processors.)
    
    DetectDuplicate was not an appropriate processor for me, because (as its 
name suggests) it is used for duplicate detection and caches a custom flow 
file attribute, not the flow file content.
    
    I hope I was able to explain my rationale for this new processor :-)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
