Github user joemeszaros commented on the pull request:
https://github.com/apache/nifi/pull/92#issuecomment-142654373
I have several tracking event files, containing user interactions, e.g.
user.x liked item.y in the following format:
|UserId | Action | ItemId |
| ------------- | ------------- | ------------- |
| user.x | like | item.y |
| user.xx | like | item.z |
| ... | ... | ... |
I need to enrich these event files e.g. with the title of the associated
item from a separate item file, containing the item metadata:
|ItemId | Title |
| ------------- | ------------- |
| item.y | Title for item.y |
| item.z | Title for item.z |
| ... | ... |
and the enriched event file should look like this:
| UserId | Action | ItemId | Title |
| ------------- | ------------- | ------------- | ------------- |
| user.x | like | item.y | Title for item.y |
| user.xx | like | item.z | Title for item.z |
My idea was to cache the item file in a distributed cache, since caching is
typical controller service functionality, and to use that same cache to enrich
the event files one by one, looking up each title by ItemId. That way I need to
read the item file only once. I created a workflow that grabs the item file and
creates a flow file for each item (each line), with the ItemId added as a custom
flow file attribute, then puts those flow files into the distributed cache
using the PutDistributedMapCache processor. The cache key is the custom ItemId
attribute, and the metadata is the cache value. During event file enrichment I
use this item catalogue cache to look up an ItemId and get, e.g., the title.
(My actual workflow is more involved, because I also use JSON conversion and
additional processors.)
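To make the idea concrete outside of NiFi, here is a minimal Python sketch of
the same two steps: load the item file once into a key/value cache, then enrich
each event by ItemId lookup. It is only an illustration under assumptions: the
files are taken to be comma-separated with a header row, a plain dict stands in
for the distributed map cache, and the function names `build_item_cache` and
`enrich_events` are hypothetical, not part of NiFi.

```python
import csv
import io


def build_item_cache(item_file_text):
    """Read the item file once; cache key is ItemId, cache value is the metadata.

    A plain dict stands in for the distributed map cache (assumption).
    """
    cache = {}
    for row in csv.DictReader(io.StringIO(item_file_text)):
        cache[row["ItemId"]] = {"Title": row["Title"]}
    return cache


def enrich_events(event_file_text, cache):
    """Append the cached Title to every event; unknown ItemIds get an empty Title."""
    enriched = []
    for row in csv.DictReader(io.StringIO(event_file_text)):
        metadata = cache.get(row["ItemId"], {})
        enriched.append({**row, "Title": metadata.get("Title", "")})
    return enriched


# Sample data mirroring the tables above (CSV layout is an assumption).
ITEMS = "ItemId,Title\nitem.y,Title for item.y\nitem.z,Title for item.z\n"
EVENTS = "UserId,Action,ItemId\nuser.x,like,item.y\nuser.xx,like,item.z\n"

item_cache = build_item_cache(ITEMS)          # item file is read only once
enriched_events = enrich_events(EVENTS, item_cache)

for event in enriched_events:
    print(event)
```

The same `item_cache` can be reused for any number of event files, which is the
point of caching the item catalogue instead of re-reading it per file.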
The DetectDuplicate processor was not appropriate for me, because (as its
name suggests) it is used for duplicate detection and caches a custom flow
file attribute, not the flow file content.
I hope I was able to explain my rationale behind this new processor :-)