Hi all,

I'm in the process of migrating an existing pipeline over from the Kafka 
Streams framework to Apache Beam and I'm reaching out in hopes of finding a 
pattern that could address the use-case I'm currently dealing with.

In the most trivial sense, the scenario just involves an incoming message (an 
event) within the pipeline that may contain multiple different entities (users, 
endpoints, ip addresses, etc.) and I'm trying to take the identifiers from each 
of those entities and enrich the event from a source (Kafka) that has more 
information on them.

The workflow might go something like this:
 - Read in incoming event containing a single user entity from a Kafka topic
 - Perform the equivalent of a LEFT JOIN on the user identifier against another 
Kafka topic that contains additional user information
 - If the join succeeds, mutate the existing user instance on the event with 
any additional information from the source topic
 - Write the "enriched" event to a destination (e.g. Kafka, Elastic, etc.)

What would be the best pattern to try and tackle this problem? I know that side 
inputs are an option, however the source user topic could potentially contain 
millions of distinct users (with any "new" users being added to that source 
topic in near-real time, albeit upstream from this process).

Any information or recommendations would be greatly appreciated!

Thanks,

Rion

Reply via email to