Hi all, I'm in the process of migrating an existing pipeline over from the Kafka Streams framework to Apache Beam and I'm reaching out in hopes of finding a pattern that could address the use-case I'm currently dealing with.
In the most trivial sense, the scenario just involves an incoming message (an event) within the pipeline that may contain multiple different entities (users, endpoints, ip addresses, etc.) and I'm trying to take the identifiers from each of those entities and enrich the event from a source (Kafka) that has more information on them. The workflow might go something like this: - Read in incoming event containing a single user entity from a Kafka topic - Perform the equivalent of a LEFT JOIN on the user identifier against another Kafka topic that contains additional user information - If the join succeeds, mutate the existing user instance on the event with any additional information from the source topic - Write the "enriched" event to a destination (e.g. Kafka, Elastic, etc.) What would be the best pattern to try and tackle this problem? I know that side inputs are an option, however the source user topic could potentially contain millions of distinct users (with any "new" users being added to that source topic in near-real time, albeit upstream from this process). Any information or recommendations would be greatly appreciated! Thanks, Rion
