What exactly is an output PCollection in your example? Is it just a PCollection of pairs (email, attachment), or is it like PCollection<Record>, where Record can be either an email or an attachment? Or is it something else?
Could you add a simple example with expected input/output of your pipeline?

> On 18 Jan 2021, at 12:26, Tucker Barbour <tucker.barb...@gmail.com> wrote:
>
> I have a use-case where I'm extracting embedded items from archive file
> formats which themselves have embedded items. For example, a zip file with
> emails with attachments. The goal in this example would be to create a
> PCollection where each email is an element, as well as each attachment being
> an element. (No need to create a tree structure here.) There are certain
> criteria which would prevent continuing embedded item extraction, such as an
> item SHA being present in a "rejection" list. The pipeline will perform a
> series of transformations on the items and then continue to extract embedded
> items. This type of problem lends itself to being solved with an iterative
> algorithm. My understanding is that BEAM does not support iterative
> algorithms to the same extent Spark does. In BEAM I would have to persist the
> results of each iteration and instantiate a new pipeline for each iteration.
> This _works_, though it isn't ideal. The "rejection" list is a PCollection of a
> few million elements. Re-reading this "rejection" list on each iteration
> isn't ideal.
>
> Is there a way to write such an iterative algorithm in BEAM without having to
> create a new pipeline on each iteration? Further, would something like
> SplittableDoFn be a potential solution? Looking through the user's guide and
> some existing implementations of SplittableDoFn I'm thinking not, but I'm
> still trying to understand SplittableDoFns.
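To make the question concrete, here is a minimal plain-Python sketch of the iterative extraction the quoted message describes (no Beam APIs; the item shape, the `embedded_of` callback, and the choice that a rejected SHA stops descent but the rejected item itself is still emitted are all illustrative assumptions, not the author's actual code):

```python
def extract_all(roots, embedded_of, rejected_shas):
    """Iteratively extract embedded items, breadth-first.

    roots: initial items, e.g. top-level archive files.
    embedded_of: callback returning the items embedded in a given item
        (hypothetical helper standing in for the real extraction step).
    rejected_shas: set of SHAs that stop further extraction; assumed here
        to block descending into an item, not to suppress the item itself.
    """
    results = []
    frontier = list(roots)
    while frontier:  # one pass per "iteration" of the algorithm
        next_frontier = []
        for item in frontier:
            results.append(item)
            if item["sha"] in rejected_shas:
                continue  # do not extract embedded items of rejected items
            next_frontier.extend(embedded_of(item))
        frontier = next_frontier
    return results
```

In Beam terms, each `while` pass corresponds to one pipeline run whose output (the new frontier) must be persisted and fed into the next run, which is exactly the per-iteration pipeline instantiation the message calls not ideal.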