What exactly is an output PCollection in your example? Is it just a PCollection 
of pairs (email, attachment), or is it something like PCollection<Record>, where 
a Record can be either an email or an attachment? Or is it something else?

Could you add a simple example with expected input/output of your pipeline?

> On 18 Jan 2021, at 12:26, Tucker Barbour <tucker.barb...@gmail.com> wrote:
> 
> I have a use-case where I'm extracting embedded items from archive file 
> formats which themselves have embedded items. For example a zip file with 
> emails with attachments. The goal in this example would be to create a 
> PCollection where each email is an element as well as each attachment being 
> an element. (No need to create a tree structure here.) There are certain 
> criteria which would prevent continuing embedded item extraction, such as an 
> item SHA being present in a "rejection" list. The pipeline will perform a 
> series of transformations on the items and then continue to extract embedded 
> items. This type of problem lends itself to be solved with an iterative 
> algorithm. My understanding is that BEAM does not support iterative 
> algorithms to the same extent Spark does. In BEAM I would have to persist the 
> results of each iteration and instantiate a new pipeline for each iteration. 
> This _works_ though isn't ideal. The "rejection" list is a PCollection of a 
> few million elements. Re-reading this "rejection" list on each iteration 
> isn't ideal.
> 
> Is there a way to write such an iterative algorithm in BEAM without having to 
> create a new pipeline on each iteration? Further, would something like 
> SplittableDoFn be a potential solution? Looking through the user's guide and 
> some existing implementations of SplittableDoFns, I'm thinking not, but I'm 
> still trying to understand SplittableDoFns.
