Re: [DISCUSS] Proposal: ForkedPCollections in Python SDK

Joey Tran Thu, 05 Feb 2026 13:44:40 -0800

Thanks for such quick feedback!


> On Thu, Feb 5, 2026 at 3:27 PM Danny McCormick via dev <
> [email protected]> wrote:
>
>> Would you mind opening up the doc for comments?
>>
>> At a high level, I'm skeptical of the pattern; it seems to me like it
>> moves the burden of choosing the correct behavior from authors to consumers
>> in non-obvious ways which range from harmless to potentially causing silent
>> data loss. I think if a user wants to drop a PCollection, that should
>> always be an active choice since the risk of data loss is much greater than
>> the EoU benefit of extra code.
>>

I think I chose a few of the motivating examples poorly, but this was at
least helpful in clarifying two distinct patterns:
  - Filters/Samplers/Deduplicators
  - Transforms that may run into issues with certain inputs


>> I'd argue that a better pattern than having a single transform which
>> handles this is to either have a *Filter* or a *Partition* transform
>> which a user can use as needed. These are different transforms because they
>> have different purposes/core functionalities.
>>

This can become unwieldy for a large library of filtering / sampling /
data processing transforms. At Schrodinger, for example, we have maybe a
dozen such transforms, some of which...
  - are samplers where most consumers will just be interested in the
"sample", while other consumers may be interested in both the sample and
the remaining elements
  - are data processing transforms with a concept of processed outputs and
outputs that were "dropped for a well-understood reason"

We'd likely need to double the size of our library in order to have both
Filter and Partition versions of these transforms.
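
As a rough illustration of that duplication, here is a plain-Python stand-in
(lists instead of PCollections, no Beam dependency; `partition_sample` and
`filter_sample` are hypothetical names, not anything in our library or in
Beam). Every sampler would need both entry points:

```python
import random
from typing import Iterable, List, Tuple


def partition_sample(
    items: Iterable[int], fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Partition-style variant: returns (sample, remaining)."""
    rng = random.Random(seed)
    sample: List[int] = []
    remaining: List[int] = []
    for item in items:
        # Each element lands in exactly one of the two outputs.
        if rng.random() < fraction:
            sample.append(item)
        else:
            remaining.append(item)
    return sample, remaining


def filter_sample(
    items: Iterable[int], fraction: float, seed: int = 0
) -> List[int]:
    """Filter-style variant: only the sample, so it chains like a single output."""
    sample, _ = partition_sample(items, fraction, seed)
    return sample
```

The Filter-style wrapper adds no logic of its own. It exists only so the
common case chains cleanly, and that wrapper is what would be repeated for
each transform.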


>> > A parser that routes malformed input to a dead-letter output
>> > A validator that routes violations separately
>> > An enrichment that routes lookup failures aside
>>
>> These are the ones I'm really worried about. In all of these cases, we
>> are silently dropping error output in a way that might be non-obvious to a
>> user. As a user, if I use a parser that returns a single output, I would
>> assume that any parse failures would lead to exceptions.
>>

I agree that it'd be an antipattern for these types of transforms to
silently capture and drop these erroneous records, but there is nothing
preventing an author of a parser/validator/enrichment transform from doing
this today, even without ForkedPCollections. With ForkedPCollections, I
think we can and still should discourage transform authors from silently
handling errors without some active user configuration (e.g. by requiring
a keyword arg like `failed_pcoll_tag="failed"` to enable any error
capturing at all). For example:
```
parsed = pcoll | ParseData()
# parsed.failed --> should not exist; ParseData should not do this automatically

parsed = pcoll | ParseData(failed_pcoll_tag="failed")
# parsed.failed --> exists now but only with user intent
```
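
To spell out the opt-in behavior, here's a minimal plain-Python sketch (again
lists rather than PCollections; `parse_data` and its JSON parsing are
assumptions for illustration, only the `failed_pcoll_tag` keyword comes from
the example above). Without the keyword, a bad record raises; with it, bad
records are collected under the name the caller chose:

```python
import json
from typing import Any, Dict, Iterable, List, Optional


def parse_data(
    lines: Iterable[str], failed_pcoll_tag: Optional[str] = None
) -> Dict[str, List[Any]]:
    """Toy stand-in for ParseData; returns named outputs as plain lists."""
    outputs: Dict[str, List[Any]] = {"main": []}
    if failed_pcoll_tag is not None:
        # The failure output only exists when the caller asked for it.
        outputs[failed_pcoll_tag] = []
    for line in lines:
        try:
            outputs["main"].append(json.loads(line))
        except json.JSONDecodeError:
            if failed_pcoll_tag is None:
                # Default behavior: fail loudly instead of dropping the record.
                raise
            outputs[failed_pcoll_tag].append(line)
    return outputs
```

So `parse_data(lines)` has no "failed" output at all and surfaces the first
parse error, while `parse_data(lines, failed_pcoll_tag="failed")` keeps going
and exposes the rejected records.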



>> With all that said, I am aligned with the goal of making pipelines like
>> this easier to chain. Maybe an in between option would be adding a DoFn
>> utility like:
>>
>> ```
>> pcoll | Partition(...).keep_tag('main') | ChainedParDo()
>> ```
>>
>> Where `keep_tag` forces an expansion where all tags other than main are
>> dropped. What do you think?
>>

This would help, but the solution would be limited to ParDos. If you have
a composite transform like a sophisticated `CompositeFilter` or
`CompositeSampler`, then you wouldn't be able to use `.keep_tag`.

Best,
Joey

>> Thanks,
>> Danny
>>
>> On Thu, Feb 5, 2026 at 3:04 PM Joey Tran <[email protected]>
>> wrote:
>>
>>> Hey everyone,
>>>
>>> My team and I have been running into an awkward pattern with the python
>>> and YAML SDK when we have transforms that have one "main" output that we
>>> want to be able to ergonomically chain, and other "side" outputs that are
>>> useful in some situations. I put together a brief design proposal for a new
>>> PCollection type to make this easier - would appreciate any feedback or
>>> thoughts. Open to different names as well.
>>>
>>> ForkedPCollection Design Doc
>>> <https://docs.google.com/document/d/10kx8hVrF8JfdeIS6X1vjiADk0IRMTVup9aX3u-GthOo/edit?tab=t.0>
>>>
>>> Thanks!
>>> Joey
>>>
>>> --
>>>
>>> Joey Tran | Staff Developer | AutoDesigner TL
>>>
>>> *he/him*
>>>
>>> [image: Schrödinger, Inc.] <https://schrodinger.com/>
>>>
>>
