Re: [DISCUSS] Proposal: ForkedPCollections in Python SDK

Robert Bradshaw via dev Thu, 12 Feb 2026 17:22:38 -0800

On Thu, Feb 12, 2026 at 4:47 PM Valentyn Tymofieiev <[email protected]> wrote:
>
> > Were I to do it again, I would have such transforms return a dict or named
> > tuple (if all outputs are meaningful) or an "augmented" PCollection (as has
> > been proposed here) when they are auxiliary (and preferably leave the
> > decision up to the DoFn implementor, not the caller).
>
> Regarding the "augmented PCollection" concept, would it be feasible to think 
> of a design where every PCollection is implicitly a container that has side 
> outputs? In this world, a standard PCollection is the corner case with 0 
> side outputs. I wonder if this could help avoid introducing a new distinct 
> type like PCollectionWithSideOutputs.
>
> Looking at the code snippet below
>
> results = (p | Create(...)
>              | ParDo(...).with_outputs('side_output_tag', main='main_tag'))
>
> # This currently fails with _InvalidUnpickledPCollection errors
> results | LogElements()
>
>
> This code is failing, since I don't specify the main output, so I think Beam 
> treats the DoOutputsTuple as an iterable of data elements (the PCollections 
> themselves) and maybe tries to Create() a new PCollection from them. However,
> I explicitly specify which output is main. What if DoOutputsTuple supported
> chaining off the 'main' PColl in this case?


Are there any PTransforms that accept a DoOutputsTuple? (Or, if there
are, can we identify them?) This is the primary downside I see to this
route.
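
To make the trade-off concrete, here is a toy, Beam-free sketch of an outputs container whose `|` delegates to a designated main output. `ToyPColl`, `ToyMap`, and `ToyOutputs` are hypothetical names invented for this illustration; they are not Beam classes, and real PCollections are deferred rather than plain lists.

```python
class ToyPColl:
    """Stand-in for a PCollection: eagerly wraps a list of elements."""

    def __init__(self, elements):
        self.elements = list(elements)

    def __or__(self, transform):
        return transform.expand(self)


class ToyMap:
    """Stand-in for a single-input transform that maps a function."""

    def __init__(self, fn):
        self.fn = fn

    def expand(self, pcoll):
        if not isinstance(pcoll, ToyPColl):
            raise TypeError("ToyMap expects a single ToyPColl")
        return ToyPColl(self.fn(x) for x in pcoll.elements)


class ToyOutputs:
    """Stand-in for a multi-output result with an optional main tag."""

    def __init__(self, outputs, main=None):
        self._outputs = dict(outputs)
        self._main = main

    def __getitem__(self, tag):
        return self._outputs[tag]

    def __or__(self, transform):
        # Assumption of this sketch: piping always means "pipe the main output".
        if self._main is None:
            raise TypeError("no main output designated; select one by tag")
        return self._outputs[self._main] | transform


results = ToyOutputs(
    {"main_tag": ToyPColl([1, 2, 3]), "side_output_tag": ToyPColl(["oops"])},
    main="main_tag",
)
doubled = results | ToyMap(lambda x: x * 2)  # chains off the main output
side = results["side_output_tag"]            # side output is still reachable by tag
```

In this toy model, any transform written to consume the whole container could no longer receive it through `|`, which is the downside in question.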

> On Thu, Feb 12, 2026 at 2:52 PM Danny McCormick via dev <[email protected]> 
> wrote:
>>
>> My preference would be enabling `pcoll | Partition(...)['main'] | 
>> ChainedParDo()`, but I think I'm currently the only one with significant 
>> objections - I tried to make time for someone to join my dissent :)
>>
>> Given that, I'm ok with proceeding with roughly the original proposal 
>> (factoring conversation in the doc); my only request would be that we 
>> document the transform in a way that clearly discourages putting 
>> error/exception outputs in the secondary PCollection, and makes it clear 
>> that this is primarily for situations where the main PCollection is 
>> sufficient for most use cases.

+1

>> On Tue, Feb 10, 2026 at 4:42 PM Joey Tran <[email protected]> wrote:
>>>
>>> Just want to bump this. In what direction should we go here?
>>>
>>> On Fri, Feb 6, 2026 at 5:49 PM Joey Tran <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Feb 6, 2026 at 5:43 PM Robert Bradshaw <[email protected]> wrote:
>>>>>
>>>>> On Fri, Feb 6, 2026 at 2:36 PM Joey Tran <[email protected]> 
>>>>> wrote:
>>>>> >
>>>>> > On Fri, Feb 6, 2026 at 4:43 PM Danny McCormick 
>>>>> > <[email protected]> wrote:
>>>>> >>
>>>>> >> On Fri, Feb 6, 2026 at 4:22 PM Joey Tran <[email protected]> 
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> FWIW, much of the value of this proposal to me is the better 
>>>>> >>> readability from not having to consider multiple versions of 
>>>>> >>> transforms and not having to break up chains to extract main outputs. 
>>>>> >>> I appreciate though that we'd be making a trade-off of readability of 
>>>>> >>> the "sad path" for readability of the "happy path"
>>>>> >>
>>>>> >>
>>>>> >> Yeah, that makes sense; what do you think of the other alternative 
>>>>> >> mentioned as an option for optimizing for both kinds of readability? 
>>>>> >> Specifically, allowing:
>>>>> >>
>>>>> >>    pcoll | Partition(...)['main'] | ChainedParDo()
>>>>> >>
>>>>> >> I guess the downside there is education (all pipeline authors need to 
>>>>> >> know this is an option as opposed to only one expert transform 
>>>>> >> author), but I'm curious if it is sufficient for your context.
>>>>> >
>>>>> > Is the suggestion here to implement `__getitem__` on PTransform/ParDo 
>>>>> > so a particular pcollection can be specified? This would definitely be 
>>>>> > an improvement from the current state. I think one further improvement 
>>>>> > would be if we could specify the pcollection by attribute rather than 
>>>>> > by key/string, so `Partition(...).main` instead, but that risks 
>>>>> > pcollection name and ptransform method collisions.
>>>>> >
>>>>> > I'm still partial toward the other suggestions, particularly towards 
>>>>> > implementing `PTransform.with_outputs`, but this is probably sufficient 
>>>>> > for my context.
>>>>>
>>>>> I'll admit that I'm actually not a fan of with_outputs(...). It's not
>>>>> very DRY--I'd rather the consumer decide what it wants to consume by
>>>>> consuming it than have to also (redundantly) specify it on the
>>>>> producer. I think it dates back to trying to copy Java where the
>>>>> return type needs to be a typed PValue. Were I to do it again, I would
>>>>> have such transforms return a dict or named tuple (if all outputs are
>>>>> meaningful) or an "augmented" PCollection (as has been proposed here)
>>>>> when they are auxiliary (and preferably leave the decision up to the
>>>>> DoFn implementor, not the caller).
>>>>>
>>>>> - Robert
>>>>
>>>>
>>>> Ha, yeah I also don't find it the most intuitively named / parametrized. I 
>>>> usually need to look at its documentation each time I need to use it.  
>>>> Standardization is nice though.
