Re: [DISCUSS] Proposal: ForkedPCollections in Python SDK

Joey Tran Thu, 12 Feb 2026 17:54:09 -0800

On Thu, Feb 12, 2026 at 8:23 PM Robert Bradshaw via dev <[email protected]>
wrote:


> On Thu, Feb 12, 2026 at 4:47 PM Valentyn Tymofieiev <[email protected]>
> wrote:
> >
> > >  Were I to do it again, I would have such transforms return a dict or
> named tuple (if all outputs are
> > meaningful) or an "augmented" PCollection (as has been proposed here)
> > when they are auxiliary (and preferably leave the decision up to the
> > DoFn implementor, not the caller).
> >
> > Regarding the "augmented PCollection" concept, would it be feasible to
> think of a design where every PCollection is implicitly a container that
> has side outputs? In this world, a standard PCollection is a the corner
> case with 0 side outputs. I wonder if this could help avoid introducing a
> new distinct type like PCollectionWithSideOutputs.
> >
>

Big +1 from me. I've been tripped up many times from `.with_outputs`
changing the result of a ParDo transform from a PCollection to a tuple, and
I've seen other users similarly confused.


> > Looking at the code snippet below
> >
> > results = (p | Create(...)
> >              | ParDo(...).with_outputs('side_output_tag',
> main='main_tag'))
> >
> > # This currently fails with _InvalidUnpickledPCollection errors
> > results | LogElements()
> >
> >
> > This code is failing, since I don't specify the main output, so I think
> Beam treats the DoOutputsTuple as an iterable of data elements (the
> PCollections themselves) and maybe tries to Create() a new PCollection from
> them. However I explicitly specify which output is main. What if
> DoOutputsTuple in this case supported chaining off the 'main' PColl in this
> case?
>
> Are there any PTransforms that accept a DoOutputsTuple? (Or, if there
> are, can we identify them?) This is the primary downside I see to this
> route.
>

I'm guessing there are probably PTransforms out there somewhere that rely
on this behavior at this point. But maybe we can sidestep backwards
compatibility and just add a new method to use "side outputs", e.g.
`.with_side_outputs`? I think the semantic difference between
`.with_outputs` and `.with_side_outputs` is relatively clear.


>
> > On Thu, Feb 12, 2026 at 2:52 PM Danny McCormick via dev <
> [email protected]> wrote:
> >>
> >> My preference would be enabling `pcoll | Partition(...)['main'] |
> ChainedParDo()`, but I think I'm currently the only one with significant
> objections - I tried to make time for someone to join my dissent :)
> >>
> >> Given that, I'm ok with proceeding with roughly the original proposal
> (factoring conversation in the doc); my only request would be that we
> document the transform in a way that clearly discourages putting
> error/exception outputs in the secondary PCollection, and makes it clear
> that this is primarily for use cases where the main PCollection is
> sufficient for most use cases.
>
>
When you say `document the transform`, what transform are you referring to?
Or do you mean putting a warning in the docstring of
PCollectionWithSideOutputs?


> +1
>
> >> On Tue, Feb 10, 2026 at 4:42 PM Joey Tran <[email protected]>
> wrote:
> >>>
> >>> Just want to bump this. In what direction should we go here?
> >>>
> >>> On Fri, Feb 6, 2026 at 5:49 PM Joey Tran <[email protected]>
> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Feb 6, 2026 at 5:43 PM Robert Bradshaw <[email protected]>
> wrote:
> >>>>>
> >>>>> On Fri, Feb 6, 2026 at 2:36 PM Joey Tran <[email protected]>
> wrote:
> >>>>> >
> >>>>> > On Fri, Feb 6, 2026 at 4:43 PM Danny McCormick <
> [email protected]> wrote:
> >>>>> >>
> >>>>> >> On Fri, Feb 6, 2026 at 4:22 PM Joey Tran <
> [email protected]> wrote:
> >>>>> >>>
> >>>>> >>> FWIW, much of the value of this proposal to me is the better
> readability from not having to consider multiple versions of transforms and
> not having to break up chains to extract main outputs. I appreciate though
> that we'd be making a trade-off of readability of the "sad path" for
> readability of the "happy path"
> >>>>> >>
> >>>>> >>
> >>>>> >> Yeah, that makes sense; what do you think of the other
> alternative mentioned as an option for optimizing for both kinds of
> readability? Specifically, allowing:
> >>>>> >>
> >>>>> >>    pcoll | Partition(...)['main'] | ChainedParDo()
> >>>>> >>
> >>>>> >> I guess the downside there is education (all pipeline authors
> need to know this is an option as opposed to only one expert transform
> author), but I'm curious if it is sufficient for your context.
> >>>>> >
> >>>>> > Is the suggestion here to implement `__getitem__` on
> PTransform/ParDo so a particular pcollection can be specified? This would
> definitely be an improvement from the current state. I think one further
> improvement would be if we could specify the pcollection by attribute
> rather than by key/string, so `Partition(...).main` instead, but that risks
> pcollection name and ptransform method collisions.
> >>>>> >
> >>>>> > I'm still partial toward the other suggestions, particularly
> towards implementing `PTransform.with_outputs`, but this is probably
> sufficient for my context.
> >>>>>
> >>>>> I'll admit that I'm actually not a fan of with_outputs(...). It's not
> >>>>> very dry--I'd rather the consumer decide what it wants to consume by
> >>>>> consuming it than have to also (redundantly) specify it on the
> >>>>> producer. I think it dates back to trying to copy java where the
> >>>>> return type needs to be a typed PValue. Were I to do it again, I
> would
> >>>>> have such transforms return a dict or named tuple (if all outputs are
> >>>>> meaningful) or an "augmented" PCollection (as has been proposed here)
> >>>>> when they are auxiliary (and preferably leave the decision up to the
> >>>>> DoFn implementor, not the caller).
> >>>>>
> >>>>> - Robert
> >>>>
> >>>>
> >>>> Ha, yeah I also don't find it the most intuitively named /
> parametrized. I usually need to look at it's documentation each time I need
> to use it.  Standardization is nice though.
>

Re: [DISCUSS] Proposal: ForkedPCollections in Python SDK

Reply via email to