Re: [DISCUSS] Proposal: ForkedPCollections in Python SDK

Joey Tran Fri, 06 Feb 2026 14:52:55 -0800

On Fri, Feb 6, 2026 at 5:43 PM Robert Bradshaw <[email protected]> wrote:


> On Fri, Feb 6, 2026 at 2:36 PM Joey Tran <[email protected]> wrote:
> >
> > On Fri, Feb 6, 2026 at 4:43 PM Danny McCormick <[email protected]> wrote:
> >>
> >> On Fri, Feb 6, 2026 at 4:22 PM Joey Tran <[email protected]> wrote:
> >>>
> >>> FWIW, much of the value of this proposal to me is the better
> >>> readability from not having to consider multiple versions of transforms
> >>> and not having to break up chains to extract main outputs. I appreciate
> >>> though that we'd be making a trade-off of readability of the "sad path"
> >>> for readability of the "happy path"
> >>
> >>
> >> Yeah, that makes sense; what do you think of the other alternative
> >> mentioned as an option for optimizing for both kinds of readability?
> >> Specifically, allowing:
> >>
> >>    pcoll | Partition(...)['main'] | ChainedParDo()
> >>
> >> I guess the downside there is education (all pipeline authors need to
> >> know this is an option as opposed to only one expert transform author),
> >> but I'm curious if it is sufficient for your context.
> >
> > Is the suggestion here to implement `__getitem__` on PTransform/ParDo so
> > a particular pcollection can be specified? This would definitely be an
> > improvement from the current state. I think one further improvement would
> > be if we could specify the pcollection by attribute rather than by
> > key/string, so `Partition(...).main` instead, but that risks pcollection
> > name and ptransform method collisions.
> >
> > I'm still partial toward the other suggestions, particularly towards
> > implementing `PTransform.with_outputs`, but this is probably sufficient for
> > my context.
>
> I'll admit that I'm actually not a fan of with_outputs(...). It's not
> very DRY--I'd rather the consumer decide what it wants to consume by
> consuming it than have to also (redundantly) specify it on the
> producer. I think it dates back to trying to copy Java where the
> return type needs to be a typed PValue. Were I to do it again, I would
> have such transforms return a dict or named tuple (if all outputs are
> meaningful) or an "augmented" PCollection (as has been proposed here)
> when they are auxiliary (and preferably leave the decision up to the
> DoFn implementor, not the caller).
>
> - Robert
>

Ha, yeah, I also don't find it the most intuitively named / parametrized. I
usually need to look at its documentation each time I need to use it.
Standardization is nice though.
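
To make the `Partition(...)['main']` idea from the quoted discussion concrete, here is a toy sketch in plain Python. It is not Beam code and makes no claim about how the SDK would implement this: plain lists stand in for PCollections, and every class name (`ToyTransform`, `ToyPartition`, `ToyMap`, `_SelectOutput`) is hypothetical. It only illustrates how `__getitem__` on a transform could select one named output without breaking the chain.

```python
# Toy model only: plain lists stand in for PCollections, and none of these
# classes are Beam APIs. All names are hypothetical.


class ToyTransform:
    """Minimal stand-in for a PTransform: supports `data | transform`."""

    def expand(self, data):
        raise NotImplementedError

    def __ror__(self, data):
        # Invoked for `data | transform` because a list has no usable __or__.
        return self.expand(data)

    def __getitem__(self, tag):
        # Hypothetical behavior: `transform[tag]` is a new transform that
        # applies `transform` and keeps only the output named `tag`.
        return _SelectOutput(self, tag)


class _SelectOutput(ToyTransform):
    def __init__(self, inner, tag):
        self.inner = inner
        self.tag = tag

    def expand(self, data):
        return self.inner.expand(data)[self.tag]


class ToyPartition(ToyTransform):
    """Splits elements into two named outputs, 'main' and 'errors'."""

    def __init__(self, is_valid):
        self.is_valid = is_valid

    def expand(self, data):
        outputs = {"main": [], "errors": []}
        for element in data:
            tag = "main" if self.is_valid(element) else "errors"
            outputs[tag].append(element)
        return outputs


class ToyMap(ToyTransform):
    def __init__(self, fn):
        self.fn = fn

    def expand(self, data):
        return [self.fn(element) for element in data]


numbers = [1, -2, 3, -4]

# Subscripting binds tighter than `|`, so the tag is applied to the transform
# and the chain stays unbroken.
doubled = numbers | ToyPartition(lambda x: x > 0)["main"] | ToyMap(lambda x: x * 2)

# Without a tag, every named output is still available to the consumer.
all_outputs = numbers | ToyPartition(lambda x: x > 0)
```

An attribute-based variant (`Partition(...).main`) would need something like `__getattr__` on the transform instead, which is where the name collisions with existing PTransform methods mentioned above would come from.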
