Re: [DISCUSS] Proposal: ForkedPCollections in Python SDK

Joey Tran Tue, 10 Feb 2026 13:43:06 -0800

Just want to bump this. In what direction should we go here?

On Fri, Feb 6, 2026 at 5:49 PM Joey Tran <[email protected]> wrote:


>
>
> On Fri, Feb 6, 2026 at 5:43 PM Robert Bradshaw <[email protected]> wrote:
>
>> On Fri, Feb 6, 2026 at 2:36 PM Joey Tran <[email protected]> wrote:
>> >
>> > On Fri, Feb 6, 2026 at 4:43 PM Danny McCormick <[email protected]> wrote:
>> >>
>> >> On Fri, Feb 6, 2026 at 4:22 PM Joey Tran <[email protected]> wrote:
>> >>>
>> >>> FWIW, much of the value of this proposal to me is the better
>> >>> readability from not having to consider multiple versions of
>> >>> transforms and not having to break up chains to extract main
>> >>> outputs. I appreciate, though, that we'd be making a trade-off of
>> >>> readability of the "sad path" for readability of the "happy path".
>> >>
>> >>
>> >> Yeah, that makes sense; what do you think of the other alternative
>> >> mentioned as an option for optimizing for both kinds of readability?
>> >> Specifically, allowing:
>> >>
>> >>    pcoll | Partition(...)['main'] | ChainedParDo()
>> >>
>> >> I guess the downside there is education (all pipeline authors need to
>> >> know this is an option as opposed to only one expert transform
>> >> author), but I'm curious if it is sufficient for your context.
>> >
>> > Is the suggestion here to implement `__getitem__` on PTransform/ParDo
>> > so a particular pcollection can be specified? This would definitely be
>> > an improvement from the current state. I think one further improvement
>> > would be if we could specify the pcollection by attribute rather than
>> > by key/string, so `Partition(...).main` instead, but that risks
>> > pcollection name and ptransform method collisions.
>> >
>> > I'm still partial toward the other suggestions, particularly towards
>> > implementing `PTransform.with_outputs`, but this is probably sufficient
>> > for my context.
>>
>> I'll admit that I'm actually not a fan of with_outputs(...). It's not
>> very DRY--I'd rather the consumer decide what it wants to consume by
>> consuming it than have to also (redundantly) specify it on the
>> producer. I think it dates back to trying to copy Java where the
>> return type needs to be a typed PValue. Were I to do it again, I would
>> have such transforms return a dict or named tuple (if all outputs are
>> meaningful) or an "augmented" PCollection (as has been proposed here)
>> when they are auxiliary (and preferably leave the decision up to the
>> DoFn implementor, not the caller).
>>
>> - Robert
>>
>
> Ha, yeah I also don't find it the most intuitively named / parametrized. I
> usually need to look at its documentation each time I need to use it.
> Standardization is nice though.
>
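For readers following along, here is a rough sketch of the
`Partition(...)['main']` idea discussed above. It is a toy model in plain
Python, not Beam code: `ToyTransform`, `toy_partition`, and `toy_map` are
hypothetical names made up for illustration, and lists stand in for
pcollections. It only shows how `__getitem__` on a transform could select
one named output so the chain does not have to be broken up.

```python
# Toy model only: these classes are NOT Apache Beam's API. They mimic the
# `values | transform` chaining style with plain lists.


class ToyTransform:
    """Wraps a function that maps a list to some output."""

    def __init__(self, fn):
        self._fn = fn

    def expand(self, values):
        return self._fn(values)

    def __ror__(self, values):
        # Enables `values | transform` when `values` is a plain list.
        return self.expand(values)

    def __getitem__(self, tag):
        # Hypothetical: a new transform that yields only the output `tag`.
        return ToyTransform(lambda values: self.expand(values)[tag])


def toy_partition(predicate):
    """Split input into 'main' (predicate true) and 'rest' (predicate false)."""
    def split(values):
        return {
            'main': [v for v in values if predicate(v)],
            'rest': [v for v in values if not predicate(v)],
        }
    return ToyTransform(split)


def toy_map(fn):
    """Apply `fn` to every element."""
    return ToyTransform(lambda values: [fn(v) for v in values])


# The chain stays in one expression; ['main'] picks the output to continue on.
evens_doubled = (
    [1, 2, 3, 4]
    | toy_partition(lambda v: v % 2 == 0)['main']
    | toy_map(lambda v: v * 2)
)
```

Selecting by attribute (`.main`) would need `__getattr__` instead, and in
this toy model a tag named `expand` would then clash with the method of the
same name, which is the collision risk raised above.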

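Robert's two alternatives (a named tuple when all outputs are meaningful,
an "augmented" collection when the extra outputs are auxiliary) can be
sketched the same way. Again this is a plain-Python toy with lists standing
in for pcollections; `ParseOutputs`, `AugmentedList`, `parse_all`, and
`parse_main` are hypothetical names, not anything proposed for the SDK.

```python
from typing import NamedTuple


class ParseOutputs(NamedTuple):
    """All outputs are meaningful: the caller picks by name."""
    parsed: list
    errors: list


class AugmentedList(list):
    """Behaves as the main output, with auxiliary outputs attached."""

    def __init__(self, main, **auxiliary):
        super().__init__(main)
        self.auxiliary = auxiliary


def parse_all(lines):
    """Return every output as a named tuple."""
    parsed, errors = [], []
    for line in lines:
        try:
            parsed.append(int(line))
        except ValueError:
            errors.append(line)
    return ParseOutputs(parsed=parsed, errors=errors)


def parse_main(lines):
    """Return the main output directly; errors ride along as auxiliary."""
    outputs = parse_all(lines)
    return AugmentedList(outputs.parsed, errors=outputs.errors)


# Named tuple: the consumer decides what to consume by consuming it.
result = parse_all(['1', 'x', '3'])

# Augmented output: usable as the main output without unpacking.
augmented = parse_main(['1', 'x', '3'])
total = sum(augmented)
```

In both shapes the producer declares nothing up front about which outputs
will be consumed, which is the redundancy Robert objects to in
with_outputs(...).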