On Thu, Feb 12, 2026 at 4:47 PM Valentyn Tymofieiev <[email protected]> wrote: > > > Were I to do it again, I would have such transforms return a dict or named > > tuple (if all outputs are > meaningful) or an "augmented" PCollection (as has been proposed here) > when they are auxiliary (and preferably leave the decision up to the > DoFn implementor, not the caller). > > Regarding the "augmented PCollection" concept, would it be feasible to think > of a design where every PCollection is implicitly a container that has side > outputs? In this world, a standard PCollection is a the corner case with 0 > side outputs. I wonder if this could help avoid introducing a new distinct > type like PCollectionWithSideOutputs. > > Looking at the code snippet below > > results = (p | Create(...) > | ParDo(...).with_outputs('side_output_tag', main='main_tag')) > > # This currently fails with _InvalidUnpickledPCollection errors > results | LogElements() > > > This code is failing, since I don't specify the main output, so I think Beam > treats the DoOutputsTuple as an iterable of data elements (the PCollections > themselves) and maybe tries to Create() a new PCollection from them. However > I explicitly specify which output is main. What if DoOutputsTuple in this > case supported chaining off the 'main' PColl in this case?
Are there any PTransforms that accept a DoOutputsTuple? (Or, if there are, can we identify them?) This is the primary downside I see to this route. > On Thu, Feb 12, 2026 at 2:52 PM Danny McCormick via dev <[email protected]> > wrote: >> >> My preference would be enabling `pcoll | Partition(...)['main'] | >> ChainedParDo()`, but I think I'm currently the only one with significant >> objections - I tried to make time for someone to join my dissent :) >> >> Given that, I'm ok with proceeding with roughly the original proposal >> (factoring conversation in the doc); my only request would be that we >> document the transform in a way that clearly discourages putting >> error/exception outputs in the secondary PCollection, and makes it clear >> that this is primarily for use cases where the main PCollection is >> sufficient for most use cases. +1 >> On Tue, Feb 10, 2026 at 4:42 PM Joey Tran <[email protected]> wrote: >>> >>> Just want to bump this. In what direction should we go here? >>> >>> On Fri, Feb 6, 2026 at 5:49 PM Joey Tran <[email protected]> wrote: >>>> >>>> >>>> >>>> On Fri, Feb 6, 2026 at 5:43 PM Robert Bradshaw <[email protected]> wrote: >>>>> >>>>> On Fri, Feb 6, 2026 at 2:36 PM Joey Tran <[email protected]> >>>>> wrote: >>>>> > >>>>> > On Fri, Feb 6, 2026 at 4:43 PM Danny McCormick >>>>> > <[email protected]> wrote: >>>>> >> >>>>> >> On Fri, Feb 6, 2026 at 4:22 PM Joey Tran <[email protected]> >>>>> >> wrote: >>>>> >>> >>>>> >>> FWIW, much of the value of this proposal to me is the better >>>>> >>> readability from not having to consider multiple versions of >>>>> >>> transforms and not having to break up chains to extract main outputs. >>>>> >>> I appreciate though that we'd be making a trade-off of readability of >>>>> >>> the "sad path" for readability of the "happy path" >>>>> >> >>>>> >> >>>>> >> Yeah, that makes sense; what do you think of the other alternative >>>>> >> mentioned as an option for optimizing for both kinds of readability? >>>>> >> Specifically, allowing: >>>>> >> >>>>> >> pcoll | Partition(...)['main'] | ChainedParDo() >>>>> >> >>>>> >> I guess the downside there is education (all pipeline authors need to >>>>> >> know this is an option as opposed to only one expert transform >>>>> >> author), but I'm curious if it is sufficient for your context. >>>>> > >>>>> > Is the suggestion here to implement `__getitem__` on PTransform/ParDo >>>>> > so a particular pcollection can be specified? This would definitely be >>>>> > an improvement from the current state. I think one further improvement >>>>> > would be if we could specify the pcollection by attribute rather than >>>>> > by key/string, so `Partition(...).main` instead, but that risks >>>>> > pcollection name and ptransform method collisions. >>>>> > >>>>> > I'm still partial toward the other suggestions, particularly towards >>>>> > implementing `PTransform.with_outputs`, but this is probably sufficient >>>>> > for my context. >>>>> >>>>> I'll admit that I'm actually not a fan of with_outputs(...). It's not >>>>> very dry--I'd rather the consumer decide what it wants to consume by >>>>> consuming it than have to also (redundantly) specify it on the >>>>> producer. I think it dates back to trying to copy java where the >>>>> return type needs to be a typed PValue. Were I to do it again, I would >>>>> have such transforms return a dict or named tuple (if all outputs are >>>>> meaningful) or an "augmented" PCollection (as has been proposed here) >>>>> when they are auxiliary (and preferably leave the decision up to the >>>>> DoFn implementor, not the caller). >>>>> >>>>> - Robert >>>> >>>> >>>> Ha, yeah I also don't find it the most intuitively named / parametrized. I >>>> usually need to look at it's documentation each time I need to use it. >>>> Standardization is nice though.
