On Wed, Apr 1, 2020 at 3:34 PM Robert Bradshaw <[email protected]> wrote:
> E.g. something like https://github.com/apache/beam/pull/11283 > > That looks good. Though it may be better to do the change under the "passthrough_pcollection_output_ids" experiment flag above it (I'll comment on this in the PR too). > On Wed, Apr 1, 2020 at 2:57 PM Robert Bradshaw <[email protected]> > wrote: > >> On Wed, Apr 1, 2020 at 1:48 PM Sam Rohde <[email protected]> wrote: >> >>> To restate the original issue it is that the current method of setting >>> the output tags on PCollections from composites drops the tag information >>> of the returned PCollections. >>> >> >> Composite PTransforms should *not* be setting output tags on >> returned PCollecitons; this will break producing outputs from the actual >> primitive that produces them. >> >> >>> So an expand returning a dict will have its outputs labeled as None, 1, >>> ..., len(outputs). This is broken because embedded payloads in composites >>> won't be able to reference the outputs if they differ from the returned >>> value. >>> >> >> Yes, we need a way for composites to declare their output tags. Currently >> this is only supported for the multi-output ParDo primitive. >> > >> >>> In this case, having the restriction of no nesting decreases technical >>> complexity substantially and always giving the tag unambiguously informs >>> the SDK what to name the outputs. We can also allow for arbitrary nesting, >>> we'll just have to figure out an unambiguous naming scheme for the >>> PCollections. >>> >> >> How about this: if the returned PValue is a dictionary of string -> >> PCollection, we use the keys as tags. We can extend this naturally to >> tuples, named tuples, nesting, etc. (though I don't know if there are any >> hidden assumptions left about having an output labeled None if we want to >> push this through to completion). >> >> > Sorry for my imperfect vocabulary, this is what I was roughly trying (and failed) to propose. > >>> >>> >>> On Wed, Apr 1, 2020 at 12:44 PM Robert Bradshaw <[email protected]> >>> wrote: >>> >>>> I'm -1 on this, it adds additional restrictions and I don't see what >>>> this buys us. (In particular, it doesn't address the original issue.) >>>> >>>> On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <[email protected]> wrote: >>>> >>>>> So then how does the proposal sound? >>>>> >>>>> Writing again here: >>>>> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, >>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection], >>>>> DoOutputsTuple] >>>>> >>>>> i.e. no arbitrary nesting when outputting from an expand >>>>> >>>>> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <[email protected]> >>>>> wrote: >>>>> >>>>>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <[email protected]> wrote: >>>>>> > >>>>>> > It is important that composites know how things are named so that >>>>>> any embedded payloads in the composite PTransform can reference the >>>>>> outputs >>>>>> appropriately. >>>>>> >>>>>> Very good point. This is part of the cleanup to treat inputs and >>>>>> outputs of PCollections as maps rather than lists generally across the >>>>>> Python representations (which also relates to some of the ugliness >>>>>> that Cham has been running into with cross-language). >>>>>> >>>>>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw < >>>>>> [email protected]> wrote: >>>>>> >> >>>>>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <[email protected]> >>>>>> wrote: >>>>>> >> >>> >>>>>> >> >>> * Don't allow arbitrary nestings returned during expansion, >>>>>> force composite transforms to always provide an unambiguous name (either >>>>>> a >>>>>> tuple with PCollections with unique tags or a dictionary with untagged >>>>>> PCollections or a singular PCollection (Java and Go SDKs do this)). >>>>>> >> >> >>>>>> >> >> I believe that aligning with Java and Go would be the right way >>>>>> to go here. I don't know if this would limit expressiveness. >>>>>> >> > >>>>>> >> > Yeah this sounds like a much more elegant way of handling this >>>>>> situation. I would lean towards this limiting expressiveness because >>>>>> there >>>>>> would be a limit to nesting, but I think that the trade-off with reducing >>>>>> complexity is worth it. >>>>>> >> > >>>>>> >> > So in summary it could be: >>>>>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str, >>>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection], >>>>>> DoOutputsTuple] >>>>>> >> > >>>>>> >> > With the expectation that (pseudo-code): >>>>>> >> > a_transform = ATransform() >>>>>> >> > >>>>>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() == >>>>>> a_transform.outputs.keys() >>>>>> >> > >>>>>> >> > Since this changes the Python SDK composite transform API, what >>>>>> would be the next steps for the community to come to a consensus on this? >>>>>> >> >>>>>> >> It seems here we're conflating the nesting of PValue results with >>>>>> the >>>>>> >> nesting of composite operations. >>>>>> >> >>>>>> >> Both examples in the original post have PTransform nesting (a >>>>>> >> composite) returning a flat tuple. This is completely orthogonal to >>>>>> >> the idea of a PTransform returning a nested result (such as (pc1, >>>>>> >> (pc2, pc3))) and forbidding the latter won't solve the former. >>>>>> >> >>>>>> >> Currently, with the exception of explicit names given for >>>>>> multi-output >>>>>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, >>>>>> ... >>>>>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no >>>>>> matter >>>>>> >> the nesting. We could do better, e.g. for the example above, label >>>>>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but >>>>>> this >>>>>> >> is separate from the idea of trying to relate the output tags of >>>>>> >> composites to the output tags of their inner transforms. >>>>>> >> >>>>>> >> - Robert >>>>>> >>>>> On Wed, Apr 1, 2020 at 3:34 PM Robert Bradshaw <[email protected]> wrote: > E.g. something like https://github.com/apache/beam/pull/11283 > > On Wed, Apr 1, 2020 at 2:57 PM Robert Bradshaw <[email protected]> > wrote: > >> On Wed, Apr 1, 2020 at 1:48 PM Sam Rohde <[email protected]> wrote: >> >>> To restate the original issue it is that the current method of setting >>> the output tags on PCollections from composites drops the tag information >>> of the returned PCollections. >>> >> >> Composite PTransforms should *not* be setting output tags on >> returned PCollecitons; this will break producing outputs from the actual >> primitive that produces them. >> >> >>> So an expand returning a dict will have its outputs labeled as None, 1, >>> ..., len(outputs). This is broken because embedded payloads in composites >>> won't be able to reference the outputs if they differ from the returned >>> value. >>> >> >> Yes, we need a way for composites to declare their output tags. Currently >> this is only supported for the multi-output ParDo primitive. >> >> >>> In this case, having the restriction of no nesting decreases technical >>> complexity substantially and always giving the tag unambiguously informs >>> the SDK what to name the outputs. We can also allow for arbitrary nesting, >>> we'll just have to figure out an unambiguous naming scheme for the >>> PCollections. >>> >> >> How about this: if the returned PValue is a dictionary of string -> >> PCollection, we use the keys as tags. We can extend this naturally to >> tuples, named tuples, nesting, etc. (though I don't know if there are any >> hidden assumptions left about having an output labeled None if we want to >> push this through to completion). >> >> >>> >>> >>> >>> On Wed, Apr 1, 2020 at 12:44 PM Robert Bradshaw <[email protected]> >>> wrote: >>> >>>> I'm -1 on this, it adds additional restrictions and I don't see what >>>> this buys us. (In particular, it doesn't address the original issue.) >>>> >>>> On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <[email protected]> wrote: >>>> >>>>> So then how does the proposal sound? >>>>> >>>>> Writing again here: >>>>> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, >>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection], >>>>> DoOutputsTuple] >>>>> >>>>> i.e. no arbitrary nesting when outputting from an expand >>>>> >>>>> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <[email protected]> >>>>> wrote: >>>>> >>>>>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <[email protected]> wrote: >>>>>> > >>>>>> > It is important that composites know how things are named so that >>>>>> any embedded payloads in the composite PTransform can reference the >>>>>> outputs >>>>>> appropriately. >>>>>> >>>>>> Very good point. This is part of the cleanup to treat inputs and >>>>>> outputs of PCollections as maps rather than lists generally across the >>>>>> Python representations (which also relates to some of the ugliness >>>>>> that Cham has been running into with cross-language). >>>>>> >>>>>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw < >>>>>> [email protected]> wrote: >>>>>> >> >>>>>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <[email protected]> >>>>>> wrote: >>>>>> >> >>> >>>>>> >> >>> * Don't allow arbitrary nestings returned during expansion, >>>>>> force composite transforms to always provide an unambiguous name (either >>>>>> a >>>>>> tuple with PCollections with unique tags or a dictionary with untagged >>>>>> PCollections or a singular PCollection (Java and Go SDKs do this)). >>>>>> >> >> >>>>>> >> >> I believe that aligning with Java and Go would be the right way >>>>>> to go here. I don't know if this would limit expressiveness. >>>>>> >> > >>>>>> >> > Yeah this sounds like a much more elegant way of handling this >>>>>> situation. I would lean towards this limiting expressiveness because >>>>>> there >>>>>> would be a limit to nesting, but I think that the trade-off with reducing >>>>>> complexity is worth it. >>>>>> >> > >>>>>> >> > So in summary it could be: >>>>>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str, >>>>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection], >>>>>> DoOutputsTuple] >>>>>> >> > >>>>>> >> > With the expectation that (pseudo-code): >>>>>> >> > a_transform = ATransform() >>>>>> >> > >>>>>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() == >>>>>> a_transform.outputs.keys() >>>>>> >> > >>>>>> >> > Since this changes the Python SDK composite transform API, what >>>>>> would be the next steps for the community to come to a consensus on this? >>>>>> >> >>>>>> >> It seems here we're conflating the nesting of PValue results with >>>>>> the >>>>>> >> nesting of composite operations. >>>>>> >> >>>>>> >> Both examples in the original post have PTransform nesting (a >>>>>> >> composite) returning a flat tuple. This is completely orthogonal to >>>>>> >> the idea of a PTransform returning a nested result (such as (pc1, >>>>>> >> (pc2, pc3))) and forbidding the latter won't solve the former. >>>>>> >> >>>>>> >> Currently, with the exception of explicit names given for >>>>>> multi-output >>>>>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, >>>>>> ... >>>>>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no >>>>>> matter >>>>>> >> the nesting. We could do better, e.g. for the example above, label >>>>>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but >>>>>> this >>>>>> >> is separate from the idea of trying to relate the output tags of >>>>>> >> composites to the output tags of their inner transforms. >>>>>> >> >>>>>> >> - Robert >>>>>> >>>>>
