This feels like something that maybe should be more explicit? Overloading the transform name to provide a unique stable id feels like perhaps too much magic... also maybe feels like this is leaking specific runner behavior? I get that it's convenient
On Wed, Oct 4, 2023 at 9:16 AM Robert Bradshaw via user < u...@beam.apache.org> wrote: > BeamJava and BeamPython have the exact same behavior: transform names > within must be distinct [1]. This is because we do not necessarily know at > pipeline construction time if the pipeline will be streaming or batch, or > if it will be updated in the future, so the decision was made to impose > this restriction up front. Both will auto-generate a name for you if one is > not given, but will do so deterministically (not depending on some global > context) to avoid potential update problems. > > [1] Note that this applies to the fully qualified transform name, so the > naming only has to be distinct within a composite transform (or at the top > level--the pipeline itself is isomorphic to a single composite transform). > > On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <joey.t...@schrodinger.com> > wrote: > >> Cross posting this thread to dev@ to see if this is intentional behavior >> or if it's something worth changing for the python SDK >> >> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org> >> wrote: >> >>> That suggests the default label is created as that, which indeed causes >>> the duplication error. >>> >>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <joey.t...@schrodinger.com> >>> wrote: >>> >>>> Not sure what that suggests >>>> >>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <u...@beam.apache.org> >>>> wrote: >>>> >>>>> Looks like this is the current behaviour. If you have `t = >>>>> beam.Filter(identity_filter)`, `t.label` is defined as >>>>> `Filter(identity_filter)`. >>>>> >>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <joey.t...@schrodinger.com> >>>>> wrote: >>>>> >>>>>> You don't have to specify the names if the callable you pass in is >>>>>> /different/ for two `beam.Map`s, but if the callable is the same you >>>>>> must >>>>>> specify a label. For example, the below will raise an exception: >>>>>> >>>>>> ``` >>>>>> | beam.Filter(identity_filter) >>>>>> | beam.Filter(identity_filter) >>>>>> ``` >>>>>> >>>>>> Here's an example on playground that shows the error message you get >>>>>> [1]. I marked every line I added with a "# ++". >>>>>> >>>>>> It's a contrived example, but using a map or filter at the same >>>>>> pipeline level probably comes up often, at least in my inexperience. For >>>>>> example, you. might have a pipeline that partitions a pcoll into three >>>>>> different pcolls, runs some transforms on them, and then runs the same >>>>>> type >>>>>> of filter on each of them. >>>>>> >>>>>> The case that happens most often for me is using the `assert_that` >>>>>> [2] testing transform. In this case, I think often users will really have >>>>>> no need for a disambiguating label as they're often just writing unit >>>>>> tests >>>>>> that test a few different properties of their workflow. >>>>>> >>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW >>>>>> [2] >>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that >>>>>> >>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user < >>>>>> u...@beam.apache.org> wrote: >>>>>> >>>>>>> If I understand the question correctly, you don't have to specify >>>>>>> those names. >>>>>>> >>>>>>> As Reuven pointed out, it is probably a good idea so you have a >>>>>>> stable / deterministic graph. >>>>>>> But in the Python SDK, you can simply use pcollection | map_fn, >>>>>>> instead of pcollection | 'Map' >> map_fn. >>>>>>> >>>>>>> See an example here >>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116 >>>>>>> >>>>>>> >>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <joey.t...@schrodinger.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place". >>>>>>>> Can you elaborate? >>>>>>>> >>>>>>>> I forgot to mention my question is posed from the context of a >>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way to >>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it makes >>>>>>>> me >>>>>>>> wonder if the python SDK could support it as well though... (If so, >>>>>>>> I'd be >>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to >>>>>>>> name >>>>>>>> every instance of a transform that you might reuse in a pipeline, e.g. >>>>>>>> when >>>>>>>> reapplying the same Map on different pcollections. >>>>>>>> >>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user < >>>>>>>> u...@beam.apache.org> wrote: >>>>>>>> >>>>>>>>> Are you talking about transform names? The main reason was because >>>>>>>>> for runners that support updating pipelines in place, the only way to >>>>>>>>> do so >>>>>>>>> safely is if the runner can perfectly identify which transforms in >>>>>>>>> the new >>>>>>>>> graph match the ones in the old graph. There's no good way to auto >>>>>>>>> generate >>>>>>>>> names that will stay stable across updates - even small changes to the >>>>>>>>> pipeline might change the order of nodes in the graph, which could >>>>>>>>> result >>>>>>>>> in a corrupted update. >>>>>>>>> >>>>>>>>> However, if you don't care about update, Beam can auto generate >>>>>>>>> these names for you! When you call PCollection.apply (if using >>>>>>>>> BeamJava), >>>>>>>>> simply omit the name parameter and Beam will auto generate a unique >>>>>>>>> name >>>>>>>>> for you. >>>>>>>>> >>>>>>>>> Reuven >>>>>>>>> >>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran < >>>>>>>>> joey.t...@schrodinger.com> wrote: >>>>>>>>> >>>>>>>>>> After writing a few pipelines now, I keep getting tripped up from >>>>>>>>>> accidentally have duplicate labels from using multiple of the same >>>>>>>>>> transforms without labeling them. I figure this must be a common >>>>>>>>>> complaint, >>>>>>>>>> so I was just curious, what the rationale behind this design was? My >>>>>>>>>> naive >>>>>>>>>> thought off the top of my head is that it'd be more user friendly to >>>>>>>>>> just >>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing >>>>>>>>>> something >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Joey >>>>>>>>>> >>>>>>>>>