That makes sense. Would you suggest the new option simply suppress the
RuntimeError and use the non-unique label? Are there places on the SDK side
that expect unique labels? Or in non-updateable runners?

Would making `--auto_unique_labels` a mutually exclusive option with
streaming be a reasonable option? Not sure if stream/batch-specific options
are a good idea or not... but if there's precedent, then it'd be an easy
option

On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <rober...@google.com>
wrote:

>
> Thanks for the PR.
>
> I think we should follow Java and allow non-unique labels, but not provide
> automatic uniquification, In particular, the danger of using a counter is
> that one can get accidental (and potentially hard to check) off-by-one
> collisions. As a concrete example, imagine one partitions a dataset into
> two collections, each followed by a similarly-named transform.
>
>     --> B
>   /
> A
>  \
>    --> B
>
> Uniquification would give something like
>
>     --> B
>   /
> A
>  \
>    --> B_2
>
> Suppose one then realizes there's a third case to handle, giving
>
>     --> B
>   /
> A --> B
>  \
>    --> B
>
> But this would be uniquified to
>
>     --> B
>   /
> A --> B_2
>  \
>    --> B_3
>
> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
> This is bad because an updating runner would then attribute old B_2's state
> to the new B_2 (and also possibly mis-direct any inflight messages). At
> least with the old, intersecting names we can detect this problem
> rather than silently give corrupt data.
>
>
> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> For posterity: https://github.com/apache/beam/pull/28984
>>
>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> I would definitely support a PR making this an option. Changing the
>>> default would be a rather big change that would require more thought.
>>>
>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <joey.t...@schrodinger.com>
>>> wrote:
>>>
>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
>>>> Apache Beam at my company and I'm trying to foresee parts of the API they
>>>> might find inconvenient.
>>>>
>>>> If there's a conclusion to make the behavior similar to java, I'm happy
>>>> to put up a PR
>>>>
>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <joey.t...@schrodinger.com>
>>>> wrote:
>>>>
>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd
>>>>> be a very sticky toggle since it'd be easy for PTransforms to accidentally
>>>>> rely on it.
>>>>>
>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Huh. This used to be a hard error in Java, but I guess it's togglable
>>>>>> with an option now. We should probably add the option to toggle Python 
>>>>>> too.
>>>>>> (Unclear what the default should be, but this probably ties into
>>>>>> re-thinking how pipeline update should work.)
>>>>>>
>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <joey.t...@schrodinger.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>> wordcount example[1] to do the regex filter twice in a row, and unlike 
>>>>>>> the
>>>>>>> BeamPython example I posted before, it just warns instead of throwing an
>>>>>>> exception.
>>>>>>>
>>>>>>> Tangentially, is it expected that the Beam playground examples don't
>>>>>>> have a way to see the outputs of a run example? I have a vague memory 
>>>>>>> that
>>>>>>> there used to be a way to navigate to an output file after it's 
>>>>>>> generated
>>>>>>> but not sure if I just dreamt that up. Playing with the examples, I 
>>>>>>> wasn't
>>>>>>> positive if my runs were actually succeeding or not based on the stdout
>>>>>>> alone.
>>>>>>>
>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>
>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>> transform names within must be distinct [1]. This is because we do not
>>>>>>>> necessarily know at pipeline construction time if the pipeline will be
>>>>>>>> streaming or batch, or if it will be updated in the future, so the 
>>>>>>>> decision
>>>>>>>> was made to impose this restriction up front. Both will auto-generate a
>>>>>>>> name for you if one is not given, but will do so deterministically (not
>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>
>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>> so the naming only has to be distinct within a composite transform (or 
>>>>>>>> at
>>>>>>>> the top level--the pipeline itself is isomorphic to a single composite
>>>>>>>> transform).
>>>>>>>>
>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <joey.t...@schrodinger.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>
>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <u...@beam.apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> That suggests the default label is created as that, which indeed
>>>>>>>>>> causes the duplication error.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is 
>>>>>>>>>>>>> the same you
>>>>>>>>>>>>> must specify a label. For example, the below will raise an 
>>>>>>>>>>>>> exception:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my 
>>>>>>>>>>>>> inexperience.
>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll 
>>>>>>>>>>>>> into three
>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs the 
>>>>>>>>>>>>> same type
>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think often 
>>>>>>>>>>>>> users will
>>>>>>>>>>>>> really have no need for a disambiguating label as they're often 
>>>>>>>>>>>>> just
>>>>>>>>>>>>> writing unit tests that test a few different properties of their 
>>>>>>>>>>>>> workflow.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have
>>>>>>>>>>>>>> a stable / deterministic graph.
>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context of
>>>>>>>>>>>>>>> a python SDK user, and afaict, there doesn't seem to be an 
>>>>>>>>>>>>>>> obvious way to
>>>>>>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports 
>>>>>>>>>>>>>>> it makes me
>>>>>>>>>>>>>>> wonder if the python SDK could support it as well though... (If 
>>>>>>>>>>>>>>> so, I'd be
>>>>>>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to 
>>>>>>>>>>>>>>> have to name
>>>>>>>>>>>>>>> every instance of a transform that you might reuse in a 
>>>>>>>>>>>>>>> pipeline, e.g. when
>>>>>>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>>>>>>> because for runners that support updating pipelines in place, 
>>>>>>>>>>>>>>>> the only way
>>>>>>>>>>>>>>>> to do so safely is if the runner can perfectly identify which 
>>>>>>>>>>>>>>>> transforms in
>>>>>>>>>>>>>>>> the new graph match the ones in the old graph. There's no good 
>>>>>>>>>>>>>>>> way to auto
>>>>>>>>>>>>>>>> generate names that will stay stable across updates - even 
>>>>>>>>>>>>>>>> small changes to
>>>>>>>>>>>>>>>> the pipeline might change the order of nodes in the graph, 
>>>>>>>>>>>>>>>> which could
>>>>>>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>> generate these names for you! When you call PCollection.apply 
>>>>>>>>>>>>>>>> (if using
>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto 
>>>>>>>>>>>>>>>> generate a
>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>> joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped
>>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using 
>>>>>>>>>>>>>>>>> multiple of the same
>>>>>>>>>>>>>>>>> transforms without labeling them. I figure this must be a 
>>>>>>>>>>>>>>>>> common complaint,
>>>>>>>>>>>>>>>>> so I was just curious, what the rationale behind this design 
>>>>>>>>>>>>>>>>> was? My naive
>>>>>>>>>>>>>>>>> thought off the top of my head is that it'd be more user 
>>>>>>>>>>>>>>>>> friendly to just
>>>>>>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be 
>>>>>>>>>>>>>>>>> missing
>>>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

Reply via email to