I'll take another look at the PR. My inclination is still to use uuids
to uniquify. I think that's worth the cost to the readability hit (I'm
OK reducing this down to 6-8 hex digits which will still give very low
chances of collisions, though it doesn't solve the first one). If
someone cares about names more than this they can set them manually.

On Fri, Oct 20, 2023 at 9:30 AM Joey Tran <joey.t...@schrodinger.com> wrote:
>
> Just want to bump this discussion again. I'm introducing Beam to other 
> developers at my Schrodinger now and the first (of hopefully many!) developer 
> has started migrating our internal workflows to Beam. As I suspected though, 
> he's complained about the iteration cycles spent from using the same 
> transform without specifying a label and from using multiple assert_that's 
> without unique labels.
>
> From the java code[1], it looks like the same naming scheme is used (I think, 
> it's been a decade since I've read java so apologies if I'm misreading) as 
> the PR I posted
>
> [1] 
> https://github.com/apache/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630
>
> On Fri, Oct 13, 2023 at 1:32 PM Joey Tran <joey.t...@schrodinger.com> wrote:
>>
>>
>>
>> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <joey.t...@schrodinger.com> 
>>> wrote:
>>>>
>>>> Are there places on the SDK side that expect unique labels? Or in 
>>>> non-updateable runners?
>>>
>>>
>>> That's a good question. The label eventually ends up here: 
>>> https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
>>>  which is technically a violation of the spec if these are not unique 
>>> (though I guess Java already does this?). More importantly, though, the id 
>>> of the transform (the proto contains a map of strings -> transforms) must 
>>> be unique.
>>>
>>
>> Doesn't that suggest not raising an exception won't be a sufficient enough 
>> solution? But I suppose if the Java SDK does this already it's fine..?
>>
>>>
>>> Another option is to make the suffix a uuid rather than a single counter. 
>>> (This would still have issues with the first application possibly getting 
>>> mixed up with a "different" first application unless it was always 
>>> appended.)
>>>
>>
>> Is the marginal benefit (only the first application may be confused instead 
>> of possibly other applications) valuable enough to justify the marginal cost 
>> of the readability hit of adding uuids to the labels?
>>
>>
>>
>>
>>>> On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <rober...@google.com> 
>>>> wrote:
>>>>>
>>>>>
>>>>> Thanks for the PR.
>>>>>
>>>>> I think we should follow Java and allow non-unique labels, but not 
>>>>> provide automatic uniquification, In particular, the danger of using a 
>>>>> counter is that one can get accidental (and potentially hard to check) 
>>>>> off-by-one collisions. As a concrete example, imagine one partitions a 
>>>>> dataset into two collections, each followed by a similarly-named 
>>>>> transform.
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A
>>>>>  \
>>>>>    --> B
>>>>>
>>>>> Uniquification would give something like
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A
>>>>>  \
>>>>>    --> B_2
>>>>>
>>>>> Suppose one then realizes there's a third case to handle, giving
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A --> B
>>>>>  \
>>>>>    --> B
>>>>>
>>>>> But this would be uniquified to
>>>>>
>>>>>     --> B
>>>>>   /
>>>>> A --> B_2
>>>>>  \
>>>>>    --> B_3
>>>>>
>>>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place. 
>>>>> This is bad because an updating runner would then attribute old B_2's 
>>>>> state to the new B_2 (and also possibly mis-direct any inflight 
>>>>> messages). At least with the old, intersecting names we can detect this 
>>>>> problem rather than silently give corrupt data.
>>>>>
>>>>>
>>>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <joey.t...@schrodinger.com> 
>>>>> wrote:
>>>>>>
>>>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>>>
>>>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <rober...@google.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>> I would definitely support a PR making this an option. Changing the 
>>>>>>> default would be a rather big change that would require more thought.
>>>>>>>
>>>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <joey.t...@schrodinger.com> 
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to adopt 
>>>>>>>> Apache Beam at my company and I'm trying to foresee parts of the API 
>>>>>>>> they might find inconvenient.
>>>>>>>>
>>>>>>>> If there's a conclusion to make the behavior similar to java, I'm 
>>>>>>>> happy to put up a PR
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <joey.t...@schrodinger.com> 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd 
>>>>>>>>> be a very sticky toggle since it'd be easy for PTransforms to 
>>>>>>>>> accidentally rely on it.
>>>>>>>>>
>>>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com> 
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's 
>>>>>>>>>> togglable with an option now. We should probably add the option to 
>>>>>>>>>> toggle Python too. (Unclear what the default should be, but this 
>>>>>>>>>> probably ties into re-thinking how pipeline update should work.)
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <joey.t...@schrodinger.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Makes sense that the requirement is the same, but is the label 
>>>>>>>>>>> auto-generation behavior the same? I modified the BeamJava 
>>>>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and 
>>>>>>>>>>> unlike the BeamPython example I posted before, it just warns 
>>>>>>>>>>> instead of throwing an exception.
>>>>>>>>>>>
>>>>>>>>>>> Tangentially, is it expected that the Beam playground examples 
>>>>>>>>>>> don't have a way to see the outputs of a run example? I have a 
>>>>>>>>>>> vague memory that there used to be a way to navigate to an output 
>>>>>>>>>>> file after it's generated but not sure if I just dreamt that up. 
>>>>>>>>>>> Playing with the examples, I wasn't positive if my runs were 
>>>>>>>>>>> actually succeeding or not based on the stdout alone.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user 
>>>>>>>>>>> <u...@beam.apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> BeamJava and BeamPython have the exact same behavior: transform 
>>>>>>>>>>>> names within must be distinct [1]. This is because we do not 
>>>>>>>>>>>> necessarily know at pipeline construction time if the pipeline 
>>>>>>>>>>>> will be streaming or batch, or if it will be updated in the 
>>>>>>>>>>>> future, so the decision was made to impose this restriction up 
>>>>>>>>>>>> front. Both will auto-generate a name for you if one is not given, 
>>>>>>>>>>>> but will do so deterministically (not depending on some global 
>>>>>>>>>>>> context) to avoid potential update problems.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] Note that this applies to the fully qualified transform name, 
>>>>>>>>>>>> so the naming only has to be distinct within a composite transform 
>>>>>>>>>>>> (or at the top level--the pipeline itself is isomorphic to a 
>>>>>>>>>>>> single composite transform).
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran 
>>>>>>>>>>>> <joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional 
>>>>>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user 
>>>>>>>>>>>>> <u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That suggests the default label is created as that, which indeed 
>>>>>>>>>>>>>> causes the duplication error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran 
>>>>>>>>>>>>>> <joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user 
>>>>>>>>>>>>>>> <u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t = 
>>>>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as 
>>>>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran 
>>>>>>>>>>>>>>>> <joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass 
>>>>>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable 
>>>>>>>>>>>>>>>>> is the same you must specify a label. For example, the below 
>>>>>>>>>>>>>>>>> will raise an exception:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here's an example on playground that shows the error message 
>>>>>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the 
>>>>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my 
>>>>>>>>>>>>>>>>> inexperience. For example, you. might have a pipeline that 
>>>>>>>>>>>>>>>>> partitions a pcoll into three different pcolls, runs some 
>>>>>>>>>>>>>>>>> transforms on them, and then runs the same type of filter on 
>>>>>>>>>>>>>>>>> each of them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The case that happens most often for me is using the 
>>>>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think 
>>>>>>>>>>>>>>>>> often users will really have no need for a disambiguating 
>>>>>>>>>>>>>>>>> label as they're often just writing unit tests that test a 
>>>>>>>>>>>>>>>>> few different properties of their workflow.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] 
>>>>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>>>> [2] 
>>>>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user 
>>>>>>>>>>>>>>>>> <u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to 
>>>>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you 
>>>>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection | 
>>>>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> See an example here 
>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran 
>>>>>>>>>>>>>>>>>> <joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in 
>>>>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context 
>>>>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be 
>>>>>>>>>>>>>>>>>>> an obvious way to autogenerate names/labels. Hearing that 
>>>>>>>>>>>>>>>>>>> the java SDK supports it makes me wonder if the python SDK 
>>>>>>>>>>>>>>>>>>> could support it as well though... (If so, I'd be happy to 
>>>>>>>>>>>>>>>>>>> do implement it). Currently, it's fairly tedious to have to 
>>>>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in 
>>>>>>>>>>>>>>>>>>> a pipeline, e.g. when reapplying the same Map on different 
>>>>>>>>>>>>>>>>>>> pcollections.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user 
>>>>>>>>>>>>>>>>>>> <u...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason was 
>>>>>>>>>>>>>>>>>>>> because for runners that support updating pipelines in 
>>>>>>>>>>>>>>>>>>>> place, the only way to do so safely is if the runner can 
>>>>>>>>>>>>>>>>>>>> perfectly identify which transforms in the new graph match 
>>>>>>>>>>>>>>>>>>>> the ones in the old graph. There's no good way to auto 
>>>>>>>>>>>>>>>>>>>> generate names that will stay stable across updates - even 
>>>>>>>>>>>>>>>>>>>> small changes to the pipeline might change the order of 
>>>>>>>>>>>>>>>>>>>> nodes in the graph, which could result in a corrupted 
>>>>>>>>>>>>>>>>>>>> update.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto 
>>>>>>>>>>>>>>>>>>>> generate these names for you! When you call 
>>>>>>>>>>>>>>>>>>>> PCollection.apply (if using BeamJava), simply omit the 
>>>>>>>>>>>>>>>>>>>> name parameter and Beam will auto generate a unique name 
>>>>>>>>>>>>>>>>>>>> for you.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran 
>>>>>>>>>>>>>>>>>>>> <joey.t...@schrodinger.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped 
>>>>>>>>>>>>>>>>>>>>> up from accidentally have duplicate labels from using 
>>>>>>>>>>>>>>>>>>>>> multiple of the same transforms without labeling them. I 
>>>>>>>>>>>>>>>>>>>>> figure this must be a common complaint, so I was just 
>>>>>>>>>>>>>>>>>>>>> curious, what the rationale behind this design was? My 
>>>>>>>>>>>>>>>>>>>>> naive thought off the top of my head is that it'd be more 
>>>>>>>>>>>>>>>>>>>>> user friendly to just auto increment duplicate 
>>>>>>>>>>>>>>>>>>>>> transforms, but I figure I must be missing something
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Joey

Reply via email to