Re: [QUESTION] Why no auto labels?

Joey Tran Fri, 13 Oct 2023 10:33:16 -0700

On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw <[email protected]> wrote:


> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran <[email protected]>
> wrote:
>
Are there places on the SDK side that expect unique labels? Or in
>> non-updateable runners?
>>
>
> That's a good question. The label eventually ends up here:
> https://github.com/apache/beam/blob/release-2.51.0/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L142
> which is technically a violation of the spec if these are not unique
> (though I guess Java already does this?). More importantly, though, the id
> of the transform (the proto contains a map of strings -> transforms) must
> be unique.
>
>
Doesn't that suggest not raising an exception won't be a sufficient enough
solution? But I suppose if the Java SDK does this already it's fine..?


> Another option is to make the suffix a uuid rather than a single counter.
> (This would still have issues with the first application possibly getting
> mixed up with a "different" first application unless it was always
> appended.)
>
>
Is the marginal benefit (*only* the first application may be confused
instead of possibly other applications) valuable enough to justify the
marginal cost of the readability hit of adding uuids to the labels?




On Fri, Oct 13, 2023 at 12:52 PM Robert Bradshaw <[email protected]>
>> wrote:
>>
>>>
>>> Thanks for the PR.
>>>
>>> I think we should follow Java and allow non-unique labels, but not
>>> provide automatic uniquification, In particular, the danger of using a
>>> counter is that one can get accidental (and potentially hard to check)
>>> off-by-one collisions. As a concrete example, imagine one partitions a
>>> dataset into two collections, each followed by a similarly-named transform.
>>>
>>>     --> B
>>>   /
>>> A
>>>  \
>>>    --> B
>>>
>>> Uniquification would give something like
>>>
>>>     --> B
>>>   /
>>> A
>>>  \
>>>    --> B_2
>>>
>>> Suppose one then realizes there's a third case to handle, giving
>>>
>>>     --> B
>>>   /
>>> A --> B
>>>  \
>>>    --> B
>>>
>>> But this would be uniquified to
>>>
>>>     --> B
>>>   /
>>> A --> B_2
>>>  \
>>>    --> B_3
>>>
>>> where the old B_2 got renamed to B_3 and a new B_2 got put in its place.
>>> This is bad because an updating runner would then attribute old B_2's state
>>> to the new B_2 (and also possibly mis-direct any inflight messages). At
>>> least with the old, intersecting names we can detect this problem
>>> rather than silently give corrupt data.
>>>
>>>
>>> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran <[email protected]>
>>> wrote:
>>>
>>>> For posterity: https://github.com/apache/beam/pull/28984
>>>>
>>>> On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> I would definitely support a PR making this an option. Changing the
>>>>> default would be a rather big change that would require more thought.
>>>>>
>>>>> On Tue, Oct 10, 2023 at 4:24 PM Joey Tran <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Bump on this. Sorry to pester - I'm trying to get a few teams to
>>>>>> adopt Apache Beam at my company and I'm trying to foresee parts of the 
>>>>>> API
>>>>>> they might find inconvenient.
>>>>>>
>>>>>> If there's a conclusion to make the behavior similar to java, I'm
>>>>>> happy to put up a PR
>>>>>>
>>>>>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Is it really toggleable in Java? I imagine that if it's a toggle
>>>>>>> it'd be a very sticky toggle since it'd be easy for PTransforms to
>>>>>>> accidentally rely on it.
>>>>>>>
>>>>>>> On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Huh. This used to be a hard error in Java, but I guess it's
>>>>>>>> togglable with an option now. We should probably add the option to 
>>>>>>>> toggle
>>>>>>>> Python too. (Unclear what the default should be, but this probably ties
>>>>>>>> into re-thinking how pipeline update should work.)
>>>>>>>>
>>>>>>>> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Makes sense that the requirement is the same, but is the label
>>>>>>>>> auto-generation behavior the same? I modified the BeamJava
>>>>>>>>> wordcount example[1] to do the regex filter twice in a row, and 
>>>>>>>>> unlike the
>>>>>>>>> BeamPython example I posted before, it just warns instead of throwing 
>>>>>>>>> an
>>>>>>>>> exception.
>>>>>>>>>
>>>>>>>>> Tangentially, is it expected that the Beam playground examples
>>>>>>>>> don't have a way to see the outputs of a run example? I have a vague 
>>>>>>>>> memory
>>>>>>>>> that there used to be a way to navigate to an output file after it's
>>>>>>>>> generated but not sure if I just dreamt that up. Playing with the 
>>>>>>>>> examples,
>>>>>>>>> I wasn't positive if my runs were actually succeeding or not based on 
>>>>>>>>> the
>>>>>>>>> stdout alone.
>>>>>>>>>
>>>>>>>>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>>>>>>>>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>>>>>>>>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>
>>>>>>>>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> BeamJava and BeamPython have the exact same behavior:
>>>>>>>>>> transform names within must be distinct [1]. This is because we do 
>>>>>>>>>> not
>>>>>>>>>> necessarily know at pipeline construction time if the pipeline will 
>>>>>>>>>> be
>>>>>>>>>> streaming or batch, or if it will be updated in the future, so the 
>>>>>>>>>> decision
>>>>>>>>>> was made to impose this restriction up front. Both will 
>>>>>>>>>> auto-generate a
>>>>>>>>>> name for you if one is not given, but will do so deterministically 
>>>>>>>>>> (not
>>>>>>>>>> depending on some global context) to avoid potential update problems.
>>>>>>>>>>
>>>>>>>>>> [1] Note that this applies to the fully qualified transform name,
>>>>>>>>>> so the naming only has to be distinct within a composite transform 
>>>>>>>>>> (or at
>>>>>>>>>> the top level--the pipeline itself is isomorphic to a single 
>>>>>>>>>> composite
>>>>>>>>>> transform).
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Cross posting this thread to dev@ to see if this is intentional
>>>>>>>>>>> behavior or if it's something worth changing for the python SDK
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> That suggests the default label is created as that, which
>>>>>>>>>>>> indeed causes the duplication error.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure what that suggests
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>>>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>>>>>>>>> `Filter(identity_filter)`.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You don't have to specify the names if the callable you pass
>>>>>>>>>>>>>>> in is /different/ for two `beam.Map`s, but  if the callable is 
>>>>>>>>>>>>>>> the same you
>>>>>>>>>>>>>>> must specify a label. For example, the below will raise an 
>>>>>>>>>>>>>>> exception:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's an example on playground that shows the error message
>>>>>>>>>>>>>>> you get [1]. I marked every line I added with a "# ++".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a contrived example, but using a map or filter at the
>>>>>>>>>>>>>>> same pipeline level probably comes up often, at least in my 
>>>>>>>>>>>>>>> inexperience.
>>>>>>>>>>>>>>> For example, you. might have a pipeline that partitions a pcoll 
>>>>>>>>>>>>>>> into three
>>>>>>>>>>>>>>> different pcolls, runs some transforms on them, and then runs 
>>>>>>>>>>>>>>> the same type
>>>>>>>>>>>>>>> of filter on each of them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The case that happens most often for me is using the
>>>>>>>>>>>>>>> `assert_that` [2] testing transform. In this case, I think 
>>>>>>>>>>>>>>> often users will
>>>>>>>>>>>>>>> really have no need for a disambiguating label as they're often 
>>>>>>>>>>>>>>> just
>>>>>>>>>>>>>>> writing unit tests that test a few different properties of 
>>>>>>>>>>>>>>> their workflow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I understand the question correctly, you don't have to
>>>>>>>>>>>>>>>> specify those names.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As Reuven pointed out, it is probably a good idea so you
>>>>>>>>>>>>>>>> have a stable / deterministic graph.
>>>>>>>>>>>>>>>> But in the Python SDK, you can simply use pcollection |
>>>>>>>>>>>>>>>> map_fn, instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See an example here
>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in
>>>>>>>>>>>>>>>>> place". Can you elaborate?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I forgot to mention my question is posed from the context
>>>>>>>>>>>>>>>>> of a python SDK user, and afaict, there doesn't seem to be an 
>>>>>>>>>>>>>>>>> obvious way
>>>>>>>>>>>>>>>>> to autogenerate names/labels. Hearing that the java SDK 
>>>>>>>>>>>>>>>>> supports it makes
>>>>>>>>>>>>>>>>> me wonder if the python SDK could support it as well 
>>>>>>>>>>>>>>>>> though... (If so, I'd
>>>>>>>>>>>>>>>>> be happy to do implement it). Currently, it's fairly tedious 
>>>>>>>>>>>>>>>>> to have to
>>>>>>>>>>>>>>>>> name every instance of a transform that you might reuse in a 
>>>>>>>>>>>>>>>>> pipeline, e.g.
>>>>>>>>>>>>>>>>> when reapplying the same Map on different pcollections.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Are you talking about transform names? The main reason
>>>>>>>>>>>>>>>>>> was because for runners that support updating pipelines in 
>>>>>>>>>>>>>>>>>> place, the only
>>>>>>>>>>>>>>>>>> way to do so safely is if the runner can perfectly identify 
>>>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>>>> transforms in the new graph match the ones in the old graph. 
>>>>>>>>>>>>>>>>>> There's no
>>>>>>>>>>>>>>>>>> good way to auto generate names that will stay stable across 
>>>>>>>>>>>>>>>>>> updates - even
>>>>>>>>>>>>>>>>>> small changes to the pipeline might change the order of 
>>>>>>>>>>>>>>>>>> nodes in the graph,
>>>>>>>>>>>>>>>>>> which could result in a corrupted update.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> However, if you don't care about update, Beam can auto
>>>>>>>>>>>>>>>>>> generate these names for you! When you call 
>>>>>>>>>>>>>>>>>> PCollection.apply (if using
>>>>>>>>>>>>>>>>>> BeamJava), simply omit the name parameter and Beam will auto 
>>>>>>>>>>>>>>>>>> generate a
>>>>>>>>>>>>>>>>>> unique name for you.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After writing a few pipelines now, I keep getting
>>>>>>>>>>>>>>>>>>> tripped up from accidentally have duplicate labels from 
>>>>>>>>>>>>>>>>>>> using multiple of
>>>>>>>>>>>>>>>>>>> the same transforms without labeling them. I figure this 
>>>>>>>>>>>>>>>>>>> must be a common
>>>>>>>>>>>>>>>>>>> complaint, so I was just curious, what the rationale behind 
>>>>>>>>>>>>>>>>>>> this design
>>>>>>>>>>>>>>>>>>> was? My naive thought off the top of my head is that it'd 
>>>>>>>>>>>>>>>>>>> be more user
>>>>>>>>>>>>>>>>>>> friendly to just auto increment duplicate transforms, but I 
>>>>>>>>>>>>>>>>>>> figure I must
>>>>>>>>>>>>>>>>>>> be missing something
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Joey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Reply via email to