Re: [QUESTION] Why no auto labels?

Joey Tran Thu, 05 Oct 2023 09:49:38 -0700

Is it really toggleable in Java? I imagine that if it's a toggle it'd be a
very sticky toggle since it'd be easy for PTransforms to accidentally rely
on it.


On Thu, Oct 5, 2023 at 12:33 PM Robert Bradshaw <[email protected]> wrote:

> Huh. This used to be a hard error in Java, but I guess it's togglable
> with an option now. We should probably add the option to toggle Python too.
> (Unclear what the default should be, but this probably ties into
> re-thinking how pipeline update should work.)
>
> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran <[email protected]>
> wrote:
>
>> Makes sense that the requirement is the same, but is the label
>> auto-generation behavior the same? I modified the BeamJava
>> wordcount example[1] to do the regex filter twice in a row, and unlike the
>> BeamPython example I posted before, it just warns instead of throwing an
>> exception.
>>
>> Tangentially, is it expected that the Beam playground examples don't have
>> a way to see the outputs of a run example? I have a vague memory that there
>> used to be a way to navigate to an output file after it's generated but not
>> sure if I just dreamt that up. Playing with the examples, I wasn't positive
>> if my runs were actually succeeding or not based on the stdout alone.
>>
>> [1] https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2
>> <https://play.beam.apache.org/?sdk=java&shared=mI7WUeje_r2>
>> [2] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>
>> On Wed, Oct 4, 2023 at 12:16 PM Robert Bradshaw via user <
>> [email protected]> wrote:
>>
>>> BeamJava and BeamPython have the exact same behavior: transform names
>>> within must be distinct [1]. This is because we do not necessarily know at
>>> pipeline construction time if the pipeline will be streaming or batch, or
>>> if it will be updated in the future, so the decision was made to impose
>>> this restriction up front. Both will auto-generate a name for you if one is
>>> not given, but will do so deterministically (not depending on some global
>>> context) to avoid potential update problems.
>>>
>>> [1] Note that this applies to the fully qualified transform name, so the
>>> naming only has to be distinct within a composite transform (or at the top
>>> level--the pipeline itself is isomorphic to a single composite transform).
>>>
>>> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran <[email protected]>
>>> wrote:
>>>
>>>> Cross posting this thread to dev@ to see if this is intentional
>>>> behavior or if it's something worth changing for the python SDK
>>>>
>>>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user <[email protected]>
>>>> wrote:
>>>>
>>>>> That suggests the default label is created as that, which indeed
>>>>> causes the duplication error.
>>>>>
>>>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Not sure what that suggests
>>>>>>
>>>>>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Looks like this is the current behaviour. If you have `t =
>>>>>>> beam.Filter(identity_filter)`, `t.label` is defined as
>>>>>>> `Filter(identity_filter)`.
>>>>>>>
>>>>>>> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You don't have to specify the names if the callable you pass in is
>>>>>>>> /different/ for two `beam.Map`s, but  if the callable is the same you 
>>>>>>>> must
>>>>>>>> specify a label. For example, the below will raise an exception:
>>>>>>>>
>>>>>>>> ```
>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>>         | beam.Filter(identity_filter)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Here's an example on playground that shows the error message you
>>>>>>>> get [1]. I marked every line I added with a "# ++".
>>>>>>>>
>>>>>>>> It's a contrived example, but using a map or filter at the same
>>>>>>>> pipeline level probably comes up often, at least in my inexperience. 
>>>>>>>> For
>>>>>>>> example, you. might have a pipeline that partitions a pcoll into three
>>>>>>>> different pcolls, runs some transforms on them, and then runs the same 
>>>>>>>> type
>>>>>>>> of filter on each of them.
>>>>>>>>
>>>>>>>> The case that happens most often for me is using the `assert_that`
>>>>>>>> [2] testing transform. In this case, I think often users will really 
>>>>>>>> have
>>>>>>>> no need for a disambiguating label as they're often just writing unit 
>>>>>>>> tests
>>>>>>>> that test a few different properties of their workflow.
>>>>>>>>
>>>>>>>> [1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
>>>>>>>> [2]
>>>>>>>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>>>>>>>
>>>>>>>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> If I understand the question correctly, you don't have to specify
>>>>>>>>> those names.
>>>>>>>>>
>>>>>>>>> As Reuven pointed out, it is probably a good idea so you have a
>>>>>>>>> stable / deterministic graph.
>>>>>>>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>>>>>>>> instead of pcollection | 'Map' >> map_fn.
>>>>>>>>>
>>>>>>>>> See an example here
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hmm, I'm not sure what you mean by "updating pipelines in place".
>>>>>>>>>> Can you elaborate?
>>>>>>>>>>
>>>>>>>>>> I forgot to mention my question is posed from the context of a
>>>>>>>>>> python SDK user, and afaict, there doesn't seem to be an obvious way 
>>>>>>>>>> to
>>>>>>>>>> autogenerate names/labels. Hearing that the java SDK supports it 
>>>>>>>>>> makes me
>>>>>>>>>> wonder if the python SDK could support it as well though... (If so, 
>>>>>>>>>> I'd be
>>>>>>>>>> happy to do implement it). Currently, it's fairly tedious to have to 
>>>>>>>>>> name
>>>>>>>>>> every instance of a transform that you might reuse in a pipeline, 
>>>>>>>>>> e.g. when
>>>>>>>>>> reapplying the same Map on different pcollections.
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Are you talking about transform names? The main reason was
>>>>>>>>>>> because for runners that support updating pipelines in place, the 
>>>>>>>>>>> only way
>>>>>>>>>>> to do so safely is if the runner can perfectly identify which 
>>>>>>>>>>> transforms in
>>>>>>>>>>> the new graph match the ones in the old graph. There's no good way 
>>>>>>>>>>> to auto
>>>>>>>>>>> generate names that will stay stable across updates - even small 
>>>>>>>>>>> changes to
>>>>>>>>>>> the pipeline might change the order of nodes in the graph, which 
>>>>>>>>>>> could
>>>>>>>>>>> result in a corrupted update.
>>>>>>>>>>>
>>>>>>>>>>> However, if you don't care about update, Beam can auto generate
>>>>>>>>>>> these names for you! When you call PCollection.apply (if using 
>>>>>>>>>>> BeamJava),
>>>>>>>>>>> simply omit the name parameter and Beam will auto generate a unique 
>>>>>>>>>>> name
>>>>>>>>>>> for you.
>>>>>>>>>>>
>>>>>>>>>>> Reuven
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> After writing a few pipelines now, I keep getting tripped up
>>>>>>>>>>>> from accidentally have duplicate labels from using multiple of the 
>>>>>>>>>>>> same
>>>>>>>>>>>> transforms without labeling them. I figure this must be a common 
>>>>>>>>>>>> complaint,
>>>>>>>>>>>> so I was just curious, what the rationale behind this design was? 
>>>>>>>>>>>> My naive
>>>>>>>>>>>> thought off the top of my head is that it'd be more user friendly 
>>>>>>>>>>>> to just
>>>>>>>>>>>> auto increment duplicate transforms, but I figure I must be missing
>>>>>>>>>>>> something
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Joey
>>>>>>>>>>>>
>>>>>>>>>>>

Re: [QUESTION] Why no auto labels?

Reply via email to