Re: [QUESTION] Why no auto labels?

Joey Tran Mon, 02 Oct 2023 06:23:49 -0700

You don't have to specify the names if the callable you pass in is
/different/ for two `beam.Map`s (or `beam.Filter`s), but if the callable is
the same you must specify a label. For example, the below will raise an
exception:


```
        | beam.Filter(identity_filter)
        | beam.Filter(identity_filter)
```
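
To make the rule concrete without needing Beam installed, here is a toy
model of the duplicate-label check. It is not Beam's actual code: it assumes
the default label is derived from the callable's name, which is what the
behavior described above suggests, and `ToyPipeline` and `apply_filter` are
made-up names for illustration only.

```python
# Toy model only: ToyPipeline and apply_filter are hypothetical names,
# not part of apache_beam.


class ToyPipeline:
    """Tracks applied transform labels and rejects duplicates."""

    def __init__(self):
        self.applied_labels = set()

    def apply_filter(self, fn, label=None):
        # Assumption: with no explicit label, one is derived from the
        # callable's name.
        label = label or f"Filter({fn.__name__})"
        if label in self.applied_labels:
            raise RuntimeError(f'A transform with label "{label}" already exists')
        self.applied_labels.add(label)
        return label


def identity_filter(x):
    return True


def is_positive(x):
    return x > 0


p = ToyPipeline()
p.apply_filter(identity_filter)
p.apply_filter(is_positive)  # different callable, different default label

try:
    p.apply_filter(identity_filter)  # same callable, no label: clash
    clashed = False
except RuntimeError:
    clashed = True

# An explicit label disambiguates the repeated application.
second = p.apply_filter(identity_filter, label="FilterAgain")
```

In real Beam code the explicit label would be written as
`'FilterAgain' >> beam.Filter(identity_filter)`.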

Here's an example on playground that shows the error message you get [1]. I
marked every line I added with a "# ++".

It's a contrived example, but using the same map or filter more than once
at the same pipeline level probably comes up often, at least in my limited
experience. For example, you might have a pipeline that partitions a pcoll
into three different pcolls, runs some transforms on them, and then runs the
same type of filter on each of them.

The case that happens most often for me is using the `assert_that` [2]
testing transform. In this case, I think often users will really have no
need for a disambiguating label as they're often just writing unit tests
that test a few different properties of their workflow.

[1] https://play.beam.apache.org/?sdk=python&shared=hIrm7jvCamW
[2] https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that

On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <[email protected]>
wrote:

> If I understand the question correctly, you don't have to specify those
> names.
>
> As Reuven pointed out, it is probably a good idea so you have a stable /
> deterministic graph.
> But in the Python SDK, you can simply use pcollection | map_fn, instead
> of pcollection | 'Map' >> map_fn.
>
> See an example here
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>
>
> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran <[email protected]>
> wrote:
>
>> Hmm, I'm not sure what you mean by "updating pipelines in place". Can you
>> elaborate?
>>
>> I forgot to mention my question is posed from the context of a python SDK
>> user, and afaict, there doesn't seem to be an obvious way to autogenerate
>> names/labels. Hearing that the java SDK supports it makes me wonder if the
>> python SDK could support it as well though... (If so, I'd be happy to
>> implement it). Currently, it's fairly tedious to have to name every
>> instance of a transform that you might reuse in a pipeline, e.g. when
>> reapplying the same Map on different pcollections.
>>
>> On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <[email protected]>
>> wrote:
>>
>>> Are you talking about transform names? The main reason is that for
>>> runners that support updating pipelines in place, the only way to do so
>>> safely is if the runner can perfectly identify which transforms in the new
>>> graph match the ones in the old graph. There's no good way to auto generate
>>> names that will stay stable across updates - even small changes to the
>>> pipeline might change the order of nodes in the graph, which could result
>>> in a corrupted update.
>>>
>>> However, if you don't care about update, Beam can auto generate these
>>> names for you! When you call PCollection.apply (if using Beam Java), simply
>>> omit the name parameter and Beam will auto generate a unique name for you.
>>>
>>> Reuven
>>>
>>> On Sat, Sep 30, 2023 at 11:54 AM Joey Tran <[email protected]>
>>> wrote:
>>>
>>>> After writing a few pipelines now, I keep getting tripped up by
>>>> accidentally having duplicate labels when I use the same transform
>>>> multiple times without labeling each use. I figure this must be a common
>>>> complaint, so I was just curious: what was the rationale behind this
>>>> design? My naive thought off the top of my head is that it'd be more user
>>>> friendly to just auto increment duplicate transforms, but I figure I must
>>>> be missing something.
>>>>
>>>> Cheers,
>>>> Joey
>>>>
>>>
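
The stability concern Reuven raises in the thread above can be sketched
with a small, self-contained example. This is only an illustration under
stated assumptions: the naming scheme (`Map_1`, `Map_2`, ...), the
`auto_label` and `label_to_step` helpers, and the step names are all
hypothetical and are not how any Beam SDK or runner actually names
transforms. It shows that counter-based names shift when a single transform
is inserted, so a name from the old graph can end up pointing at a different
step in the new one.

```python
# Hypothetical auto-naming scheme, for illustration only.


def auto_label(transform_kinds):
    """Name each transform by its kind plus a per-kind running counter."""
    counts = {}
    labels = []
    for kind in transform_kinds:
        counts[kind] = counts.get(kind, 0) + 1
        labels.append(f"{kind}_{counts[kind]}")
    return labels


def label_to_step(version):
    """Map each auto-generated label to the step it names."""
    labels = auto_label([kind for kind, _ in version])
    return dict(zip(labels, (step for _, step in version)))


# Version 1 of a made-up pipeline: (transform kind, what the step does).
v1 = [("Map", "parse"), ("Map", "enrich"), ("Filter", "drop_nulls")]
# Version 2 inserts one new Map at the front; nothing else changes.
v2 = [("Map", "normalize")] + v1

old = label_to_step(v1)
new = label_to_step(v2)

# Labels that exist in both versions but now name a different step.
mismatched = {
    label: (old[label], new[label])
    for label in old
    if label in new and old[label] != new[label]
}
print(mismatched)
```

Here `Map_1` and `Map_2` silently change meaning between versions, which is
the kind of mismatch that could corrupt an in-place update. Explicit,
user-chosen labels avoid it because they do not depend on position.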
