On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <ahmedabuals...@google.com> wrote:
> Thank you all for your input. I have a PR for the changes I mentioned in
> my initial email: https://github.com/apache/beam/pull/27202. Please
> review when you get a chance!
>
> > perhaps we should consider just going to something Avro for portable
> > coding rather than something custom
>
> Did you mean using some Avro object (GenericRecord?) besides Beam Row
> elements? We would still run into the problem Cham mentioned earlier (of
> making sure existing PTransform inputs/outputs are compatible with
> cross-language-valid types).

I don't remember why Avro was rejected in favor of our own encoding
format, but it probably doesn't make sense to revisit that without
understanding the full history. I do know that Avro versioning and diamond
dependencies cause a lot of pain for users, and there's a concerted effort
to remove Avro from Beam core altogether. In any case, this is quite
orthogonal to the proposal here, which we should move forward on.

> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <byronel...@google.com>
> wrote:
>
>> Sure, I get that… though perhaps we should consider just going to
>> something Avro for portable coding rather than something custom.
>>
>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>> Input/output PCollection types at least have to be portable Beam types
>>> [1] for cross-language to work.
>>>
>>> I think we restricted schema-aware transforms to PCollection<Row> since
>>> Row was expected to be an efficient replacement for arbitrary portable
>>> Beam types (not sure how true that is in practice currently).
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>
>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Is it actually necessary for a PTransform that is configured via the
>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as
>>>> two separate concerns and unnecessarily limiting.
>>>>
>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>> +1 for the simplification.
>>>>>
>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Yeah. Essentially one needs to (1) name the arguments and (2)
>>>>>> implement the transform. Hopefully (1) could be done in a concise way
>>>>>> that allows for easy consumption from both Java and cross-language.
>>>>>
>>>>> +1, but I think the hard part today is converting existing PTransforms
>>>>> to be schema-aware transform compatible (for example, changing
>>>>> input/output types and making sure parameters take Beam Schema
>>>>> compatible types). But this makes sense for new transforms.
>>>>>
>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <byronel...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Or perhaps the other way around? If you have a Schema, we can
>>>>>>> auto-generate the associated builder on the PTransform? Either way,
>>>>>>> more DRY.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> +1 to this simplification, it's a historical artifact that provides
>>>>>>>> no value.
>>>>>>>>
>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>> SchemaTransformProvider (e.g.
>>>>>>>> via introspection if a transform takes a small number of
>>>>>>>> arguments, or uses the standard builder pattern...), ideally with
>>>>>>>> something as simple as adding a decorator to the PTransform itself.
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>> service. From what I see, there may be a redundant step in
>>>>>>>>> developing SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>> - A configuration object
>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>
>>>>>>>>> The API is generally used like this:
>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>> returns a SchemaTransform
>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according to
>>>>>>>>> the configuration
>>>>>>>>>
>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>> can combine the two steps by having SchemaTransformProvider
>>>>>>>>> return the PTransform directly.
>>>>>>>>>
>>>>>>>>> We can then remove the SchemaTransform class, as it will be
>>>>>>>>> obsolete. This should be safe to do; the only place it's used in
>>>>>>>>> our API is here [3], and that call site can be simplified if we
>>>>>>>>> make this change (we'd just trim `.buildTransform()` off the end,
>>>>>>>>> as `provider.from(configRow)` will directly return the
>>>>>>>>> PTransform).
>>>>>>>>>
>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>> process of this API, so I may be missing some information on why
>>>>>>>>> it was set up this way.
>>>>>>>>>
>>>>>>>>> A few developers have already raised questions about the
>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java
>>>>>>>>> transform portable.
>>>>>>>>> My assumption is that this was designed to follow the pattern of
>>>>>>>>> the previous iteration of this API (SchemaIO):
>>>>>>>>> SchemaIOProvider [4] -> SchemaIO [5] -> PTransform. However, with
>>>>>>>>> the newer SchemaTransformProvider API, we dropped a few methods
>>>>>>>>> and reduced the SchemaTransform class to have a generic
>>>>>>>>> buildTransform() method. See the example of
>>>>>>>>> PubsubReadSchemaTransformProvider [6], where the SchemaTransform
>>>>>>>>> interface and buildTransform() method are implemented just to
>>>>>>>>> satisfy the requirement that SchemaTransformProvider::from return
>>>>>>>>> a SchemaTransform.
>>>>>>>>>
>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>> contributions to cross-language use cases, we should make it
>>>>>>>>> simpler and less convoluted to develop portable transforms.
>>>>>>>>>
>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>> applying these changes to them should be straightforward. If
>>>>>>>>> people think this is a good idea, I can open a PR and implement
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Ahmed
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>> [2]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>> [3]
>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>> [4]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>> [5]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>> [6]
>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
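
[Editor's note] To make the redundancy discussed in this thread concrete, here is a minimal, self-contained sketch of the current three-piece shape versus the proposed two-piece shape. The nested interfaces below are simplified toy stand-ins, not the real Beam classes; they only mirror the method signatures under discussion, and the lambda "providers" are purely illustrative.

```java
// Toy model of the API shapes discussed in the thread above.
// These interfaces are simplified stand-ins, NOT the real Beam types.
public class SchemaTransformSketch {

  // Stand-in for a Beam PTransform.
  interface PTransform {
    String describe();
  }

  // Stand-in for a Beam Row carrying the transform's configuration.
  interface Row {
    Object get(String field);
  }

  // --- Current three-piece shape: Provider -> SchemaTransform -> PTransform.
  interface SchemaTransform {
    PTransform buildTransform();
  }

  interface SchemaTransformProvider {
    SchemaTransform from(Row configuration);
  }

  // --- Proposed two-piece shape: from() returns the PTransform directly.
  interface SimplifiedProvider {
    PTransform from(Row configuration);
  }

  // What the expansion service effectively does today: the intermediate
  // SchemaTransform is created only to be immediately unwrapped.
  static String expandToday(SchemaTransformProvider provider, Row configRow) {
    return provider.from(configRow).buildTransform().describe();
  }

  // After the proposed change, the call site just drops ".buildTransform()".
  static String expandProposed(SimplifiedProvider provider, Row configRow) {
    return provider.from(configRow).describe();
  }

  public static void main(String[] args) {
    Row configRow = field -> "my-" + field;

    // Today: a one-method SchemaTransform wrapper sits between provider
    // and PTransform, existing only to satisfy the return type of from().
    SchemaTransformProvider today =
        cfg -> () -> () -> "read " + cfg.get("topic");

    // Proposed: the provider builds the PTransform directly.
    SimplifiedProvider proposed = cfg -> () -> "read " + cfg.get("topic");

    System.out.println(expandToday(today, configRow));
    System.out.println(expandProposed(proposed, configRow));
  }
}
```

Both paths produce the same PTransform; the only difference is the intermediate wrapper, which is why removing it should be mechanical for existing SchemaTransforms.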