In the same vein of reducing the boilerplate needed to make a schema-aware transform, I've opened a small PR that relaxes the three-method requirement: https://github.com/apache/beam/pull/30560
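For context, here is a rough sketch of the shape of the change. The Beam interfaces are stubbed out locally (`PTransform`, `Row`, and the provider interfaces below are stand-ins, not the actual SDK classes) so the sketch compiles on its own; the real signatures live in the SDK links further down the thread.

```java
// Stand-ins for the Beam classes under discussion, stubbed so this compiles
// without a Beam dependency. Names mirror the SDK but are illustrative only.
interface PTransform { }                          // stand-in for org.apache.beam.sdk.transforms.PTransform
interface Row { String getString(String field); } // stand-in for org.apache.beam.sdk.values.Row

// Before: three pieces -- provider, config, and a SchemaTransform whose only
// job is to hand back the real PTransform.
interface SchemaTransform { PTransform buildTransform(); }
interface ProviderBefore { SchemaTransform from(Row config); }

// After: the provider returns the PTransform directly; the intermediate
// SchemaTransform interface disappears.
interface ProviderAfter { PTransform from(Row config); }

public class SchemaTransformSketch {
  public static void main(String[] args) {
    Row config = field -> "projects/p/topics/t"; // toy configuration row

    // Before: provider -> SchemaTransform -> PTransform (two hops).
    ProviderBefore before = cfg -> () -> new PTransform() {};
    PTransform viaTwoSteps = before.from(config).buildTransform();

    // After: provider -> PTransform (one hop).
    ProviderAfter after = cfg -> new PTransform() {};
    PTransform direct = after.from(config);

    System.out.println(viaTwoSteps != null && direct != null); // prints true
  }
}
```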
Would appreciate someone taking a look!

On Thu, Jun 29, 2023 at 9:43 AM Ahmed Abualsaud <ahmedabuals...@google.com> wrote:

> Can someone take a quick look at https://github.com/apache/beam/pull/27202?
> If things look good, let's try getting it in before the release cut, as I'm
> also updating our cross-language documentation and would like to include
> these changes.
>
> Thank you,
> Ahmed
>
> On Thu, Jun 22, 2023 at 8:06 PM Reuven Lax <re...@google.com> wrote:
>
>> The goal was to make schema transforms as efficient as hand-written
>> coders. We found the Avro encoding/decoding to often be quite inefficient,
>> which is one reason we didn't use it for schemas.
>>
>> Our schema encoding is internal to Beam, though, and not suggested for use
>> external to a pipeline. For sources or sinks we still recommend using Avro
>> (or proto).
>>
>> On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>>> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <ahmedabuals...@google.com> wrote:
>>>
>>>> Thank you all for your input. I have a PR for the changes I mentioned
>>>> in my initial email: https://github.com/apache/beam/pull/27202. Please
>>>> review when you get a chance!
>>>>
>>>> > perhaps we should consider just going to something Avro for portable
>>>> > coding rather than something custom
>>>>
>>>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>>>> elements? We would still run into the problem Cham mentioned earlier (of
>>>> making sure existing PTransform inputs/outputs are compatible with
>>>> cross-language-valid types).
>>>
>>> I don't remember why Avro was rejected in favor of our own encoding
>>> format, but it probably doesn't make sense to revisit that without
>>> understanding the full history. I do know that Avro versioning and diamond
>>> dependencies cause a lot of pain for users, and there's a concerted effort
>>> to remove Avro from Beam core altogether.
>>>
>>> In any case, this is quite orthogonal to the proposal here, which we
>>> should move forward on.
>>>
>>>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <byronel...@google.com> wrote:
>>>>
>>>>> Sure, I get that… though perhaps we should consider just going to
>>>>> something Avro for portable coding rather than something custom.
>>>>>
>>>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>>>
>>>>>> Input/output PCollection types at least have to be portable Beam
>>>>>> types [1] for cross-language to work.
>>>>>>
>>>>>> I think we restricted schema-aware transforms to PCollection<Row>
>>>>>> since Row was expected to be an efficient replacement for arbitrary
>>>>>> portable Beam types (not sure how true that is in practice currently).
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>> [1] https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>>>
>>>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <byronel...@google.com> wrote:
>>>>>>
>>>>>>> Is it actually necessary for a PTransform that is configured via the
>>>>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as
>>>>>>> two separate concerns and unnecessarily limiting.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>>>>>
>>>>>>>> +1 for the simplification.
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Yeah. Essentially one needs to (1) name the arguments and (2)
>>>>>>>>> implement the transform. Hopefully (1) could be done in a concise way
>>>>>>>>> that allows for easy consumption from both Java and cross-language.
>>>>>>>>
>>>>>>>> +1, but I think the hard part today is converting existing
>>>>>>>> PTransforms to be schema-aware transform compatible (for example,
>>>>>>>> changing input/output types and making sure parameters take Beam Schema
>>>>>>>> compatible types). But this makes sense for new transforms.
>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <byronel...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Or perhaps the other way around? If you have a Schema, we can
>>>>>>>>>> auto-generate the associated builder on the PTransform? Either way,
>>>>>>>>>> more DRY.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 to this simplification; it's a historical artifact that
>>>>>>>>>>> provides no value.
>>>>>>>>>>>
>>>>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes
>>>>>>>>>>> a small number of arguments, or uses the standard builder
>>>>>>>>>>> pattern...), ideally with something as simple as adding a decorator
>>>>>>>>>>> to the PTransform itself.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <dev@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>>>> service. From what I see, there may be a redundant step in
>>>>>>>>>>>> developing SchemaTransforms. Currently, we have three pieces:
>>>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>>>> - A configuration object
>>>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>>>
>>>>>>>>>>>> The API is generally used like this:
>>>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>>>>> returns a SchemaTransform
>>>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according
>>>>>>>>>>>> to the configuration
>>>>>>>>>>>>
>>>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>>>>> can combine the two steps if we have SchemaTransformProvider
>>>>>>>>>>>> return the PTransform directly.
>>>>>>>>>>>>
>>>>>>>>>>>> We can then remove the SchemaTransform class, as it will be
>>>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in
>>>>>>>>>>>> our API is here [3], and that call site can be simplified if we
>>>>>>>>>>>> make this change (we'd just trim `.buildTransform()` off the end,
>>>>>>>>>>>> as `provider.from(configRow)` will directly return the PTransform).
>>>>>>>>>>>>
>>>>>>>>>>>> A few developers have already raised questions about the
>>>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java
>>>>>>>>>>>> transform portable. I wasn't involved in the design process of this
>>>>>>>>>>>> API, so I may be missing some information, but my assumption is
>>>>>>>>>>>> that it was designed to follow the pattern of its predecessor
>>>>>>>>>>>> (SchemaIO): SchemaIOProvider [4] -> SchemaIO [5] -> PTransform.
>>>>>>>>>>>> However, with the newer SchemaTransformProvider API, we dropped a
>>>>>>>>>>>> few methods and reduced the SchemaTransform class to a generic
>>>>>>>>>>>> buildTransform() method. See the example of
>>>>>>>>>>>> PubsubReadSchemaTransformProvider [6], where the SchemaTransform
>>>>>>>>>>>> interface and buildTransform method are implemented just to satisfy
>>>>>>>>>>>> the requirement that SchemaTransformProvider::from return a
>>>>>>>>>>>> SchemaTransform.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>>>> contributions to cross-language use cases, we should make it
>>>>>>>>>>>> simpler and less convoluted to develop portable transforms.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>>>> applying these changes to them should be straightforward. If
>>>>>>>>>>>> people think this is a good idea, I can open a PR and implement
>>>>>>>>>>>> them.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ahmed
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>>>> [3] https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>>>> [4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>>>> [5] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>>>> [6] https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
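As a postscript to the auto-generation idea raised in the thread: here is a hedged sketch of how a provider's configuration could be derived by introspecting a plain config class. All names here (`PubsubReadConfig`, `describe`) are hypothetical, not Beam APIs; a real implementation would map these (name, type) pairs onto a Beam Schema and a Row-based constructor rather than strings.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class IntrospectionSketch {
  // The transform author would write only a plain config class like this
  // (hypothetical example, loosely modeled on a Pub/Sub read config):
  static class PubsubReadConfig {
    String topic;
    String format;
  }

  // A generic helper lists "name:type" pairs for the config's fields; an
  // auto-generated provider would build a Schema from them instead.
  static List<String> describe(Class<?> configClass) {
    List<String> fields = new ArrayList<>();
    for (Field f : configClass.getDeclaredFields()) {
      fields.add(f.getName() + ":" + f.getType().getSimpleName());
    }
    Collections.sort(fields); // getDeclaredFields() order is unspecified
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(describe(PubsubReadConfig.class)); // prints [format:String, topic:String]
  }
}
```

This is roughly what "name the arguments" means in Robert's framing: the field names and types are the arguments, and everything else can in principle be generated.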