In the same vein of reducing the boilerplate needed to make a schema-aware transform, I've opened a small PR that relaxes the three-method requirement: https://github.com/apache/beam/pull/30560
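For context, here is a rough sketch of the shape of the change. The Beam interfaces are stubbed out locally (`PTransform`, `Row`, and the provider interfaces below are stand-ins, not the actual SDK classes) so the sketch compiles on its own; the real signatures live in the SDK links further down the thread.

```java
// Stand-ins for the Beam classes under discussion, stubbed so this compiles
// without a Beam dependency. Names mirror the SDK but are illustrative only.
interface PTransform { }                          // stand-in for org.apache.beam.sdk.transforms.PTransform
interface Row { String getString(String field); } // stand-in for org.apache.beam.sdk.values.Row

// Before: three pieces -- provider, config, and a SchemaTransform whose only
// job is to hand back the real PTransform.
interface SchemaTransform { PTransform buildTransform(); }
interface ProviderBefore { SchemaTransform from(Row config); }

// After: the provider returns the PTransform directly; the intermediate
// SchemaTransform interface disappears.
interface ProviderAfter { PTransform from(Row config); }

public class SchemaTransformSketch {
  public static void main(String[] args) {
    Row config = field -> "projects/p/topics/t"; // toy configuration row

    // Before: provider -> SchemaTransform -> PTransform (two hops).
    ProviderBefore before = cfg -> () -> new PTransform() {};
    PTransform viaTwoSteps = before.from(config).buildTransform();

    // After: provider -> PTransform (one hop).
    ProviderAfter after = cfg -> new PTransform() {};
    PTransform direct = after.from(config);

    System.out.println(viaTwoSteps != null && direct != null); // prints true
  }
}
```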
Would appreciate someone taking a look!

On Thu, Jun 29, 2023 at 9:43 AM Ahmed Abualsaud <ahmedabuals...@google.com> wrote:

> Can someone take a quick look at https://github.com/apache/beam/pull/27202?
> If things look good, let's try getting it in before the release cut, as I'm
> also updating our cross-language documentation and would like to include
> these changes.
>
> Thank you,
> Ahmed
>
> On Thu, Jun 22, 2023 at 8:06 PM Reuven Lax <re...@google.com> wrote:
>
>> The goal was to make schema transforms as efficient as hand-written
>> coders. We found the Avro encoding/decoding to often be quite inefficient,
>> which is one reason we didn't use it for schemas.
>>
>> Our schema encoding is internal to Beam, though, and not suggested for use
>> external to a pipeline. For sources or sinks we still recommend using Avro
>> (or proto).
>>
>> On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw <rober...@google.com> wrote:
>>
>>> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <ahmedabuals...@google.com> wrote:
>>>
>>>> Thank you all for your input. I have a PR for the changes I mentioned
>>>> in my initial email: https://github.com/apache/beam/pull/27202. Please
>>>> review when you get a chance!
>>>>
>>>> > perhaps we should consider just going to something Avro for portable
>>>> > coding rather than something custom
>>>>
>>>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>>>> elements? We would still run into the problem Cham mentioned earlier (of
>>>> making sure existing PTransform inputs/outputs are compatible with
>>>> cross-language-valid types).
>>>
>>> I don't remember why Avro was rejected in favor of our own encoding
>>> format, but it probably doesn't make sense to revisit that without
>>> understanding the full history. I do know that Avro versioning and diamond
>>> dependencies cause a lot of pain for users, and there's a concerted effort
>>> to remove Avro from Beam core altogether.
>>>
>>> In any case, this is quite orthogonal to the proposal here, which we
>>> should move forward on.
>>>
>>>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <byronel...@google.com> wrote:
>>>>
>>>>> Sure, I get that… though perhaps we should consider just going to
>>>>> something Avro for portable coding rather than something custom.
>>>>>
>>>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>>>
>>>>>> Input/output PCollection types at least have to be portable Beam
>>>>>> types [1] for cross-language to work.
>>>>>>
>>>>>> I think we restricted schema-aware transforms to PCollection<Row>
>>>>>> since Row was expected to be an efficient replacement for arbitrary
>>>>>> portable Beam types (not sure how true that is in practice currently).
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>> [1] https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>>>
>>>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <byronel...@google.com> wrote:
>>>>>>
>>>>>>> Is it actually necessary for a PTransform that is configured via the
>>>>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as
>>>>>>> two separate concerns and unnecessarily limiting.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>>>>>
>>>>>>>> +1 for the simplification.
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Yeah. Essentially one needs to (1) name the arguments and (2)
>>>>>>>>> implement the transform. Hopefully (1) could be done in a concise way
>>>>>>>>> that allows for easy consumption from both Java and cross-language.
>>>>>>>>
>>>>>>>> +1, but I think the hard part today is converting existing
>>>>>>>> PTransforms to be schema-aware transform compatible (for example,
>>>>>>>> changing input/output types and making sure parameters take Beam Schema
>>>>>>>> compatible types). But this makes sense for new transforms.
>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <byronel...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Or perhaps the other way around? If you have a Schema, we can
>>>>>>>>>> auto-generate the associated builder on the PTransform? Either way,
>>>>>>>>>> more DRY.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 to this simplification; it's a historical artifact that
>>>>>>>>>>> provides no value.
>>>>>>>>>>>
>>>>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes
>>>>>>>>>>> a small number of arguments, or uses the standard builder
>>>>>>>>>>> pattern...), ideally with something as simple as adding a decorator
>>>>>>>>>>> to the PTransform itself.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <dev@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>>>> service. From what I see, there may be a redundant step in
>>>>>>>>>>>> developing SchemaTransforms. Currently, we have three pieces:
>>>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>>>> - A configuration object
>>>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>>>
>>>>>>>>>>>> The API is generally used like this:
>>>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>>>>> returns a SchemaTransform
>>>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according
>>>>>>>>>>>> to the configuration
>>>>>>>>>>>>
>>>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>>>>> can combine the two steps if we have SchemaTransformProvider
>>>>>>>>>>>> return the PTransform directly.
>>>>>>>>>>>>
>>>>>>>>>>>> We can then remove the SchemaTransform class, as it will be
>>>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in
>>>>>>>>>>>> our API is here [3], and that call site can be simplified if we
>>>>>>>>>>>> make this change (we'd just trim `.buildTransform()` off the end,
>>>>>>>>>>>> as `provider.from(configRow)` will directly return the PTransform).
>>>>>>>>>>>>
>>>>>>>>>>>> A few developers have already raised questions about the
>>>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java
>>>>>>>>>>>> transform portable. I wasn't involved in the design process of this
>>>>>>>>>>>> API, so I may be missing some information, but my assumption is
>>>>>>>>>>>> that it was designed to follow the pattern of its predecessor
>>>>>>>>>>>> (SchemaIO): SchemaIOProvider [4] -> SchemaIO [5] -> PTransform.
>>>>>>>>>>>> However, with the newer SchemaTransformProvider API, we dropped a
>>>>>>>>>>>> few methods and reduced the SchemaTransform class to a generic
>>>>>>>>>>>> buildTransform() method. See the example of
>>>>>>>>>>>> PubsubReadSchemaTransformProvider [6], where the SchemaTransform
>>>>>>>>>>>> interface and buildTransform method are implemented just to satisfy
>>>>>>>>>>>> the requirement that SchemaTransformProvider::from return a
>>>>>>>>>>>> SchemaTransform.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>>>> contributions to cross-language use cases, we should make it
>>>>>>>>>>>> simpler and less convoluted to develop portable transforms.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>>>> applying these changes to them should be straightforward. If
>>>>>>>>>>>> people think this is a good idea, I can open a PR and implement
>>>>>>>>>>>> them.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ahmed
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>>>> [2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>>>> [3] https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>>>> [4] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>>>> [5] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>>>> [6] https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
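As a postscript to the auto-generation idea raised in the thread: here is a hedged sketch of how a provider's configuration could be derived by introspecting a plain config class. All names here (`PubsubReadConfig`, `describe`) are hypothetical, not Beam APIs; a real implementation would map these (name, type) pairs onto a Beam Schema and a Row-based constructor rather than strings.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class IntrospectionSketch {
  // The transform author would write only a plain config class like this
  // (hypothetical example, loosely modeled on a Pub/Sub read config):
  static class PubsubReadConfig {
    String topic;
    String format;
  }

  // A generic helper lists "name:type" pairs for the config's fields; an
  // auto-generated provider would build a Schema from them instead.
  static List<String> describe(Class<?> configClass) {
    List<String> fields = new ArrayList<>();
    for (Field f : configClass.getDeclaredFields()) {
      fields.add(f.getName() + ":" + f.getType().getSimpleName());
    }
    Collections.sort(fields); // getDeclaredFields() order is unspecified
    return fields;
  }

  public static void main(String[] args) {
    System.out.println(describe(PubsubReadConfig.class)); // prints [format:String, topic:String]
  }
}
```

This is roughly what "name the arguments" means in Robert's framing: the field names and types are the arguments, and everything else can in principle be generated.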