Sure, I get that… though perhaps we should consider just going to something Avro for portable coding rather than something custom.
On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <chamik...@google.com> wrote: > Input/output PCollection types at least have to be portable Beam types [1] > for cross-language to work. > > I think we restricted schema-aware transforms to PCollection<Row> since > Row was expected to be an efficient replacement for arbitrary portable Beam > types (not sure how true that is in practice currently). > > Thanks, > Cham > > [1] > https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829 > > On Tue, May 30, 2023 at 1:54 PM Byron Ellis <byronel...@google.com> wrote: > >> Is it actually necessary for a PTransform that is configured via the >> Schema mechanism to also be one that uses RowCoder? Those strike me as two >> separate concerns and unnecessarily limiting. >> >> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <chamik...@google.com> >> wrote: >> >>> +1 for the simplification. >>> >>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com> >>> wrote: >>> >>>> Yeah. Essentially one needs do (1) name the arguments and (2) implement >>>> the transform. Hopefully (1) could be done in a concise way that allows for >>>> easy consumption from both Java and cross-language. >>>> >>> >>> +1 but I think the hard part today is to convert existing PTransforms to >>> be schema-aware transform compatible (for example, change input/output >>> types and make sure parameters take Beam Schema compatible types). But this >>> makes sense for new transforms. >>> >>> >>> >>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <byronel...@google.com> >>>> wrote: >>>> >>>>> Or perhaps the other way around? If you have a Schema we can >>>>> auto-generate the associated builder on the PTransform? Either way, more >>>>> DRY. >>>>> >>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev < >>>>> dev@beam.apache.org> wrote: >>>>> >>>>>> +1 to this simplification, it's a historical artifact that provides >>>>>> no value. >>>>>> >>>>>> I would love it if we also looked into ways to auto-generate the >>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a >>>>>> small number of arguments, or uses the standard builder pattern...), >>>>>> ideally with something as simple as adding a decorator to the PTransform >>>>>> itself. >>>>>> >>>>>> >>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev < >>>>>> dev@beam.apache.org> wrote: >>>>>> >>>>>>> Hey everyone, >>>>>>> >>>>>>> I was looking at how we use SchemaTransforms in our expansion >>>>>>> service. From what I see, there may be a redundant step in developing >>>>>>> SchemaTransforms. Currently, we have 3 pieces: >>>>>>> - SchemaTransformProvider [1] >>>>>>> - A configuration object >>>>>>> - SchemaTransform [2] >>>>>>> >>>>>>> The API is generally used like this: >>>>>>> 1. The SchemaTransformProvider takes a configuration object and >>>>>>> returns a SchemaTransform >>>>>>> 2. The SchemaTransform is used to build a PTransform according to >>>>>>> the configuration >>>>>>> >>>>>>> In these steps, the SchemaTransform class seems unnecessary. We can >>>>>>> combine the two steps if we have SchemaTransformProvider return the >>>>>>> PTransform directly. >>>>>>> >>>>>>> We can then remove the SchemaTransform class as it will be obsolete. >>>>>>> This should be safe to do; the only place it's used in our API is here >>>>>>> [3], >>>>>>> and that can be simplified if we make this change (we'd just trim ` >>>>>>> .buildTransform()` off the end as `provider.from(configRow)` will >>>>>>> directly return the PTransform). >>>>>>> >>>>>>> I'd like to first mention that I was not involved in the design >>>>>>> process of this API so I may be missing some information on why it was >>>>>>> set >>>>>>> up this way. >>>>>>> >>>>>>> A few developers already raised questions about how there's >>>>>>> seemingly unnecessary boilerplate involved in making a Java transform >>>>>>> portable. I wasn't involved in the design process of this API so I may >>>>>>> be >>>>>>> missing some information, but my assumption is this was designed to >>>>>>> follow >>>>>>> the pattern of the previous iteration of this API (SchemaIO): >>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with the >>>>>>> newer SchemaTransformProvider API, we dropped a few methods and reduced >>>>>>> the >>>>>>> SchemaTransform class to have a generic buildTransform() method. See the >>>>>>> example of PubsubReadSchemaTransformProvider [6], where the >>>>>>> SchemaTransform interface and buildTransform method are implemented >>>>>>> just to satisfy the requirement that SchemaTransformProvider::from >>>>>>> return a SchemaTransform. >>>>>>> >>>>>>> I'm bringing this up because if we are looking to encourage >>>>>>> contribution to cross-language use cases, we should make it simpler and >>>>>>> less convoluted to develop portable transforms. >>>>>>> >>>>>>> There are a number of SchemaTransforms already developed, but >>>>>>> applying these changes to them should be straightforward. If people >>>>>>> think >>>>>>> this is a good idea, I can open a PR and implement them. >>>>>>> >>>>>>> Best, >>>>>>> Ahmed >>>>>>> >>>>>>> [1] >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java >>>>>>> [2] >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java >>>>>>> [3] >>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138 >>>>>>> [4] >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java >>>>>>> [5] >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java >>>>>>> [6] >>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137 >>>>>>> >>>>>>