On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <ahmedabuals...@google.com> wrote:
> Thank you all for your input. I have a PR for the changes I mentioned in
> my initial email: https://github.com/apache/beam/pull/27202. Please
> review when you get a chance!
>
> > perhaps we should consider just going to something Avro for portable
> > coding rather than something custom
>
> Did you mean using some Avro object (GenericRecord?) besides Beam Row
> elements? We would still run into the problem Cham mentioned earlier (of
> making sure existing PTransform inputs/outputs are compatible with
> cross-language-valid types).

I don't remember why Avro was rejected in favor of our own encoding
format, but it probably doesn't make sense to revisit that without
understanding the full history. I do know that Avro versioning and diamond
dependencies cause a lot of pain for users, and there's a concerted effort
to remove Avro from Beam core altogether. In any case, this is quite
orthogonal to the proposal here, which we should move forward on.

> On Tue, May 30, 2023 at 10:53 PM Byron Ellis <byronel...@google.com>
> wrote:
>
>> Sure, I get that… though perhaps we should consider just going to
>> something Avro for portable coding rather than something custom.
>>
>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>> Input/output PCollection types at least have to be portable Beam types
>>> [1] for cross-language to work.
>>>
>>> I think we restricted schema-aware transforms to PCollection<Row> since
>>> Row was expected to be an efficient replacement for arbitrary portable
>>> Beam types (not sure how true that is in practice currently).
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>
>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Is it actually necessary for a PTransform that is configured via the
>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as
>>>> two separate concerns and unnecessarily limiting.
>>>>
>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>> +1 for the simplification.
>>>>>
>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <rober...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Yeah. Essentially one needs to (1) name the arguments and (2)
>>>>>> implement the transform. Hopefully (1) could be done in a concise way
>>>>>> that allows for easy consumption from both Java and cross-language.
>>>>>
>>>>> +1, but I think the hard part today is converting existing PTransforms
>>>>> to be schema-aware transform compatible (for example, changing
>>>>> input/output types and making sure parameters take Beam Schema
>>>>> compatible types). But this makes sense for new transforms.
>>>>>
>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <byronel...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Or perhaps the other way around? If you have a Schema, we can
>>>>>>> auto-generate the associated builder on the PTransform? Either way,
>>>>>>> more DRY.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> +1 to this simplification, it's a historical artifact that provides
>>>>>>>> no value.
>>>>>>>>
>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>> SchemaTransformProvider (e.g.
>>>>>>>> via introspection if a transform takes a small number of
>>>>>>>> arguments, or uses the standard builder pattern...), ideally with
>>>>>>>> something as simple as adding a decorator to the PTransform itself.
>>>>>>>>
>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>> service. From what I see, there may be a redundant step in
>>>>>>>>> developing SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>> - A configuration object
>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>
>>>>>>>>> The API is generally used like this:
>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>> returns a SchemaTransform
>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according to
>>>>>>>>> the configuration
>>>>>>>>>
>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>> can combine the two steps by having SchemaTransformProvider
>>>>>>>>> return the PTransform directly.
>>>>>>>>>
>>>>>>>>> We can then remove the SchemaTransform class, as it will be
>>>>>>>>> obsolete. This should be safe to do; the only place it's used in
>>>>>>>>> our API is here [3], and that call site can be simplified if we
>>>>>>>>> make this change (we'd just trim `.buildTransform()` off the end,
>>>>>>>>> as `provider.from(configRow)` will directly return the
>>>>>>>>> PTransform).
>>>>>>>>>
>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>> process of this API, so I may be missing some information on why
>>>>>>>>> it was set up this way.
>>>>>>>>>
>>>>>>>>> A few developers have already raised questions about the
>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java
>>>>>>>>> transform portable.
>>>>>>>>> My assumption is that this was designed to follow the pattern of
>>>>>>>>> the previous iteration of this API (SchemaIO):
>>>>>>>>> SchemaIOProvider [4] -> SchemaIO [5] -> PTransform. However, with
>>>>>>>>> the newer SchemaTransformProvider API, we dropped a few methods
>>>>>>>>> and reduced the SchemaTransform class to have a generic
>>>>>>>>> buildTransform() method. See the example of
>>>>>>>>> PubsubReadSchemaTransformProvider [6], where the SchemaTransform
>>>>>>>>> interface and buildTransform() method are implemented just to
>>>>>>>>> satisfy the requirement that SchemaTransformProvider::from return
>>>>>>>>> a SchemaTransform.
>>>>>>>>>
>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>> contributions to cross-language use cases, we should make it
>>>>>>>>> simpler and less convoluted to develop portable transforms.
>>>>>>>>>
>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>> applying these changes to them should be straightforward. If
>>>>>>>>> people think this is a good idea, I can open a PR and implement
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Ahmed
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>> [2]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>> [3]
>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>> [4]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>> [5]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>> [6]
>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
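
[Editor's note] To make the redundancy discussed in this thread concrete, here is a minimal, self-contained sketch of the current three-piece shape versus the proposed two-piece shape. The nested interfaces below are simplified toy stand-ins, not the real Beam classes; they only mirror the method signatures under discussion, and the lambda "providers" are purely illustrative.

```java
// Toy model of the API shapes discussed in the thread above.
// These interfaces are simplified stand-ins, NOT the real Beam types.
public class SchemaTransformSketch {

  // Stand-in for a Beam PTransform.
  interface PTransform {
    String describe();
  }

  // Stand-in for a Beam Row carrying the transform's configuration.
  interface Row {
    Object get(String field);
  }

  // --- Current three-piece shape: Provider -> SchemaTransform -> PTransform.
  interface SchemaTransform {
    PTransform buildTransform();
  }

  interface SchemaTransformProvider {
    SchemaTransform from(Row configuration);
  }

  // --- Proposed two-piece shape: from() returns the PTransform directly.
  interface SimplifiedProvider {
    PTransform from(Row configuration);
  }

  // What the expansion service effectively does today: the intermediate
  // SchemaTransform is created only to be immediately unwrapped.
  static String expandToday(SchemaTransformProvider provider, Row configRow) {
    return provider.from(configRow).buildTransform().describe();
  }

  // After the proposed change, the call site just drops ".buildTransform()".
  static String expandProposed(SimplifiedProvider provider, Row configRow) {
    return provider.from(configRow).describe();
  }

  public static void main(String[] args) {
    Row configRow = field -> "my-" + field;

    // Today: a one-method SchemaTransform wrapper sits between provider
    // and PTransform, existing only to satisfy the return type of from().
    SchemaTransformProvider today =
        cfg -> () -> () -> "read " + cfg.get("topic");

    // Proposed: the provider builds the PTransform directly.
    SimplifiedProvider proposed = cfg -> () -> "read " + cfg.get("topic");

    System.out.println(expandToday(today, configRow));
    System.out.println(expandProposed(proposed, configRow));
  }
}
```

Both paths produce the same PTransform; the only difference is the intermediate wrapper, which is why removing it should be mechanical for existing SchemaTransforms.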