+1 in general (I will dig into details too, but later) I will add a couple more steps that I dislike, which exist only for historical obsolete reasons:
The whole existence of runners/core-construction-java and everything in it. The main reason it exists is to make it so that main SDK has no dependency on protobuf. We thought maybe the portable model would be generic enough to have multiple possible encodings. That idea is obsolete. We are using protobuf. So that whole module, everything about generating proto for transforms, could be in the core SDK to remove the need for the service loader pattern, which often breaks when users build uber jars, and just generally is overused and needless complexity here. Not sure how this relates in detail to your proposal, but I just wanted to weigh in with a big +1 to making it easy to make a portable schema-based transform by default. Imagine a world where the Java-centric API is actually a wrapper on the schema transform! Kenn On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <dev@beam.apache.org> wrote: > Hey everyone, > > I was looking at how we use SchemaTransforms in our expansion service. > From what I see, there may be a redundant step in developing > SchemaTransforms. Currently, we have 3 pieces: > - SchemaTransformProvider [1] > - A configuration object > - SchemaTransform [2] > > The API is generally used like this: > 1. The SchemaTransformProvider takes a configuration object and returns a > SchemaTransform > 2. The SchemaTransform is used to build a PTransform according to the > configuration > > In these steps, the SchemaTransform class seems unnecessary. We can > combine the two steps if we have SchemaTransformProvider return the > PTransform directly. > > We can then remove the SchemaTransform class as it will be obsolete. This > should be safe to do; the only place it's used in our API is here [3], and > that can be simplified if we make this change (we'd just trim ` > .buildTransform()` off the end as `provider.from(configRow)` will > directly return the PTransform). > > I'd like to first mention that I was not involved in the design process of > this API so I may be missing some information on why it was set up this way. > > A few developers already raised questions about how there's seemingly > unnecessary boilerplate involved in making a Java transform portable. I > wasn't involved in the design process of this API so I may be missing some > information, but my assumption is this was designed to follow the pattern > of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] -> > SchemaIO[5] -> PTransform. However, with the newer > SchemaTransformProvider API, we dropped a few methods and reduced the > SchemaTransform class to have a generic buildTransform() method. See the > example of PubsubReadSchemaTransformProvider [6], where the > SchemaTransform interface and buildTransform method are implemented just > to satisfy the requirement that SchemaTransformProvider::from return a > SchemaTransform. > > I'm bringing this up because if we are looking to encourage contribution > to cross-language use cases, we should make it simpler and less convoluted > to develop portable transforms. > > There are a number of SchemaTransforms already developed, but applying > these changes to them should be straightforward. If people think this is a > good idea, I can open a PR and implement them. > > Best, > Ahmed > > [1] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java > [2] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java > [3] > https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138 > [4] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java > [5] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java > [6] > https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137 >