On Sat, Jul 13, 2019 at 10:35 AM Chad Dombrova <[email protected]> wrote:
>> Cross-language support for PubSub is not yet implemented but it can be
>> done similarly to ReadFromKafka. There are still some limitations
>> regarding the coders, i.e. only coders which are available in both the
>> Java and the Python SDK (standard coders) can be used.
>
> Yeah, I was just looking through the source and noticed a few things
> right off the bat:
>
>    - expansion service needs to be passed as an arg to each external
>    xform
>       - why not make this part of the pipeline options? does it really
>       need to vary from transform to transform?

It's possible for the same pipeline to connect to multiple expansion
services, to use transforms from more than one SDK language and/or
version. SDK version support is not there yet. Currently we assume that
whatever SDK (of the language of the cross-language transform) the runner
uses will be compatible with the SDK that was used during expansion of the
cross-language transform. (A short sketch of the current per-transform
plumbing appears at the end of this message.)

>    - explicit coders need to be passed to each external xform for each
>    item to be serialized, key and value coders provided separately
>       - in python we get auto-detection of coders based on type hints or
>       data type, including compound data types (e.g. Tuple[int, str,
>       Dict[str, float]])
>       - in python we also have a fallback to the pickle coder for
>       complex types without builtin coders. is the pickle coder
>       supported by java?
>       - is there a way to express compound java coders as a string?
>       - why not pass the results in and out of the java xform using
>       bytestrings, and then use python-based coders in python?

Are you talking about key/value coders of the Kafka external transform?
The story of coders is a bit complicated for cross-language transforms.
Even if we get a bytestring from Java, how can we make sure that it is
processable in Python? For example, it might be a serialized Java object.
Currently, coders used at the language boundary have to be either standard
coders explicitly defined in the runner API proto [1] or coders of a
language-neutral format that is compatible across SDKs (for example, Avro
or proto). Hopefully the portable schema proposal [2] will simplify things
and give us a better way to define types of PCollections that are
identifiable across languages. (See the second sketch at the end of this
message for the Python-side coder inference in question.)

>> As of now the user experience is a bit rough, but we will be improving
>> that very soon. Happy to help out if you want to contribute a
>> cross-language ReadFromPubSub.
>
> We're pretty locked in to Flink, thus adopting Kafka or PubSub is going
> to be a requirement, so it looks like we're going the external transform
> route either way. I'd love to hear more about A) what the other
> limitations of external transforms are, and B) what you have planned to
> improve the UX. I'm sure we can find something to contribute!

This is great, and contributions are welcome.

BTW Max and others, do you think it would help to add an expanded section
on cross-language transforms to [3] that better describes the current
status and future roadmap of cross-language transform support for the
various SDKs and runners?

Thanks,
Cham

[1] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L552
[2] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit
[3] https://beam.apache.org/roadmap/connectors-multi-sdk/

> -chad
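
A minimal sketch of the per-transform expansion service plumbing from the
Python side, based on the ReadFromKafka external transform discussed
above. The module path, parameter names, and the localhost:8097 address
are assumptions that may differ between Beam versions; this is not a
definitive API:

    # Sketch only: exact module path and signature may vary by version.
    # The expansion service address and Kafka settings are placeholders.
    import apache_beam as beam
    from apache_beam.io.external.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(['--runner=PortableRunner',
                               '--job_endpoint=localhost:8099'])
    with beam.Pipeline(options=options) as p:
        _ = (
            p
            # Today the expansion service is passed per transform; it
            # cannot be set once for the whole pipeline via options.
            | 'ReadFromKafka' >> ReadFromKafka(
                consumer_config={'bootstrap.servers': 'localhost:9092'},
                topics=['my_topic'],
                expansion_service='localhost:8097')
            | 'Print' >> beam.Map(print))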
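
And a sketch of the Python-side coder inference mentioned above: compound
types built from standard coders stay portable across the language
boundary, while unregistered types fall back to the pickle-based
FastPrimitivesCoder, which the Java SDK cannot decode. This assumes the
public coder registry and typehints modules; output details vary by
version:

    # Sketch: inspect which coder Python would infer for a given type.
    from apache_beam.coders import registry
    from apache_beam.typehints import typehints

    # A compound type hint composes the coders of its components, which
    # are standard and therefore usable across SDKs:
    print(registry.get_coder(typehints.Tuple[int, str]))

    class MyClass(object):
        pass

    # An unregistered type falls back to the pickle-based coder, which
    # only the Python SDK can decode:
    print(registry.get_coder(MyClass))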
