Hi Chamikara,
>> - why not make this part of the pipeline options? does it really
>> need to vary from transform to transform?
>
> It's possible for the same pipeline to connect to multiple expansion
> services, to use transforms from more than one SDK language and/or version.

There are only 3 languages supported, so excluding the user's chosen language we're only talking about 2 options (i.e. for Python, they're Java and Go). The reality is that Java provides the superset of all the IO functionality of Go and Python, and the addition of external transforms is only going to discourage the addition of more native IO transforms in Python and Go (which is ultimately a good thing!). So it seems like a poor UX choice to make users provide the expansion service to every single external IO transform when the reality is that 99.9% of the time it'll be the same URL for any given pipeline. Correct me if I'm wrong, but in a production scenario the expansion service would not be the current default, localhost:8097, correct? That means users would need to always specify this arg.

Here's an alternate proposal: instead of providing the expansion service as a URL in a transform's __init__ arg, i.e. expansion_service='localhost:8097', make it a symbolic name, like expansion_service='java' (an external transform is bound to a particular source SDK, e.g. KafkaIO is bound to Java, so this default seems reasonable to me). Then provide a pipeline option to specify the URL of an expansion service alias in the form alias@url (e.g. --expansion_service=java@myexpansionservice:8097).

> Are you talking about key/value coders of the Kafka external transform?
> The story of coders is a bit complicated for cross-language transforms. Even if
> we get a bytestring from Java, how can we make sure that that is
> processable in Python? For example, it might be a serialized Java object.

IIUC, it's not as if you support that with the current design, do you?
If it's a Java object that your native IO transform decodes in Java, then how are you going to get that to Python? Presumably the reason it's encoded as a Java object is that it can't be represented using a cross-language coder. On the other hand, if I'm authoring a Beam pipeline in Python using an external transform like PubSubIO, then it's desirable for me to write a pickled Python object with WriteToPubSub and get that back from ReadFromPubSub in another Python-based pipeline. In other words, when it comes to coders, it seems we should be favoring the language that is *using* the external transform, rather than the native language of the transform itself.

All of that said, it occurs to me that for ReadFromPubSub, we do explicit decoding in a subsequent transform rather than as part of ReadFromPubSub, so I'm confused why ReadFromKafka needs to know about coders at all. Is that behavior specific to Kafka?

> This is great and contributions are welcome. BTW Max and others, do you
> think it will help to add an expanded roadmap on cross-language transforms
> to [3] that will better describe the current status and future roadmap of
> cross-language transform support for various SDKs and runners?

More info would be great. I've started looking at the changes required to make KafkaIO work as an external transform and I have a number of questions already. I'll probably start asking questions on specific lines of this old PR <https://github.com/apache/beam/pull/8251/files> unless you'd like me to use a different forum.

thanks,
-chad
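P.S. To make the alias proposal above a bit more concrete, here's a rough sketch of how the alias resolution could work. None of these function names exist in Beam today; this is just illustrative pseudocode of the idea, assuming the pipeline option may be repeated once per alias:

```python
def parse_expansion_service_options(values):
    """Parse repeated --expansion_service=alias@url options into a dict."""
    aliases = {}
    for value in values:
        alias, _, url = value.partition('@')
        if not alias or not url:
            raise ValueError('Expected alias@url, got %r' % value)
        aliases[alias] = url
    return aliases


def resolve_expansion_service(expansion_service, aliases,
                              default='localhost:8097'):
    """Resolve a transform's expansion_service arg to a concrete URL.

    A value containing ':' is treated as an explicit host:port URL (the
    current behavior); otherwise it's a symbolic name looked up in the
    alias map built from pipeline options.
    """
    if ':' in expansion_service:
        return expansion_service
    return aliases.get(expansion_service, default)


aliases = parse_expansion_service_options(['java@myexpansionservice:8097'])
print(resolve_expansion_service('java', aliases))       # myexpansionservice:8097
print(resolve_expansion_service('localhost:9090', aliases))  # localhost:9090
```

So ReadFromKafka could default to expansion_service='java' and users would only ever set the URL once, on the command line, rather than in every transform.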
