Hi Chamikara,
>> - why not make this part of the pipeline options? does it really
>> need to vary from transform to transform?
>
> It's possible for the same pipeline to connect to multiple expansion
> services, to use transforms from more than one SDK language and/or version.

There are only 3 languages supported, so excluding the user's chosen language we're only talking about 2 options (i.e. for Python, they're Java and Go). The reality is that Java provides the superset of all the IO functionality of Go and Python, and the addition of external transforms is only going to discourage the addition of more native IO transforms in Python and Go (which is ultimately a good thing!). So it seems like a poor UX choice to make users provide the expansion service to every single external IO transform when the reality is that 99.9% of the time it'll be the same URL for any given pipeline. Correct me if I'm wrong, but in a production scenario the expansion service would not be the current default, localhost:8097, correct? That means users would need to always specify this arg.

Here's an alternate proposal: instead of providing the expansion service as a URL in a transform's __init__ arg, i.e. expansion_service='localhost:8097', make it a symbolic name, like expansion_service='java' (an external transform is bound to a particular source SDK, e.g. KafkaIO is bound to Java, so this default seems reasonable to me). Then provide a pipeline option to specify the URL of an expansion service alias in the form alias@url (e.g. --expansion_service=java@myexpansionservice:8097).

> Are you talking about key/value coders of the Kafka external transform?
> The story of coders is a bit complicated for cross-language transforms. Even if
> we get a bytestring from Java, how can we make sure that that is
> processable in Python? For example, it might be a serialized Java object.

IIUC, it's not as if you support that with the current design, do you?
If it's a Java object that your native IO transform decodes in Java, then how are you going to get that to Python? Presumably the reason it's encoded as a Java object is that it can't be represented using a cross-language coder. On the other hand, if I'm authoring a Beam pipeline in Python using an external transform like PubSubIO, then it's desirable for me to write a pickled Python object with WriteToPubSub and get that back from ReadFromPubSub in another Python-based pipeline. In other words, when it comes to coders, it seems we should be favoring the language that is *using* the external transform, rather than the native language of the transform itself.

All of that said, it occurs to me that for ReadFromPubSub, we do explicit decoding in a subsequent transform rather than as part of ReadFromPubSub, so I'm confused why ReadFromKafka needs to know about coders at all. Is that behavior specific to Kafka?

> This is great and contributions are welcome. BTW Max and others, do you
> think it will help to add an expanded roadmap on cross-language transforms
> to [3] that will better describe the current status and future roadmap of
> cross-language transform support for various SDKs and runners?

More info would be great. I've started looking at the changes required to make KafkaIO work as an external transform and I have a number of questions already. I'll probably start asking questions on specific lines of this old PR <https://github.com/apache/beam/pull/8251/files> unless you'd like me to use a different forum.

thanks,
-chad
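P.S. To make the alias proposal above a bit more concrete, here's a rough sketch of how the alias resolution could work. None of these function names exist in Beam today; this is just illustrative pseudocode of the idea, assuming the pipeline option may be repeated once per alias:

```python
def parse_expansion_service_options(values):
    """Parse repeated --expansion_service=alias@url options into a dict."""
    aliases = {}
    for value in values:
        alias, _, url = value.partition('@')
        if not alias or not url:
            raise ValueError('Expected alias@url, got %r' % value)
        aliases[alias] = url
    return aliases


def resolve_expansion_service(expansion_service, aliases,
                              default='localhost:8097'):
    """Resolve a transform's expansion_service arg to a concrete URL.

    A value containing ':' is treated as an explicit host:port URL (the
    current behavior); otherwise it's a symbolic name looked up in the
    alias map built from pipeline options.
    """
    if ':' in expansion_service:
        return expansion_service
    return aliases.get(expansion_service, default)


aliases = parse_expansion_service_options(['java@myexpansionservice:8097'])
print(resolve_expansion_service('java', aliases))       # myexpansionservice:8097
print(resolve_expansion_service('localhost:9090', aliases))  # localhost:9090
```

So ReadFromKafka could default to expansion_service='java' and users would only ever set the URL once, on the command line, rather than in every transform.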
