On Sat, Jul 13, 2019 at 10:35 AM Chad Dombrova <[email protected]> wrote:
>> Cross-language support for PubSub is not yet implemented but it can be
>> done similarly to ReadFromKafka. There are still some limitations
>> regarding the coders, i.e. only coders which are available in both the
>> Java and the Python SDK (standard coders) can be used.
>
> Yeah, I was just looking through the source and noticed a few things
> right off the bat:
>
>    - expansion service needs to be passed as an arg to each external
>    xform
>       - why not make this part of the pipeline options? does it really
>       need to vary from transform to transform?

It's possible for the same pipeline to connect to multiple expansion
services, to use transforms from more than one SDK language and/or
version. SDK version support is not there yet. Currently we assume that
whatever SDK (of the language of the cross-language transform) the runner
uses will be compatible with the SDK that was used during expansion of the
cross-language transform. (A short sketch of the current per-transform
plumbing appears at the end of this message.)

>    - explicit coders need to be passed to each external xform for each
>    item to be serialized, key and value coders provided separately
>       - in python we get auto-detection of coders based on type hints or
>       data type, including compound data types (e.g. Tuple[int, str,
>       Dict[str, float]])
>       - in python we also have a fallback to the pickle coder for
>       complex types without builtin coders. is the pickle coder
>       supported by java?
>       - is there a way to express compound java coders as a string?
>       - why not pass the results in and out of the java xform using
>       bytestrings, and then use python-based coders in python?

Are you talking about key/value coders of the Kafka external transform?
The story of coders is a bit complicated for cross-language transforms.
Even if we get a bytestring from Java, how can we make sure that it is
processable in Python? For example, it might be a serialized Java object.
Currently, coders used at the language boundary have to be either standard
coders explicitly defined in the runner API proto [1] or coders of a
language-neutral format that is compatible across SDKs (for example, Avro
or proto). Hopefully the portable schema proposal [2] will simplify things
and give us a better way to define types of PCollections that are
identifiable across languages. (See the second sketch at the end of this
message for the Python-side coder inference in question.)

>> As of now the user experience is a bit rough, but we will be improving
>> that very soon. Happy to help out if you want to contribute a
>> cross-language ReadFromPubSub.
>
> We're pretty locked in to Flink, thus adopting Kafka or PubSub is going
> to be a requirement, so it looks like we're going the external transform
> route either way. I'd love to hear more about A) what the other
> limitations of external transforms are, and B) what you have planned to
> improve the UX. I'm sure we can find something to contribute!

This is great, and contributions are welcome.

BTW Max and others, do you think it would help to add an expanded section
on cross-language transforms to [3] that better describes the current
status and future roadmap of cross-language transform support for the
various SDKs and runners?

Thanks,
Cham

[1] https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L552
[2] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit
[3] https://beam.apache.org/roadmap/connectors-multi-sdk/

> -chad
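
A minimal sketch of the per-transform expansion service plumbing from the
Python side, based on the ReadFromKafka external transform discussed
above. The module path, parameter names, and the localhost:8097 address
are assumptions that may differ between Beam versions; this is not a
definitive API:

    # Sketch only: exact module path and signature may vary by version.
    # The expansion service address and Kafka settings are placeholders.
    import apache_beam as beam
    from apache_beam.io.external.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(['--runner=PortableRunner',
                               '--job_endpoint=localhost:8099'])
    with beam.Pipeline(options=options) as p:
        _ = (
            p
            # Today the expansion service is passed per transform; it
            # cannot be set once for the whole pipeline via options.
            | 'ReadFromKafka' >> ReadFromKafka(
                consumer_config={'bootstrap.servers': 'localhost:9092'},
                topics=['my_topic'],
                expansion_service='localhost:8097')
            | 'Print' >> beam.Map(print))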
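
And a sketch of the Python-side coder inference mentioned above: compound
types built from standard coders stay portable across the language
boundary, while unregistered types fall back to the pickle-based
FastPrimitivesCoder, which the Java SDK cannot decode. This assumes the
public coder registry and typehints modules; output details vary by
version:

    # Sketch: inspect which coder Python would infer for a given type.
    from apache_beam.coders import registry
    from apache_beam.typehints import typehints

    # A compound type hint composes the coders of its components, which
    # are standard and therefore usable across SDKs:
    print(registry.get_coder(typehints.Tuple[int, str]))

    class MyClass(object):
        pass

    # An unregistered type falls back to the pickle-based coder, which
    # only the Python SDK can decode:
    print(registry.get_coder(MyClass))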
