String UTF8 was recently added as a "standard coder" URN in the protos,
but I don't think it has been implemented beyond Java yet, so adding it to
Python would be reasonable in my opinion.

The Go SDK presently handles strings as "custom coders", which for Go are
always length-prefixed (and reported to the Runner as LP+CustomCoder). It
would be straightforward to add the correct handling for strings, as Go
natively treats strings as UTF8.
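
To make the discrepancy concrete, here is a rough Python sketch of the two
encodings as they are described in the quoted message below. This is not the
actual SDK code: the function names are made up for illustration, and I am
assuming the length prefix is the usual base-128 varint of the UTF-8 byte
count.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, low bits first, high bit = 'more'."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)  # more bytes follow
        else:
            out.append(bits)         # last byte
            return bytes(out)


def encode_raw(s: str) -> bytes:
    """Hypothetical 'raw' scheme: just the UTF-8 bytes, no prefix."""
    return s.encode("utf-8")


def encode_length_prefixed(s: str) -> bytes:
    """Hypothetical 'prefixed' scheme: varint byte length, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_varint(len(data)) + data


def decode_length_prefixed(buf: bytes) -> str:
    """Inverse of encode_length_prefixed (assumes a well-formed buffer)."""
    length, shift, pos = 0, 0, 0
    while True:
        b = buf[pos]
        pos += 1
        length |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return buf[pos:pos + length].decode("utf-8")
```

A reader expecting one scheme will misparse the other: with the prefixed
scheme the first byte(s) are a length, with the raw scheme they are already
string data.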


On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <heej...@google.com> wrote:

> Hi all,
>
> It looks like the UTF-8 string coders in the Java and Python SDKs use
> different encoding schemes. StringUtf8Coder in the Java SDK puts the varint
> length of the input string before the actual data bytes, whereas
> StrUtf8Coder in the Python SDK directly encodes the input string to bytes.
> For the last few weeks, I've been testing and fixing cross-language IO
> transforms, and this discrepancy is a major blocker for me. IMO, we should
> unify the encoding schemes of UTF-8 strings across the different SDKs and
> make it a standard coder. Any thoughts?
>
> Thanks,
>
