[
https://issues.apache.org/jira/browse/AVRO-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240470#comment-14240470
]
Aaron Kimball commented on AVRO-1618:
-------------------------------------
So, are you proposing making something that works without integrating with
DatumReader, i.e., a standalone decoder of a stream of UTF-8 characters?
The python {{BinaryDecoder}} isn't actually aware of a reader or writer schema;
that's the job of the {{DatumReader}}. The BinaryDecoder receives as an
attribute only a stream which supports {{read}}, {{seek}} and {{tell}}; its
methods only provide for the deserialization of individual strings,
integers/longs, booleans and floats. There are no "objects" per se returned by
this level of the API... This is why I'm having a harder time envisioning how
to make this a "compatible" API without pulling virtually all of io.py apart.
The DatumReader implementation is written at a very low level and seems
particularly specialized to the binary encoding.
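For concreteness, the primitive-level reads that {{BinaryDecoder}} performs need no schema awareness at all. Here is my own minimal re-sketch of the zig-zag/varint long and length-prefixed string reads (not the actual io.py code, just an illustration of the level this API operates at):

```python
import io

def read_long(stream):
    """Decode one zig-zag varint long, as Avro's binary encoding specifies."""
    b = ord(stream.read(1))
    n = b & 0x7F
    shift = 7
    while b & 0x80:  # high bit set means more bytes follow
        b = ord(stream.read(1))
        n |= (b & 0x7F) << shift
        shift += 7
    # undo zig-zag: interleaved non-negative/negative values
    return (n >> 1) ^ -(n & 1)

def read_utf8(stream):
    """An Avro string is a long length prefix followed by UTF-8 bytes."""
    length = read_long(stream)
    return stream.read(length).decode("utf-8")

# e.g. length 3 zig-zag-encodes to 0x06, followed by the raw bytes
value = read_utf8(io.BytesIO(b"\x06bar"))
```

Note that nothing here returns a record or touches a schema; assembling primitives into datums is entirely the DatumReader's job.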
I tried looking at the *Decoders in Java... they have some extra methods which
handle some other, higher-level types, but the lack of comments there makes it
difficult for me to understand what's going on for the complex types in these
implementations as well. (I get lost in all the interplay between
JsonDecoder/BinaryDecoder and the superclasses ResolvingDecoder and
ValidatingDecoder.)
Thoughts?
I think I would be ok with a JsonDecoder that doesn't work with / require
DatumReader. (Or maybe call it JsonDatumReader?)
Also, looking at the Avro spec at
http://avro.apache.org/docs/current/spec.html#json_encoding it seems a bit
underspecified as regards what a "stream of records" in JSON encoding looks
like. I could imagine:
(a) a series of records independently encoded as UTF-8 JSON and concatenated;
e.g. like this:
{code}{ "foo" : "bar1" }{ "foo" : "bar2" }{code}
(b) a series of records rendered as a JSON array:
{code}[{ "foo" : "bar1" }, { "foo" : "bar2" }]{code}
(c) something else entirely; e.g. {{\n}}-delimited records.
How is this implemented in other languages? Since the spec does not include any
inter-datum tokens in the binary encoding either, I assume that binary datums are
just concatenated, and that the same holds for JSON (i.e., mechanism "a"
above). Is this intuition correct?
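For what it's worth, mechanism (a) is straightforward to consume with the stock json module; a rough sketch (the helper name is mine, not anything in Avro):

```python
import json

def iter_concatenated_json(text):
    """Yield successive datums from concatenated JSON texts (mechanism "a")."""
    decoder = json.JSONDecoder()
    idx, n = 0, len(text)
    while idx < n:
        # tolerate whitespace between datums
        while idx < n and text[idx].isspace():
            idx += 1
        if idx >= n:
            break
        # raw_decode parses one JSON value and reports where it ended
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

records = list(iter_concatenated_json('{ "foo" : "bar1" }{ "foo" : "bar2" }'))
```

If this intuition about the encoding is right, a standalone JsonDecoder could be built on exactly this kind of incremental parse.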
> Allow user to "clean up" unions into more conventional dicts in json encoding
> -----------------------------------------------------------------------------
>
> Key: AVRO-1618
> URL: https://issues.apache.org/jira/browse/AVRO-1618
> Project: Avro
> Issue Type: Improvement
> Components: python
> Affects Versions: 1.7.7
> Reporter: Aaron Kimball
> Assignee: Aaron Kimball
> Attachments: avro-1618.1.patch
>
>
> In Avro's JSON encoding, unions are implemented in a tagged fashion; walking
> through this data structure is somewhat cumbersome. It would be good to have
> a way of "decoding" this tagged-union data structure into a more conventional
> dict where the union element is directly present without the tag.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)