[ 
https://issues.apache.org/jira/browse/AVRO-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240470#comment-14240470
 ] 

Aaron Kimball commented on AVRO-1618:
-------------------------------------

So, are you proposing making something that works without integrating with 
DatumReader; i.e. a standalone decoder of a stream of UTF-8 characters?

The Python {{BinaryDecoder}} isn't actually aware of a reader or writer schema; 
that's the job of the {{DatumReader}}. The BinaryDecoder receives as an 
attribute only a stream which supports {{read}}, {{seek}} and {{tell}}; its 
methods only provide for the deserialization of individual strings, 
integers/longs, booleans and floats. There are no "objects" per se returned by 
this level of the API... This is why I'm having a harder time envisioning how 
to make this a "compatible" API without pulling virtually all of io.py apart. 
The DatumReader implementation is written at a very low level and seems 
particularly specialized to the binary encoding.
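
To make the layering concrete, here is a rough sketch of what I mean by a schema-unaware decoder. It is not the actual class from io.py; {{TinyBinaryDecoder}} and its method bodies are my own illustration (written for Python 3), assuming only the spec's zig-zag varint encoding for ints/longs and length-prefixed UTF-8 for strings. Note that every method returns a single primitive and nothing here ever sees a schema:

```python
import io
import struct


class TinyBinaryDecoder:
    """Hypothetical stand-in for a schema-unaware decoder; not Avro's class."""

    def __init__(self, reader):
        # Any file-like object with read(n) works for this sketch.
        self.reader = reader

    def read_long(self):
        # Ints and longs: variable-length, zig-zag encoded.
        b = self.reader.read(1)[0]
        n = b & 0x7F
        shift = 7
        while b & 0x80:  # high bit set means another byte follows
            b = self.reader.read(1)[0]
            n |= (b & 0x7F) << shift
            shift += 7
        return (n >> 1) ^ -(n & 1)  # undo the zig-zag mapping

    def read_boolean(self):
        return self.reader.read(1) == b"\x01"

    def read_float(self):
        # 4 bytes, little-endian IEEE 754.
        return struct.unpack("<f", self.reader.read(4))[0]

    def read_utf8(self):
        # A long giving the byte length, then that many UTF-8 bytes.
        length = self.read_long()
        return self.reader.read(length).decode("utf-8")


# Usage: the caller (DatumReader, in Avro) decides which method to call next.
decoder = TinyBinaryDecoder(io.BytesIO(b"\x06bar"))
name = decoder.read_utf8()
```

Deciding which of these methods to call, and in what order, is exactly the part that lives in the {{DatumReader}}.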

I tried looking at the *Decoders in Java. They have some extra methods which 
speak to some other higher-level types, but the lack of comments there makes it 
difficult for me to understand what's going on for the complex types in these 
implementations as well. (I get lost in all the interplay between 
JsonDecoder/BinaryDecoder and the superclasses ResolvingDecoder and 
ValidatingDecoder.)

Thoughts?

I think I would be ok with a JsonDecoder that doesn't work with / require 
DatumReader. (or maybe call it JsonDatumReader?)

Also, looking at the Avro spec at 
http://avro.apache.org/docs/current/spec.html#json_encoding it seems a bit 
underspecified as regards what a "stream of records" in JSON encoding looks 
like. I could imagine:

(a) a series of records independently encoded as UTF-8 JSON and concatenated; 
e.g. like this:
{code}{ "foo" : "bar1" }{ "foo" : "bar2" }{code}

(b) a series of records rendered as a JSON array:
{code}[{ "foo" : "bar1" }, { "foo" : "bar2" }]{code}

(c) something else entirely; e.g. {{\n}}-delimited records.

How is this implemented in other languages? Since the spec does not include any 
inter-datum tokens in the binary encoding either, I assume that binary datums 
are just concatenated, and that the same would hold for JSON (i.e., mechanism 
"a" above). Is this intuition correct?
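
For what it's worth, mechanism "a" is not hard to read back in Python. A minimal sketch, assuming the whole input is already in memory as a decoded string; {{iter_concatenated_json}} is a name I made up, while {{json.JSONDecoder.raw_decode}} is the standard library call that returns a parsed value plus the index where it stopped:

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Parse one value starting at pos; pos becomes the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value


records = list(iter_concatenated_json('{ "foo" : "bar1" }{ "foo" : "bar2" }'))
```

Because whitespace between values is skipped, the same loop would also accept {{\n}}-delimited records, so it covers one reading of "c" as well. It does not handle the array form in "b".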


> Allow user to "clean up" unions into more conventional dicts in json encoding
> -----------------------------------------------------------------------------
>
>                 Key: AVRO-1618
>                 URL: https://issues.apache.org/jira/browse/AVRO-1618
>             Project: Avro
>          Issue Type: Improvement
>          Components: python
>    Affects Versions: 1.7.7
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: avro-1618.1.patch
>
>
> In Avro's JSON encoding, unions are implemented in a tagged fashion; walking 
> through this data structure is somewhat cumbersome. It would be good to have 
> a way of "decoding" this tagged-union data structure into a more conventional 
> dict where the union element is directly present without the tag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
