Hi,

The JSON encoding in the specification
<https://avro.apache.org/docs/current/spec.html#json_encoding> requires an
explicit type name for every union value other than null. This means that
a JSON-encoded Avro value containing a union is very rarely directly
compatible with normal JSON formats.

For example, it's very common for a JSON format to allow a value that's
either null or a string. In Avro, that's trivially expressed as the
union type ["null", "string"]. With conventional JSON, a string value "foo"
would be encoded just as "foo", which is easily distinguished from null
when decoding. However, when using the Avro JSON encoding it must be
encoded as {"string": "foo"}.

This means that Avro JSON-encoded values don't interchange easily with
other JSON-encoded values.

AFAICS the main reason that the type name is always required in
JSON-encoded unions is to avoid ambiguity. This particularly applies to
record and map types, where it's not possible in general to tell which
member of the union has been specified by looking at the data itself.
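
For instance (schemas invented here purely for illustration), take the union

    [{"type": "map", "values": "string"},
     {"type": "record", "name": "Pair",
      "fields": [{"name": "a", "type": "string"}]}]

The plain JSON text {"a": "b"} is a valid encoding of either branch, so
without the type-name wrapper a decoder has no way to tell which one was
meant.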

However, that reasoning doesn't apply if all the members of the union can
be distinguished by their JSON token type alone.

I am considering using a JSON encoding that omits the type name when all
the members of the union encode to distinct JSON token types (the JSON
token types being: null, boolean, string, number, object and array).
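
A rough sketch of the rule in Python (the table and function below are my
own illustration, not part of any Avro library; named-type references,
logical types etc. are glossed over):

    # Map simple Avro type names to the JSON token type their JSON
    # encoding produces.
    JSON_TOKEN = {
        "null": "null",
        "boolean": "boolean",
        "int": "number",
        "long": "number",
        "float": "number",
        "double": "number",
        "string": "string",
        "bytes": "string",   # bytes encode as a JSON string
        "enum": "string",
        "fixed": "string",
        "record": "object",
        "map": "object",
        "array": "array",
    }

    def can_omit_type_names(union_members):
        # The plain encoding is unambiguous only when every member
        # of the union encodes to a distinct JSON token type.
        tokens = [JSON_TOKEN[m] for m in union_members]
        return len(tokens) == len(set(tokens))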

For example, JSON-encoded values using the Avro schema ["null", "string",
"int"] would encode as the literal values themselves (e.g. null, "foo", 999),
but JSON-encoded values using the Avro schema ["int", "double"] would
require the type name because the JSON lexeme doesn't distinguish between
different kinds of number.
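
Using the sketch above, that rule would give:

    print(can_omit_type_names(["null", "string", "int"]))  # True: null, "foo", 999
    print(can_omit_type_names(["int", "double"]))          # False: both are numbers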

This would make it possible to represent a significant subset of "normal"
JSON formats with Avro schemas. It seems to me that could be very useful.

Thoughts? Is this a really bad idea to be contemplating? :)

  cheers,
    rog.
