Hello! I had a Java unit test ready to go (looking at default values for complex types for AVRO-2636), so just reporting back (the easy work!):
1. In Java, the schema above is parsed without error, but when attempting to use the default value, it fails with a NullPointerException (trying to find the symbol C in E1).

2. If you were to disambiguate the symbols using the Avro JSON encoding ("default": [{"E1":"B"},{"E2":"A"},{"E2":"C"}]), Java fails while parsing the schema:

org.apache.avro.AvroTypeException: Invalid default for field F: [{"E1":"B"},{"E2":"A"},{"E2":"C"}] not a {"type":"array","items":[{"type":"enum","name":"E1","symbols":["A","B"]},{"type":"enum","name":"E2","symbols":["B","A","C"]}]}
    at org.apache.avro.Schema.validateDefault(Schema.java:1542)
    at org.apache.avro.Schema.access$500(Schema.java:87)
    at org.apache.avro.Schema$Field.<init>(Schema.java:523)
    at org.apache.avro.Schema.parse(Schema.java:1649)
    at org.apache.avro.Schema$Parser.parse(Schema.java:1396)
    at org.apache.avro.Schema$Parser.parse(Schema.java:1384)

It seems that Java implements `Only the first schema in any union can be used in a default value` as opposed to `Default values for union fields correspond to the first schema in the union` (in the example, it isn't a union field).

Naively, I would expect any JSON encoded data to be a valid default value (which is not what the spec says). Does anyone know why the "first schema only" rule was added to the spec?

Best regards, Ryan

On Thu, Dec 5, 2019 at 7:01 PM Lee Hambley <lee.hamb...@gmail.com> wrote:
>
> Hi Rog,
>
> Glad my pointers were useful, the Avro spec really is a marvel.
>
> Regarding your follow-up question, I'm honestly not sure. It's an interesting
> contrived example, however, and interesting that no matter how well written
> the spec is, it can still be ambiguous.
>
> I found this snippet in the 1.9.x docs, where I know there were some changes to
> defaults for complex types; the 1.8 docs may be incomplete in that regard. (
> https://avro.apache.org/docs/1.9.0/spec.html#schema_complex )
>
>> Default values for union fields correspond to the first schema in the union.
>> Default values for bytes and fixed fields are JSON strings, where Unicode
>> code points 0-255 are mapped to unsigned 8-bit byte values 0-255.
>
>
> I take `Default values for union fields correspond to the first schema in the
> union` to mean that your default including values from the 2nd schema in the
> union is invalid, *or* that where the member exists in the first schema it
> refers to the first schema, and when not, it refers to the first schema in
> which it _does_ exist.
>
> One way to find out would be to run some data through a couple of common
> implementations, see how they handle the resulting data, and maybe feed
> that back into the Avro docs in the form of a PR if you come up with something
> useful?
>
> Either way, I'm curious now! Let me know when you have an answer?
>
> Cheers,
>
> Lee Hambley
> http://lee.hambley.name/
> +49 (0) 170 298 5667
>
>
> On Thu, 5 Dec 2019 at 14:07, roger peppe <rogpe...@gmail.com> wrote:
>>
>> On Wed, 4 Dec 2019 at 11:38, Lee Hambley <lee.hamb...@gmail.com> wrote:
>>>
>>> Hi Rog,
>>>
>>> Good question, the answer lies in the docs, in the "Parsing Canonical Form
>>> for Schemas" section, where it states (amongst all the other transformation rules)
>>>
>>>> [ORDER] Order the appearance of fields of JSON objects as follows: name,
>>>> type, fields, symbols, items, values, size. For example, if an object has
>>>> type, name, and size fields, then the name field should appear first,
>>>> followed by the type and then the size fields.
>>>
>>>
>>> (emphasis mine)
>>>
>>> The canonical form for schemas becomes more relevant to Avro usage when
>>> working with a schema registry, for example, but it's a really common use-case
>>> and I consider the definition of a canonical form for schema comparisons to be
>>> a strength of Avro compared with other serialization formats.
>>>
>>> -
>>> https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas
>>
>>
>> Thanks very much - I'd missed that, very helpful!
>>
>> Maybe you might be able to help with another part of the spec that I've been
>> puzzling over too: default values for complex types.
>> The spec doesn't seem to say how unions in complex types are specified when
>> in default values.
>>
>> For example, consider the following schema:
>>
>> {
>>   "type": "record",
>>   "name": "R",
>>   "fields": [
>>     {
>>       "name": "F",
>>       "type": {
>>         "type": "array",
>>         "items": [
>>           {
>>             "type": "enum",
>>             "name": "E1",
>>             "symbols": ["A", "B"]
>>           },
>>           {
>>             "type": "enum",
>>             "name": "E2",
>>             "symbols": ["B", "A", "C"]
>>           }
>>         ]
>>       },
>>       "default": ["A", "B", "C"]
>>     }
>>   ]
>> }
>>
>> This seems like it should be valid according to the spec, because default
>> value encodings don't encode the type name in enums, unlike in the JSON
>> encoding, but in this case there seems to be no way to tell which enum types end
>> up in the array value of the field F, because the enum symbols themselves
>> are ambiguous.
>>
>> How are schema validators meant to resolve this ambiguity?
>>
>> cheers,
>> rog.
>>
>>>
>>> HTH,
>>>
>>> Lee Hambley
>>> http://lee.hambley.name/
>>> +49 (0) 170 298 5667
>>>
>>>
>>> On Wed, 4 Dec 2019 at 12:17, roger peppe <rogpe...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> My apologies in advance if this topic has been well discussed before - the
>>>> mailing list search tool appears to be broken (the link points to the
>>>> expired domain name "search-hadoop.com").
>>>>
>>>> I'm trying to understand recursive types in Avro, given that the
>>>> specification says about names:
>>>>
>>>>> a name must be defined before it is used ("before" in the depth-first,
>>>>> left-to-right traversal of the JSON parse tree, where the types attribute
>>>>> of a protocol is always deemed to come "before" the messages attribute.)
>>>>
>>>>
>>>> By my reading, this would make the following Avro schema invalid, because
>>>> the name "R" will not yet be defined when it's referenced inside the type
>>>> of the field F, because in depth-first order, the leaf is traversed before
>>>> the root.
>>>>
>>>> {
>>>>   "type": "record",
>>>>   "fields": [
>>>>     {"name": "F", "type": ["null", "R"]}
>>>>   ],
>>>>   "name": "R"
>>>> }
>>>>
>>>> It seems that types like this are valid in practice (I found the above
>>>> example in an Avro test suite), so could someone enlighten me as to how
>>>> this is allowed, please?
>>>>
>>>> Thanks for any info. If I'm asking in the wrong place, please advise me of
>>>> a better forum!
>>>>
>>>> rog.
>>>>
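
For anyone reading the archive later: one way to see why the recursive schema in the first message is accepted is to register a record's name before descending into its fields, regardless of where the "name" key sits in the JSON object. The following is only a minimal Python sketch of that idea, not the code of any Avro implementation. check_names, PRIMITIVES and NAMED are names made up for illustration, and namespaces, aliases and protocols are ignored.

```python
import json

# Hypothetical helper names, for illustration only.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED = {"record", "enum", "fixed"}


def check_names(schema, defined=None):
    """Walk a parsed schema and return the set of names it defines.

    Raises ValueError if a name is referenced before it has been defined.
    """
    if defined is None:
        defined = set()
    if isinstance(schema, str):
        # A bare string is either a primitive or a reference to a named type.
        if schema not in PRIMITIVES and schema not in defined:
            raise ValueError(f"undefined name: {schema}")
    elif isinstance(schema, list):
        # A JSON array is a union: check each branch left to right.
        for branch in schema:
            check_names(branch, defined)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if kind in NAMED:
            # Assumption: the name is registered first, whatever the key
            # order in the JSON object, and only then are fields visited.
            defined.add(schema["name"])
            for field in schema.get("fields", []):
                check_names(field["type"], defined)
        elif kind == "array":
            check_names(schema["items"], defined)
        elif kind == "map":
            check_names(schema["values"], defined)
        else:
            # e.g. {"type": "string"} or a nested type definition.
            check_names(kind, defined)
    return defined


# The recursive schema from the first message, with "name" written last.
recursive = json.loads("""
{
  "type": "record",
  "fields": [
    {"name": "F", "type": ["null", "R"]}
  ],
  "name": "R"
}
""")

print(check_names(recursive))  # {'R'}
```

With this ordering the reference to "R" inside field F resolves, whereas a reference to a name that is never defined raises ValueError.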