Hi Rog,

Glad my pointers were useful; the Avro spec really is a marvel.

Regarding your follow-up question: I'm honestly not sure. It's an
interesting contrived example, though, and it's interesting that no matter
how well written a spec is, it can still be ambiguous.

I found this snippet in the 1.9.x docs, where I know there were some changes
to defaults for complex types; the 1.8 docs may be incomplete in that regard.
(https://avro.apache.org/docs/1.9.0/spec.html#schema_complex)

> Default values for union fields correspond to the first schema in the
> union. Default values for bytes and fixed fields are JSON strings, where
> Unicode code points 0-255 are mapped to unsigned 8-bit byte values 0-255.

I take `Default values for union fields correspond to the first schema in
the union` to mean either that your default, which includes values from the
2nd schema in the union, is invalid, *or* that where the member exists in the
first schema of the union it refers to that schema, and where it does not, it
refers to the first schema in which it _does_ exist.

One way to find out would be to run some data through a couple of common
implementations and see how they handle the resulting data, and maybe feed
that back into the Avro docs in the form of a PR if you come up with
something useful?

Either way, I'm curious now! Let me know when you have an answer?

Cheers,

Lee Hambley
http://lee.hambley.name/
+49 (0) 170 298 5667

On Thu, 5 Dec 2019 at 14:07, roger peppe <rogpe...@gmail.com> wrote:

> On Wed, 4 Dec 2019 at 11:38, Lee Hambley <lee.hamb...@gmail.com> wrote:
>
>> Hi Rog,
>>
>> Good question, the answer lay in the docs in the "Parsing Canonical Form
>> for Schemas" where it states (amongst all the other transformation rules)
>>
>>> [ORDER] Order the appearance of fields of JSON objects as follows: *name*,
>>> type, *fields*, symbols, items, values, size. For example, if an
>>> object has type, name, and size fields, then the name field should
>>> appear first, followed by the type and then the size fields.
>>
>>
>> (emphasis mine)
>>
>> The canonical form for schemas becomes more relevant to Avro usage when
>> working with a schema registry, for example, but it's a really common
>> use-case and I consider definition of a canonical form for schema
>> comparisons to be a strength of Avro compared with other serialization
>> formats.
>>
>> - https://avro.apache.org/docs/1.8.2/spec.html#Parsing+Canonical+Form+for+Schemas
>>
>
> Thanks very much - I'd missed that, very helpful!
>
> Maybe you might be able to help with another part of the spec that I've
> been puzzling over too: default values for complex types.
> The spec doesn't seem to say how unions in complex types are specified
> when in default values.
>
> For example, consider the following schema:
>
> {
>   "type": "record",
>   "name": "R",
>   "fields": [
>     {
>       "name": "F",
>       "type": {
>         "type": "array",
>         "items": [
>           {
>             "type": "enum",
>             "name": "E1",
>             "symbols": ["A", "B"]
>           },
>           {
>             "type": "enum",
>             "name": "E2",
>             "symbols": ["B", "A", "C"]
>           }
>         ]
>       },
>       "default": ["A", "B", "C"]
>     }
>   ]
> }
>
> This seems like it should be valid according to the spec, because default
> value encodings don't encode the type name in enums, unlike in the JSON
> encoding, but in this case there seems to be no way to tell which enum
> types end up in the array value of the field F, because the enum symbols
> themselves are ambiguous.
>
> How are schema validators meant to resolve this ambiguity?
>
> cheers,
> rog.
>
>
>> HTH,
>>
>> Lee Hambley
>> http://lee.hambley.name/
>> +49 (0) 170 298 5667
>>
>>
>> On Wed, 4 Dec 2019 at 12:17, roger peppe <rogpe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My apologies in advance if this topic has been well discussed before -
>>> the mailing list search tool appears to be broken (the link points to the
>>> expired domain name "search-hadoop.com").
>>>
>>> I'm trying to understand about recursive types in Avro, given that the
>>> specification says about names
>>> <http://avro.apache.org/docs/current/spec.html#names>:
>>>
>>>> a name must be defined before it is used ("before" in the depth-first,
>>>> left-to-right traversal of the JSON parse tree, where the types attribute
>>>> of a protocol is always deemed to come "before" the messages
>>>> attribute.)
>>>
>>>
>>> By my reading, this would make the following Avro schema invalid,
>>> because the name "R" will not yet be defined when it's referenced inside
>>> the type of the field F, because in depth-first order, the leaf is
>>> traversed before the root.
>>>
>>> {
>>>   "type": "record",
>>>   "fields": [
>>>     {"name": "F", "type": ["null", "R"]}
>>>   ],
>>>   "name": "R"
>>> }
>>>
>>> It seems that types like this are valid in practice (I found the above
>>> example in an Avro test suite), so could someone enlighten me as to how
>>> this is allowed, please?
>>>
>>> Thanks for any info. If I'm asking in the wrong place, please advise me
>>> of a better forum!
>>>
>>> rog.
>>>
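
For what it's worth, here is a minimal sketch of the reading suggested by the
[ORDER] rule quoted above, applied to the recursive schema at the bottom of
the thread. It assumes a traversal that registers a named type's "name"
before descending into its "fields", regardless of the order the keys appear
in the JSON text. The function name `undefined_references` is hypothetical,
namespaces and aliases are ignored, and this is only one interpretation of
the spec, not how any particular Avro implementation is known to work.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED_KINDS = {"record", "enum", "fixed"}

# The recursive schema from the thread, with "name" written last on purpose.
RECURSIVE = """
{
  "type": "record",
  "fields": [
    {"name": "F", "type": ["null", "R"]}
  ],
  "name": "R"
}
"""


def undefined_references(schema, defined=None):
    """Return type names that are referenced before they are defined.

    Assumption: a named type's "name" is visited before its "fields",
    whatever the key order in the source JSON.
    """
    if defined is None:
        defined = set()
    missing = []
    if isinstance(schema, str):
        # A bare string is either a primitive or a reference to a named type.
        if schema not in PRIMITIVES and schema not in defined:
            missing.append(schema)
    elif isinstance(schema, list):
        # A union: check each branch left to right.
        for branch in schema:
            missing += undefined_references(branch, defined)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if isinstance(kind, str) and kind in NAMED_KINDS:
            # Register the name first, then walk the fields (if any).
            defined.add(schema["name"])
            for field in schema.get("fields", []):
                missing += undefined_references(field["type"], defined)
        elif kind == "array":
            missing += undefined_references(schema["items"], defined)
        elif kind == "map":
            missing += undefined_references(schema["values"], defined)
        else:
            # e.g. {"type": "string"} or {"type": ["null", "R"]}
            missing += undefined_references(kind, defined)
    return missing


if __name__ == "__main__":
    print(undefined_references(json.loads(RECURSIVE)))  # prints []
```

Under that assumption the reference to "R" inside field F resolves, because
"R" is already registered by the time the fields are walked. A traversal that
followed the textual key order instead would report "R" as undefined for this
input.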