@Roger: The CUE schema gets a +1 for the most accurate regex for validating names and namespaces so far! :D It doesn't look like it's being applied to *every* name and namespace attribute though, or am I misreading? I read the schema with just a *minimal* understanding of the language, but it looks like it also expects that fixed data can have a doc.
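For reference, here is a minimal sketch of the naming rules as I read them in the Avro spec: a name starts with [A-Za-z_] and continues with [A-Za-z0-9_], and a namespace is a dot-separated sequence of such names, with the empty string standing for the null namespace. This is not the regex from the CUE schema, and the helper names `is_valid_name` and `is_valid_namespace` are made up for illustration.

```python
import re

# One name component per the Avro spec: [A-Za-z_] followed by [A-Za-z0-9_]*.
NAME = r"[A-Za-z_][A-Za-z0-9_]*"
NAME_RE = re.compile(NAME)
# A namespace is a dot-separated sequence of name components.
NAMESPACE_RE = re.compile(rf"{NAME}(?:\.{NAME})*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` is a single, unqualified Avro name."""
    return NAME_RE.fullmatch(name) is not None


def is_valid_namespace(namespace: str) -> bool:
    """Return True for the empty (null) namespace or dot-separated names."""
    return namespace == "" or NAMESPACE_RE.fullmatch(namespace) is not None
```

Using `fullmatch` rather than `^...$` avoids accidentally accepting a name with a trailing newline.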
I would hope that the doc attribute in a fixed data schema could still be retrieved like any other metadata by schema.getObjectProp (at least in the Java API). I'll check!

@Jonah: I think I understand your use case a bit better -- thanks for the clarification! Attributes outside of the spec should be OK to use as metadata, and that seems like the right fit for your use case (such as the interesting obfuscation attribute in lenses).

Are the Avro tools that strip non-spec attributes/metadata doing something wrong? I can see this happening if they are relying on the Parsing Canonical Form or the fingerprint (based on canonical form), but that is deliberate: it removes all differences between two schemas that can be used to parse the same binary data. Note that PCF also removes doc attributes.

Is there code in the Avro project that is manipulating schemas and stripping metadata silently? I would consider that a bug. For external tools, it could either be a bug or undocumented behaviour.

All my best, Ryan

On Mon, Dec 9, 2019 at 5:14 PM roger peppe <rogpe...@gmail.com> wrote:
>
> Somewhat relevant, here is a CUE schema for Avro schemas that I wrote a
> little while ago that can be used to check Avro schema compliance to a degree
> (if you haven't heard of CUE, there's a bunch of info on it at cuelang.org).
>
> My understanding of Avro was somewhat less then, so it's probably wrong in
> parts, and it's definitely not as strict as it could be, but I've found it
> useful, and it has lots of room for improvement.
>
> cheers,
> rog.
>
>
>
> On Fri, 6 Dec 2019 at 17:43, Jonah H. Harris <jonah.har...@gmail.com> wrote:
>>
>> On Fri, Dec 6, 2019 at 12:16 PM Ryan Skraba <r...@skraba.com> wrote:
>>>
>>> Hello! Yes, it looks like `fixed` is the only named complex type that
>>> doesn't have a doc attribute. No primitive types have the doc
>>> attribute.
>>>
>>> This might be an omission, but I don't think it's inconsistent.
>>> In my experience, there's no compelling reason to document schemas of
>>> primitive types, but it's good practice for the fields or container types
>>> that they're inside. Fixed is not a primitive type, but in practice
>>> it's used like bytes (which is).
>>
>>
>> Hey, Ryan. Thanks for getting back to me so quickly.
>>
>> Yeah. I don't think primitive types need the doc attribute. As fixed is
>> complex and can be an independent type, however, I thought that was
>> inconsistent with the other complex types.
>>
>>>
>>> In my opinion, I wouldn't consider it important to make the doc
>>> attribute universal on any type/field, but I wouldn't have any strong
>>> objection if that were the consensus. Today, I'm pretty sure that the
>>> Java implementation corresponds to the spec with regard to the doc
>>> attribute.
>>
>>
>> Agreed.
>>
>>>
>>> As a minimum, I'd propose that the only action here is to change the
>>> IDL guide: "Comments that begin with /** are used as the documentation
>>> string (if applicable) for the type or field definition that follows
>>> the comment."
>>>
>>> Is this what you're looking for?
>>
>>
>> Yes. We're actually using the doc string to store not only a textual
>> description of the field/type, but also a set of annotations used for event
>> storage and data masking. The main reason we wanted doc to be consistent for
>> all complex types (including fixed) is that it permits us to easily tell
>> what complex objects can exist across the ecosystem directly from our schema
>> repository. Initially, we wanted to use a separate internal attribute
>> (similar to the lenses obfuscate attribute approach --
>> https://docs.lenses.io/2.0/install_setup/datagovernance/index.html#data-anonymization
>> ), but we've found several Avro tools strip out all non-spec-compliant
>> attributes. This leaves us only the doc field.
>>
>>> P.S. I'm very intrigued by the "thorough schema compliance checker"!
>>> Is this something that would be shared?
>>> Would it help find other inconsistencies in the Avro spec and implementations?
>>
>>
>> Yes, this will be open-sourced.
>>
>> --
>> Jonah H. Harris
>>
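
P.S. To illustrate the Parsing Canonical Form point in my reply: below is a rough sketch of only the [STRIP] step of PCF as the spec describes it, which keeps the attributes type, name, fields, symbols, items, values and size and drops everything else (doc, aliases, custom metadata). This is not the Avro library's implementation, and it skips the other PCF steps (full names, attribute ordering, and so on). The function name `strip_for_pcf` and the `obfuscate` attribute are made up for the example.

```python
# Attributes the Avro spec lists as relevant to parsing data.
KEPT = {"type", "name", "fields", "symbols", "items", "values", "size"}


def strip_for_pcf(schema):
    """Recursively drop every attribute that the STRIP step does not keep."""
    if isinstance(schema, dict):
        return {k: strip_for_pcf(v) for k, v in schema.items() if k in KEPT}
    if isinstance(schema, list):  # unions, field lists, symbol lists
        return [strip_for_pcf(item) for item in schema]
    return schema  # primitive type names, sizes, individual symbols


fixed = {
    "type": "fixed",
    "name": "md5",
    "size": 16,
    "doc": "MD5 digest",  # doc on a fixed type, as discussed in this thread
    "obfuscate": True,    # hypothetical custom (non-spec) attribute
}
stripped = strip_for_pcf(fixed)
```

Any tool that round-trips a schema through something like this would lose both the doc and the custom attribute, which would explain the stripping without it being a bug in the tool's parsing.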