This is a bit weird, and it could be clarified. The "same" schema, of course, can be represented by different JSON texts: they can differ in attribute order, in how the full name is expressed or inherited from an enclosing namespace, etc. The parsing canonical form is identical for every representation of the "same" schema, so it can be used to generate a fingerprint. Which leads to the confusing question: what does SAME mean, exactly?
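As a rough illustration of "same" in this sense, here is a simplified Python sketch. It is not the Avro library's implementation: it applies only the spec's [STRIP] and [ORDER] steps plus whitespace removal, and skips the other steps (full-name resolution, primitive simplification, string and integer normalization). The helper names `canonicalize`, `canonical_text` and `fingerprint` are made up for this example, and SHA-256 is just one of the fingerprint algorithms the spec suggests.

```python
import hashlib
import json

# Attributes kept by the spec's [STRIP] step, in the order given by [ORDER].
KEPT = ["name", "type", "fields", "symbols", "items", "values", "size"]


def canonicalize(schema):
    """Simplified sketch: recursively apply STRIP and ORDER only."""
    if isinstance(schema, list):
        # A union, a list of fields, or a list of enum symbols.
        return [canonicalize(item) for item in schema]
    if isinstance(schema, dict):
        # Dicts keep insertion order, so iterating KEPT also reorders the keys.
        return {key: canonicalize(schema[key]) for key in KEPT if key in schema}
    return schema  # a type name, a symbol, or a fixed size


def canonical_text(schema):
    # Compact separators drop all whitespace outside string literals.
    return json.dumps(canonicalize(schema), separators=(",", ":"))


def fingerprint(schema):
    # Hypothetical helper: SHA-256 over the canonical text.
    return hashlib.sha256(canonical_text(schema).encode("utf-8")).hexdigest()


# Two JSON representations that differ in attribute order, doc and enum default.
writer = {"type": "enum", "name": "Suit", "symbols": ["SPADES", "HEARTS"],
          "doc": "card suit"}
reader = {"name": "Suit", "default": "SPADES",
          "symbols": ["SPADES", "HEARTS"], "type": "enum"}

# A schema with a different symbol list is a genuinely different schema.
other = {"type": "enum", "name": "Suit", "symbols": ["SPADES", "CLUBS"]}

print(canonical_text(writer))
print(fingerprint(writer) == fingerprint(reader))
print(fingerprint(writer) == fingerprint(other))
```

Under these assumptions the writer and reader schemas collapse to the same text and fingerprint even though only one of them declares an enum default, while the schema with different symbols does not.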
My understanding is that it defines the minimum you need to *serialize* the data (the writer, or actual schema). If you have a SAME schema when reading that data, it is guaranteed to be able to deserialize it, even if you've changed some extra attributes in the reader (or expected) schema: how the namespace is set, whether a logical type has been added to or removed from a field, or (you guessed it) the default that you want to use to cover missing data or unknown enum symbols.

With the rise of streaming data and schema registries, there is probably a new need for a definition of SAME that includes schema evolution attributes. I think there's a good JIRA that describes this, but the parsing canonical form does NOT meet that need.

If I've made a mistake here, feel free to jump in with your clarification!

All my best, Ryan

On Sat, Jul 29, 2023 at 5:06 PM Michael A. Smith <mich...@smith-li.com> wrote:
>
> The spec says one of the steps to get parsing canonical form is
>
> > [STRIP] Keep only attributes that are relevant to parsing data, which are:
> > type, name, fields, symbols, items, values, size. Strip all others (e.g.,
> > doc and aliases).
>
> and indeed, we strip the default from an EnumSchema. But is that
> right? It seems to me that we'd want to keep that. Can someone help me
> understand if (and how) it's correct to strip the enum default in
> parsing canonical form?
>
> Thanks,
> Michael