This is a bit weird, and it could be clarified.

The "same" schema, of course, can be represented by different JSON
text files: they can differ in attribute order, in how the full name
is written, in whether full names are inherited, etc.  Two "same"
schemas always have the same canonical form, which can then be used
to generate a fingerprint.  Which leads to the confusing question:
what does SAME mean, exactly?

My understanding is that the parsing canonical form defines the
minimum you need to *serialize* the data (the writer, or actual,
schema).  If the schema you use when reading that data is the SAME,
it is guaranteed to be able to deserialize it, even if you've changed
some extra attributes in the reader (or expected) schema: how the
namespace is set, a logical type added to or removed from a field, or
(you guessed it) the default that you want to use to cover missing
fields or unknown enum symbols.
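
To make that concrete, here is a rough Python sketch of just the
[STRIP] and ordering steps, not a full implementation.  It assumes the
attribute list quoted from the spec below, skips the other
transformations (full names, primitives, and so on), and uses SHA-256
for the fingerprint purely for illustration.  The Suit schema and the
helper names are made up for this example.

```python
import hashlib
import json

# Attributes kept by the spec's [STRIP] step, listed in the output order used here.
KEPT = ("name", "type", "fields", "symbols", "items", "values", "size")


def strip(schema):
    """Simplified [STRIP]: recursively drop every attribute not in KEPT."""
    if isinstance(schema, list):  # a union, or a list of fields or symbols
        return [strip(item) for item in schema]
    if isinstance(schema, dict):
        return {key: strip(schema[key]) for key in KEPT if key in schema}
    return schema  # a type name, symbol string, or number


def canonical(schema):
    """Serialize the stripped schema without whitespace (sketch only)."""
    return json.dumps(strip(schema), separators=(",", ":"))


def fingerprint(schema):
    """Hash the canonical text; SHA-256 is an illustrative choice."""
    return hashlib.sha256(canonical(schema).encode("utf-8")).hexdigest()


# Two hypothetical enum schemas that differ only in doc and default.
writer = {"type": "enum", "name": "Suit", "doc": "as written",
          "symbols": ["SPADES", "HEARTS"], "default": "SPADES"}
reader = {"type": "enum", "name": "Suit",
          "symbols": ["SPADES", "HEARTS"], "default": "HEARTS"}

print(canonical(writer))
print(fingerprint(writer) == fingerprint(reader))
```

Both schemas reduce to the same text, so they get the same
fingerprint even though their enum defaults differ.  That is the
behaviour being asked about: the default only affects how a reader
resolves data, not how the bytes are laid out.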

With the rise of streaming data and schema registries, there is
probably a new need for a definition of SAME that includes schema
evolution attributes.  I think there's a good JIRA that describes
this, but the parsing canonical form does NOT meet that need.

If I've made a mistake here, feel free to jump in with your clarification!

All my best, Ryan

On Sat, Jul 29, 2023 at 5:06 PM Michael A. Smith <mich...@smith-li.com> wrote:
>
> The spec says one of the steps to get parsing canonical form is
>
> > [STRIP] Keep only attributes that are relevant to parsing data, which are: 
> > type, name, fields, symbols, items, values, size. Strip all others (e.g., 
> > doc and aliases).
>
> and indeed, we strip the default from an EnumSchema. But is that
> right? It seems to me that we'd want to keep that. Can someone help me
> understand if (and how) it's correct to strip the enum default in
> parsing canonical form?
>
> Thanks,
> Michael
