Re: name-agnostic schema resolution (a.k.a. structural subtyping?)

roger peppe Fri, 20 Dec 2019 14:23:57 -0800

Actually, having looked a bit closer, I think I get the gist of what you're
saying (though the IDL spec
<https://avro.apache.org/docs/current/idl.html> doesn't
seem to mention the @logicalType form, so I'm still guessing somewhat).


I'd certainly considered that approach. Essentially you seem to be
suggesting wrapping an arbitrary data payload inside the message.
Let's call my approach "unified" and yours "wrapper".

I think there are some advantages to the unified approach.

- With the unified approach you have a single schema for all the data in
the topic; with the wrapper approach, each topic essentially has two
associated schemas: the metadata (wrapper) schema and the underlying
payload schema. This makes everything a bit more complex. Standard tooling
may not be able to use the topic's registered schema to decode the full
messages from the topic. Deciding backward compatibility with the unified
schema is also more straightforward: it can be done with exactly the usual
single-schema compatibility checks (as implemented by the schema registry,
for example).

- If you want to pull out both metadata and payload data, you can do so in
a single operation; it's simpler code and simpler conceptually I think.

> this way a system that is interested in the metadata does not even have to
> deserialize the payload….


I take this point; it could indeed be more efficient to use the wrapper
approach (although there might be extra data copying costs too). As always
with optimisation, it would be worth measuring. There's one interesting
possibility to get the best of both worlds, actually: if the messages are
written with a schema that has the Metadata field first in the struct and
the reader is only extracting the Metadata field, a sufficiently clever
decoder could stop after the information for that field has been read -
there's no need to read any further. I think that could be just as
efficient and I don't think it would be *that* hard to do.
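
To make the early-stop idea a bit more concrete, here's a rough sketch in
Python using only the standard library. It hand-rolls just enough of the Avro
binary encoding (zigzag variable-length longs, length-prefixed strings) to
write a record whose first field is Metadata, and then reads only that field.
The helper names (write_long, encode_event, read_metadata_only and so on) and
the string-only payload are assumptions made up for illustration; they are not
part of any Avro library, and a real decoder would be driven by the schema
rather than hard-coded.

```python
import io

def write_long(out, n):
    """Write n using Avro's zigzag variable-length integer encoding."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes get short encodings
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))

def write_string(out, s):
    """Write a string as a long byte count followed by UTF-8 data."""
    data = s.encode("utf-8")
    write_long(out, len(data))
    out.write(data)

def read_long(inp):
    """Read one zigzag variable-length integer."""
    shift = 0
    z = 0
    while True:
        b = inp.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def read_string(inp):
    """Read a length-prefixed UTF-8 string."""
    n = read_long(inp)
    return inp.read(n).decode("utf-8")

def encode_event(meta, payload_strings):
    """Encode a record with Metadata first, then (hypothetical) string fields.

    Avro encodes a record as the concatenation of its fields in schema
    order, so the nested Metadata record's bytes come first.
    """
    out = io.BytesIO()
    write_string(out, meta["id"])
    write_string(out, meta["source"])
    write_long(out, meta["time"])
    for s in payload_strings:
        write_string(out, s)
    return out.getvalue()

def read_metadata_only(data):
    """Decode just the leading Metadata field and stop.

    Returns the metadata and the number of bytes consumed; the payload
    bytes after that point are never examined.
    """
    inp = io.BytesIO(data)
    meta = {
        "id": read_string(inp),
        "source": read_string(inp),
        "time": read_long(inp),
    }
    return meta, inp.tell()
```

The point is only that the reader touches a short prefix of the message: the
bytes consumed depend on the size of the metadata, not on the size of the
payload that follows it.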

Thanks very much for your feedback, BTW.

  cheers,
    rog.


On Fri, 20 Dec 2019 at 21:06, Zoltan Farkas <[email protected]> wrote:

> Hi Roger,
>
> have you considered leveraging Avro logical types and keeping the payload
> and event metadata “separate”?
>
> Here is an example (I will use Avro IDL, since that is more readable to me
> :-) ):
>
> record MetaData {
> @logicalType("instant") string timeStamp;
> ….. all the meta data fields...
> }
>
> record CloudEvent {
>
> MetaData metaData;
>
> Any payload;
>
> }
>
> @logicalType("any")
> record Any {
>
> /** here you have the schema of the data, for efficiency, you can use a
> schema id + schema repo, or something like
> https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences */
> string schema;
>
> bytes data;
>
> }
>
> this way a system that is interested in the metadata does not even have to
> deserialize the payload….
>
> hope it helps.
>
> —Z
>
>
> On Dec 18, 2019, at 11:49 AM, roger peppe <[email protected]> wrote:
>
> Hi,
>
> Background: I've been contemplating the proposed Avro format in the CloudEvent
> specification
> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
> defines standard metadata for events. It defines a very generic format for
> an event that allows storage of almost any data. It seems to me that by
> going in that direction it's losing almost all the advantages of using Avro
> in the first place. It feels like it's trying to shoehorn a dynamic message
> format like JSON into the Avro format, where using Avro itself could do so
> much better.
>
> I'm hoping to propose something better. I had what I thought was a nice
> idea, but it doesn't *quite* work, and I thought I'd bring up the subject
> here and see if anyone had some better ideas.
>
> The schema resolution
> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution> part
> of the spec allows a reader to read data that was written with a schema
> containing extra fields. So, theoretically, we could define a CloudEvent
> something like this:
>
> {
>   "name": "CloudEvent", "type": "record",
>   "fields": [{
>     "name": "Metadata",
>     "type": {
>       "type": "record", "name": "CloudEvent", "namespace": "avro.apache.org",
>       "fields": [
>         { "name": "id", "type": "string" },
>         { "name": "source", "type": "string" },
>         { "name": "time", "type": { "type": "long", "logicalType": "timestamp-micros" } }
>       ]
>     }
>   }]
> }
>
> Theoretically, this could enable any event that's a record that has *at
> least* a Metadata field with the above fields to be read generically. The
> CloudEvent type above could be seen as a structural supertype of all
> possible more-specific CloudEvent-compatible records that have such a
> compatible field.
>
> This has a few nice advantages:
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the
> metadata.
> - there's still exactly one schema for a topic, encapsulating both the
> metadata and the payload.
>
> However, this idea fails because of one problem - this schema resolution
> rule: "both schemas are records with the same (unqualified) name". This
> means that unless *everyone* names all their CloudEvent-compatible
> records "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records
> "CloudEvent", so we have a problem.
>
> I can see a few possible workarounds:
>
>    1. when reading the record as a CloudEvent, read it with a schema
>    that's the same as CloudEvent, but with the top level record name changed
>    to the top level name of the schema that was used to write the record.
>    2. ignore record names when matching schema record types.
>    3. allow aliases to be specified when writing data as well as reading
>    it. When defining a CloudEvent-compatible event, you'd add a CloudEvent
>    alias to your record.
>
> None of the options are particularly nice. 1 is probably the easiest to
> do, although it means you'd still need some custom logic when decoding
> records, meaning you couldn't use stock decoders.
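
As a rough illustration of workaround 1, here is a small Python sketch that
treats schemas as plain dicts (as produced by json.loads). It copies the
generic CloudEvent reader schema and swaps in the unqualified top-level name
of the writer's schema, so that the "same (unqualified) name" rule is
satisfied. The function name reader_schema_for and the writer schema used
with it are hypothetical; the resulting dict would still have to be handed
to whatever decoder is in use as the reader schema.

```python
import copy

# Generic reader schema, following the CloudEvent example in this thread.
CLOUD_EVENT = {
    "name": "CloudEvent",
    "type": "record",
    "fields": [{
        "name": "Metadata",
        "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
                {"name": "time",
                 "type": {"type": "long", "logicalType": "timestamp-micros"}},
            ],
        },
    }],
}

def reader_schema_for(writer_schema, generic=CLOUD_EVENT):
    """Return a copy of `generic` renamed to match the writer's record.

    Only the top-level record name is changed; the nested Metadata record
    is assumed to be a shared type that writers reuse under its own name.
    """
    reader = copy.deepcopy(generic)
    # Schema resolution compares unqualified names, so drop any namespace
    # prefix that the writer put in its "name" attribute.
    reader["name"] = writer_schema["name"].rsplit(".", 1)[-1]
    return reader
```

This is the custom logic referred to above: the reader schema has to be
derived per writer schema before decoding, which is why a stock decoder
configured with one fixed reader schema wouldn't be enough.
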
>
> I like the idea of 2, although it gets a bit tricky when dealing with
> union types. You could define the matching such that it ignores names only
> when the two matched types are unambiguous (i.e. only one record in both).
> This could be implemented as an option ("use structural typing") when
> decoding.
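
A rough sketch of what the "use structural typing" check in option 2 might
look like, in Python with schemas as plain dicts. It is deliberately narrow:
it handles only records and primitive types, requires primitive types to
match exactly (no type promotion), and refuses unions and named-type
references outright, since those are exactly the tricky cases. The function
name structurally_readable is an assumption for illustration and is not part
of any Avro implementation.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double",
              "bytes", "string"}

def type_name(schema):
    """Return the Avro type name of a schema given as a str or a dict."""
    return schema if isinstance(schema, str) else schema["type"]

def structurally_readable(writer, reader):
    """Report whether `reader` can read `writer`'s data, ignoring record names.

    Reader fields are matched to writer fields by field name. A reader field
    missing from the writer is acceptable only if it has a default. Extra
    writer fields are ignored, as in normal schema resolution.
    """
    if isinstance(writer, list) or isinstance(reader, list):
        raise NotImplementedError("unions are not handled in this sketch")
    wt, rt = type_name(writer), type_name(reader)
    if rt in PRIMITIVES:
        return wt == rt  # exact match only; promotions are left out
    if rt == "record":
        if wt != "record":
            return False
        # Note: the record names of writer and reader are never compared.
        writer_fields = {f["name"]: f["type"] for f in writer["fields"]}
        for field in reader["fields"]:
            if field["name"] in writer_fields:
                if not structurally_readable(writer_fields[field["name"]],
                                             field["type"]):
                    return False
            elif "default" not in field:
                return False
        return True
    raise NotImplementedError("type %r is not handled in this sketch" % (rt,))
```

With a check like this, a writer record named, say, OrderPlaced that has a
Metadata field containing at least id, source and time would be accepted by
the generic CloudEvent reader schema regardless of what either record is
called.
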
>
> 3 is probably cleanest but interacts significantly with the spec (for
> example, the canonical schema transformation strips aliases out, but they'd
> need to be retained).
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a better
> way?
>
>   cheers,
>     rog.
>
>
>
