Hi,

Excuse my ignorance, but I'm not at all familiar with IDL. Is there an easy way to translate it to a JSON Avro schema, please? (preferably online :))
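Not online, but close: the Avro tools jar has an `idl2schemata` command that converts an IDL file into JSON schema files. A minimal sketch, assuming a local jar and an IDL file named `events.avdl` (both names are illustrative; any recent avro-tools release should work):

```shell
# Extract a .avsc JSON schema file for each named type in the IDL file.
java -jar avro-tools-1.11.3.jar idl2schemata events.avdl out/
# out/ then contains one .avsc file per record, e.g. MetaData.avsc

# Alternatively, convert the whole IDL protocol to a JSON protocol (.avpr):
java -jar avro-tools-1.11.3.jar idl events.avdl events.avpr
```

Note that classic IDL syntax wraps record definitions in a `protocol { ... }` block, so standalone records like the ones quoted below may need that wrapper first.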
cheers,
rog.

On Fri, 20 Dec 2019 at 21:06, Zoltan Farkas <zolyfar...@yahoo.com> wrote:

> Hi Roger,
>
> have you considered leveraging Avro logical types, and keeping the payload
> and event metadata "separate"?
>
> Here is an example (I'll use Avro IDL, since that is more readable to me :-) ):
>
>     record MetaData {
>       @logicalType("instant") string timeStamp;
>       ..... all the metadata fields ...
>     }
>
>     record CloudEvent {
>       MetaData metaData;
>       Any payload;
>     }
>
>     @logicalType("any")
>     record Any {
>       /** Here you have the schema of the data. For efficiency, you can
>        *  use a schema id + schema repo, or something like
>        *  https://github.com/zolyfarkas/jaxrs-spf4j-demo/wiki/AvroReferences */
>       string schema;
>
>       bytes data;
>     }
>
> This way, a system that is interested in the metadata does not even have to
> deserialize the payload.
>
> hope it helps.
>
> —Z
>
> On Dec 18, 2019, at 11:49 AM, roger peppe <rogpe...@gmail.com> wrote:
>
> Hi,
>
> Background: I've been contemplating the proposed Avro format in the
> CloudEvents specification
> <https://github.com/cloudevents/spec/blob/master/avro-format.md>, which
> defines standard metadata for events. It defines a very generic format for
> an event that allows storage of almost any data. It seems to me that by
> going in that direction it loses almost all the advantages of using Avro
> in the first place. It feels like it's trying to shoehorn a dynamic message
> format like JSON into the Avro format, where using Avro itself could do so
> much better.
>
> I'm hoping to propose something better. I had what I thought was a nice
> idea, but it doesn't *quite* work, so I thought I'd bring up the subject
> here and see if anyone had some better ideas.
>
> The schema resolution
> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution> part
> of the spec allows a reader's schema to resolve data that was written
> with a schema containing extra fields.
> So, theoretically, we could define a CloudEvent something like this:
>
>     {
>       "name": "CloudEvent",
>       "type": "record",
>       "fields": [{
>         "name": "Metadata",
>         "type": {
>           "type": "record",
>           "name": "CloudEvent",
>           "namespace": "avro.apache.org",
>           "fields": [
>             { "name": "id", "type": "string" },
>             { "name": "source", "type": "string" },
>             { "name": "time", "type": "long", "logicalType": "timestamp-micros" }
>           ]
>         }
>       }]
>     }
>
> Theoretically, this could enable any event that's a record with *at
> least* a Metadata field with the above fields to be read generically.
> The CloudEvent type above could be seen as a structural supertype of
> all possible more-specific CloudEvent-compatible records that have such
> a compatible field.
>
> This has a few nice advantages:
>
> - there's no need for any wrapping of payload data.
> - the CloudEvent type can evolve over time like any other Avro type.
> - all the data message fields are immediately available alongside the
>   metadata.
> - there's still exactly one schema for a topic, encapsulating both the
>   metadata and the payload.
>
> However, this idea fails because of one problem: this schema resolution
> rule: "both schemas are records with the same (unqualified) name". This
> means that unless *everyone* names all their CloudEvent-compatible
> records "CloudEvent", they can't be read like this.
>
> I don't think people will be willing to name all their records
> "CloudEvent", so we have a problem.
>
> I can see a few possible workarounds:
>
> 1. when reading the record as a CloudEvent, read it with a schema
>    that's the same as CloudEvent, but with the top-level record name
>    changed to the top-level name of the schema that was used to write
>    the record.
> 2. ignore record names when matching schema record types.
> 3. allow aliases to be specified when writing data as well as reading
>    it. When defining a CloudEvent-compatible event, you'd add a
>    "CloudEvent" alias to your record.
>
> None of the options are particularly nice.
> Option 1 is probably the easiest to do, although it means you'd still
> need some custom logic when decoding records, meaning you couldn't use
> stock decoders.
>
> I like the idea of 2, although it gets a bit tricky when dealing with
> union types. You could define the matching such that it ignores names
> only when the match between the two types is unambiguous (i.e. only one
> record type in each union). This could be implemented as an option
> ("use structural typing") when decoding.
>
> 3 is probably the cleanest, but it interacts significantly with the
> spec (for example, the canonical schema transformation strips aliases
> out, but they'd need to be retained).
>
> Any thoughts? Is this a silly thing to be contemplating? Is there a
> better way?
>
> cheers,
> rog.
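For what it's worth, the rename in workaround 1 is mechanical enough to sketch. Here is a minimal illustration using only the Python standard library; the cut-down schema and the `OrderCreated` writer record are made up for the example. It rewrites the top-level name of the generic reader schema to match the writer's top-level name, before the schema would be handed to an ordinary resolving decoder:

```python
import json

# A cut-down version of the generic CloudEvent reader schema (illustrative).
CLOUD_EVENT_READER = {
    "type": "record",
    "name": "CloudEvent",
    "fields": [{
        "name": "Metadata",
        "type": {
            "type": "record",
            "name": "CloudEvent",
            "namespace": "avro.apache.org",
            "fields": [
                {"name": "id", "type": "string"},
                {"name": "source", "type": "string"},
            ],
        },
    }],
}

def reader_schema_for(writer_schema: dict) -> dict:
    """Workaround 1: clone the generic reader schema, renaming its top-level
    record to the writer's top-level name (and namespace) so that Avro's
    "records with the same (unqualified) name" resolution rule is satisfied."""
    reader = json.loads(json.dumps(CLOUD_EVENT_READER))  # cheap deep copy
    reader["name"] = writer_schema["name"]
    if "namespace" in writer_schema:
        reader["namespace"] = writer_schema["namespace"]
    else:
        reader.pop("namespace", None)
    return reader

# A producer's own record name (hypothetical); payload fields elided.
writer = {"type": "record", "name": "OrderCreated",
          "namespace": "com.example", "fields": []}
patched = reader_schema_for(writer)
```

The point this makes concrete: the only custom step is the rename itself, but because it has to happen per writer schema before resolution, it can't be done with a stock decoder alone, which is the drawback noted above.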