On Sat, 21 Dec 2019, 17:09 Vance Duncan, <dunca...@gmail.com> wrote:

> I suggest naming the timestamp field "timestamp" rather than "time". You
> might also want to consider calling it "eventTimestamp", since you may
> need to distinguish when the event occurred from when it was actually
> published, due to delays in batching, intermittent downtime, etc.
>
> Also, I suggest considering the addition of traceability metadata, which
> is almost always required in any practical implementation. An array of
> correlation IDs is great for that. It gives publishers/subscribers a way
> of tracing events back to their external causes. Possibly also an array
> of "priorEventIds", so that a full tree of traceability can be
> established post facto.
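>
> Concretely (a sketch only - these field declarations are illustrative,
> not a worked-out proposal), that might look like:
>
>   { "name": "eventTimestamp",
>     "type": { "type": "long", "logicalType": "timestamp-micros" } },
>   { "name": "correlationIds",
>     "type": { "type": "array", "items": "string" } },
>   { "name": "priorEventIds",
>     "type": { "type": "array", "items": "string" } }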

Your suggestions sound good, but I'm unfortunately not in a position to
define those things at this time - the existing CloudEvents specification
already defines names and semantics for those fields (see
https://github.com/cloudevents/spec/blob/v1.0/spec.md). I am just trying
to define a reasonable way of idiomatically encapsulating those existing
CloudEvents semantics within the Avro format. (You might notice that I
omitted some fields which are arguably redundant when one knows the
writer's schema, e.g. data content type and data schema.)

cheers,
rog.

> On Wed, Dec 18, 2019 at 11:49 AM roger peppe <rogpe...@gmail.com> wrote:
>
>> Hi,
>>
>> Background: I've been contemplating the proposed Avro format in the
>> CloudEvents specification
>> <https://github.com/cloudevents/spec/blob/master/avro-format.md>,
>> which defines standard metadata for events. It defines a very generic
>> format that can hold almost any data, but it seems to me that by going
>> in that direction it loses almost all the advantages of using Avro in
>> the first place. It feels like it's trying to shoehorn a dynamic
>> message format like JSON into Avro, when Avro itself could do so much
>> better.
>>
>> I'm hoping to propose something better. I had what I thought was a
>> nice idea, but it doesn't *quite* work, so I thought I'd bring up the
>> subject here and see if anyone has better ideas.
>>
>> The schema resolution
>> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution>
>> part of the spec allows a reader to read data that was written with a
>> schema containing extra fields. So, theoretically, we could define a
>> CloudEvent something like this:
>>
>> {
>>   "name": "CloudEvent",
>>   "type": "record",
>>   "fields": [{
>>     "name": "Metadata",
>>     "type": {
>>       "type": "record",
>>       "name": "CloudEvent",
>>       "namespace": "avro.apache.org",
>>       "fields": [
>>         { "name": "id", "type": "string" },
>>         { "name": "source", "type": "string" },
>>         { "name": "time",
>>           "type": { "type": "long", "logicalType": "timestamp-micros" } }
>>       ]
>>     }
>>   }]
>> }
>>
>> Theoretically, this would enable any event that is a record with *at
>> least* a Metadata field holding the above fields to be read
>> generically. The CloudEvent type above could be seen as a structural
>> supertype of all the more-specific CloudEvent-compatible records that
>> have such a compatible field.
>>
>> This has a few nice advantages:
>>
>> - there's no need for any wrapping of payload data.
>> - the CloudEvent type can evolve over time like any other Avro type.
>> - all the message's data fields are immediately available alongside
>>   the metadata.
>> - there's still exactly one schema for a topic, encapsulating both
>>   the metadata and the payload.
>>
>> However, this idea fails because of one schema resolution rule: "both
>> schemas are records with the same (unqualified) name". This means that
>> unless *everyone* names their CloudEvent-compatible records
>> "CloudEvent", they can't be read like this - and I don't think people
>> will be willing to do that, so we have a problem.
>>
>> I can see a few possible workarounds (a concrete sketch of options 1
>> and 3 follows the list):
>>
>> 1. when reading the record as a CloudEvent, read it with a schema
>>    that's the same as CloudEvent, but with the top-level record name
>>    changed to the top-level name of the schema that was used to write
>>    the record.
>> 2. ignore record names when matching schema record types.
>> 3. allow aliases to be specified when writing data as well as reading
>>    it: when defining a CloudEvent-compatible event, you'd add a
>>    CloudEvent alias to your record.
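>>
>> To make that concrete, here's a sketch of what a producer's schema
>> might look like (the "UserCreated" event and its fields are
>> hypothetical, purely for illustration):
>>
>> {
>>   "name": "UserCreated",
>>   "type": "record",
>>   "fields": [{
>>     "name": "Metadata",
>>     "type": {
>>       "type": "record",
>>       "name": "CloudEvent",
>>       "namespace": "avro.apache.org",
>>       "fields": [
>>         { "name": "id", "type": "string" },
>>         { "name": "source", "type": "string" },
>>         { "name": "time",
>>           "type": { "type": "long", "logicalType": "timestamp-micros" } }
>>       ]
>>     }
>>   }, {
>>     "name": "userId", "type": "string"
>>   }]
>> }
>>
>> Under option 1, a generic consumer would decode this with the
>> CloudEvent reader schema given earlier, but with its top-level name
>> rewritten from "CloudEvent" to "UserCreated" so that the resolution
>> rule is satisfied; schema resolution would then surface the Metadata
>> field and skip userId. Under option 3, the producer would instead
>> declare the alias in its own schema, something like:
>>
>> { "name": "UserCreated", "aliases": ["CloudEvent"], "type": "record", ... }
>>
>> though note that the current spec only applies aliases from the
>> reader's schema - changing that is exactly what option 3 entails.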
>>
>> None of the options are particularly nice. 1 is probably the easiest
>> to do, although it means you'd still need some custom logic when
>> decoding records, so you couldn't use stock decoders.
>>
>> I like the idea of 2, although it gets a bit tricky when dealing with
>> union types. You could define the matching so that it ignores names
>> only when the match is unambiguous (i.e. when there's only one record
>> type in each union). This could be implemented as a "use structural
>> typing" option when decoding.
>>
>> 3 is probably the cleanest, but it interacts significantly with the
>> spec (for example, the canonical schema transformation strips aliases
>> out, whereas they'd need to be retained).
>>
>> Any thoughts? Is this a silly thing to be contemplating? Is there a
>> better way?
>>
>> cheers,
>> rog.
>
> --
> Regards,
>
> Vance Duncan
> mailto:dunca...@gmail.com
> http://www.linkedin.com/in/VanceDuncan
> (904) 553-5582