On Sat, 21 Dec 2019, 17:09 Vance Duncan, <dunca...@gmail.com> wrote:

> I suggest naming the timestamp field "timestamp" rather than "time". You
> might also want to consider calling it "eventTimestamp", since you may
> need to distinguish when the event occurred from when it was actually
> published, due to delays in batching, intermittent downtime, etc.
>
> Also, I suggest considering the addition of traceability metadata, which
> is almost always required in any practical implementation. An array of
> correlation IDs is great for that. It gives publishers/subscribers a way
> of tracing events back to their external causes. Possibly also an array
> of "priorEventIds", so that a full tree of traceability can be
> established post facto.
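>
> Concretely (a sketch only - these field declarations are illustrative,
> not a worked-out proposal), that might look like:
>
>   { "name": "eventTimestamp",
>     "type": { "type": "long", "logicalType": "timestamp-micros" } },
>   { "name": "correlationIds",
>     "type": { "type": "array", "items": "string" } },
>   { "name": "priorEventIds",
>     "type": { "type": "array", "items": "string" } }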

Your suggestions sound good, but I'm unfortunately not in a position to
define those things at this time - the existing CloudEvents specification
already defines names and semantics for those fields (see
https://github.com/cloudevents/spec/blob/v1.0/spec.md). I am just trying
to define a reasonable way of idiomatically encapsulating those existing
CloudEvents semantics within the Avro format. (You might notice that I
omitted some fields which are arguably redundant when one knows the
writer's schema, e.g. data content type and data schema.)

cheers,
rog.

> On Wed, Dec 18, 2019 at 11:49 AM roger peppe <rogpe...@gmail.com> wrote:
>
>> Hi,
>>
>> Background: I've been contemplating the proposed Avro format in the
>> CloudEvents specification
>> <https://github.com/cloudevents/spec/blob/master/avro-format.md>,
>> which defines standard metadata for events. It defines a very generic
>> format that can hold almost any data, but it seems to me that by going
>> in that direction it loses almost all the advantages of using Avro in
>> the first place. It feels like it's trying to shoehorn a dynamic
>> message format like JSON into Avro, when Avro itself could do so much
>> better.
>>
>> I'm hoping to propose something better. I had what I thought was a
>> nice idea, but it doesn't *quite* work, so I thought I'd bring up the
>> subject here and see if anyone has better ideas.
>>
>> The schema resolution
>> <https://avro.apache.org/docs/current/spec.html#Schema+Resolution>
>> part of the spec allows a reader to read data that was written with a
>> schema containing extra fields. So, theoretically, we could define a
>> CloudEvent something like this:
>>
>> {
>>   "name": "CloudEvent",
>>   "type": "record",
>>   "fields": [{
>>     "name": "Metadata",
>>     "type": {
>>       "type": "record",
>>       "name": "CloudEvent",
>>       "namespace": "avro.apache.org",
>>       "fields": [
>>         { "name": "id", "type": "string" },
>>         { "name": "source", "type": "string" },
>>         { "name": "time",
>>           "type": { "type": "long", "logicalType": "timestamp-micros" } }
>>       ]
>>     }
>>   }]
>> }
>>
>> Theoretically, this would enable any event that is a record with *at
>> least* a Metadata field holding the above fields to be read
>> generically. The CloudEvent type above could be seen as a structural
>> supertype of all the more-specific CloudEvent-compatible records that
>> have such a compatible field.
>>
>> This has a few nice advantages:
>>
>> - there's no need for any wrapping of payload data.
>> - the CloudEvent type can evolve over time like any other Avro type.
>> - all the message's data fields are immediately available alongside
>>   the metadata.
>> - there's still exactly one schema for a topic, encapsulating both
>>   the metadata and the payload.
>>
>> However, this idea fails because of one schema resolution rule: "both
>> schemas are records with the same (unqualified) name". This means that
>> unless *everyone* names their CloudEvent-compatible records
>> "CloudEvent", they can't be read like this - and I don't think people
>> will be willing to do that, so we have a problem.
>>
>> I can see a few possible workarounds (a concrete sketch of options 1
>> and 3 follows the list):
>>
>> 1. when reading the record as a CloudEvent, read it with a schema
>>    that's the same as CloudEvent, but with the top-level record name
>>    changed to the top-level name of the schema that was used to write
>>    the record.
>> 2. ignore record names when matching schema record types.
>> 3. allow aliases to be specified when writing data as well as reading
>>    it: when defining a CloudEvent-compatible event, you'd add a
>>    CloudEvent alias to your record.
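>>
>> To make that concrete, here's a sketch of what a producer's schema
>> might look like (the "UserCreated" event and its fields are
>> hypothetical, purely for illustration):
>>
>> {
>>   "name": "UserCreated",
>>   "type": "record",
>>   "fields": [{
>>     "name": "Metadata",
>>     "type": {
>>       "type": "record",
>>       "name": "CloudEvent",
>>       "namespace": "avro.apache.org",
>>       "fields": [
>>         { "name": "id", "type": "string" },
>>         { "name": "source", "type": "string" },
>>         { "name": "time",
>>           "type": { "type": "long", "logicalType": "timestamp-micros" } }
>>       ]
>>     }
>>   }, {
>>     "name": "userId", "type": "string"
>>   }]
>> }
>>
>> Under option 1, a generic consumer would decode this with the
>> CloudEvent reader schema given earlier, but with its top-level name
>> rewritten from "CloudEvent" to "UserCreated" so that the resolution
>> rule is satisfied; schema resolution would then surface the Metadata
>> field and skip userId. Under option 3, the producer would instead
>> declare the alias in its own schema, something like:
>>
>> { "name": "UserCreated", "aliases": ["CloudEvent"], "type": "record", ... }
>>
>> though note that the current spec only applies aliases from the
>> reader's schema - changing that is exactly what option 3 entails.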
>>
>> None of the options are particularly nice. 1 is probably the easiest
>> to do, although it means you'd still need some custom logic when
>> decoding records, so you couldn't use stock decoders.
>>
>> I like the idea of 2, although it gets a bit tricky when dealing with
>> union types. You could define the matching so that it ignores names
>> only when the match is unambiguous (i.e. when there's only one record
>> type in each union). This could be implemented as a "use structural
>> typing" option when decoding.
>>
>> 3 is probably the cleanest, but it interacts significantly with the
>> spec (for example, the canonical schema transformation strips aliases
>> out, whereas they'd need to be retained).
>>
>> Any thoughts? Is this a silly thing to be contemplating? Is there a
>> better way?
>>
>> cheers,
>> rog.
>
> --
> Regards,
>
> Vance Duncan
> mailto:dunca...@gmail.com
> http://www.linkedin.com/in/VanceDuncan
> (904) 553-5582