> I think we're essentially talking about the same stuff in principle, but
> on different levels, mine being more low-level Avro, while I think you talk
> about the Confluent "expansion pack". I do not talk about the Confluent platform,
> and actually cannot use it at the moment, but that's not important, since
> all of that is possible in plain AVRO.
>

Well, Confluent or Hortonworks: there are two widely used registries, and I
have used both interchangeably (they are different, but not in basic usage).


> (I) Let's begin with message encoding, you mentioned 3:
>
> 1) message format (5 byte prefix) — *my comment: I believe this is not
> AVRO, but a non-standardized expansion pack. The 5B prefix is some hash of the
> schema; works with the schema registry. OK.*
>

Correct, it's common enough that I'd be confident calling it a de facto
standard.
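For reference, this is how I understand that prefix is laid out (not part of
the Avro spec itself); a tiny sketch, where `messageBytes` is just a
placeholder for the raw record value:

// one magic byte, then a 4-byte big-endian schema ID, then plain Avro binary
ByteBuffer buf = ByteBuffer.wrap(messageBytes);   // java.nio.ByteBuffer
byte magic = buf.get();                           // expected to be 0
int schemaId = buf.getInt();                      // ID to look the writer schema up by
// everything from buf.position() onwards is the Avro-encoded body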


> *2) *"stream format" — *my comment: again, not avro but confluent
> expansion pack. I know about its existence however I did not see it
> anywhere and don't know how this is actually produced.*
>

https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files - I
totally goofed on the name, this is the object container file. One header,
multiple records.
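A minimal sketch of writing one (schema and file name are made up, classes
are from org.apache.avro.file and org.apache.avro.generic):

static void writeContainerFile(Schema schema, List<GenericRecord> records) throws IOException {
  DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
  try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
    writer.create(schema, new File("activity.avro")); // one header, carries the full schema
    for (GenericRecord record : records) {
      writer.append(record);                          // any number of records follow
    }
  }
}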


> 3) "send the schema before every message" — *my comment: not sure what
> that is, but yes, I've heard about strategies of sending 2 kinds of
> messages, where one is "broadcasting" new schemas. Not sure if this has
> trivial support in pure AVRO, in principal it should be possible, but ...*
>

Sending the schema before every message is also
https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files with the
number of records = 1.
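On the consumer side you'd then read every message as a tiny container
file, roughly like this (`payload` stands in for the bytes you pulled off
the wire):

// org.apache.avro.file.DataFileStream + org.apache.avro.generic classes
try (DataFileStream<GenericRecord> stream = new DataFileStream<>(
         new ByteArrayInputStream(payload), new GenericDatumReader<GenericRecord>())) {
  Schema writerSchema = stream.getSchema();  // shipped in the header of every message
  while (stream.hasNext()) {                 // in this strategy: exactly one record
    GenericRecord record = stream.next();
  }
}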

> ok so now let's add 2 strategies from pure avro:
>
> 4) sending just bytes, schema does not change — well this one is obvious
>

yep



> 5) SINGLE OBJECT ENCODING (see [1] at bottom for link) *— comment: this
> is what I've been talking about in the previous mail, and in principle it's
> the AVRO way of doing variant 1). So the first 2 bytes are a header identifying
> the version of the header (yep, variant 1 with just a hash is incorrect to be
> honest), followed by the schema fingerprint relevant for the given header
> version (currently there is just one, but there is the possibility of future
> development, while variant 1 does not have this possibility), which is 8B, 10B in total.*
>



> * But it's worth noting that variants 1 and 5 are essentially the same,
> just 1 is incorrect by design, while 5 is correct by design. But in principle
> it's a binary identifier of the schema.*
>

You seem to be strongly against using the schema registry, but it solves
one very real problem:

How do you ship schemas to your readers and writers? With the SOE method you
know _which_ schema to use, but I'm not aware of any off-the-shelf solution
that helps with distributing the schema files, computing the fingerprints,
etc. I'm sure something exists, and it may be fair to say that 5 is a
formalized version of 1, or that they are two competing approaches
optimized for different things.
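For what it's worth, producing SOE bytes in plain Avro is already packaged
up (avro 1.8.2+, org.apache.avro.message); a sketch with made-up schema and
field names:

Schema schema = new Schema.Parser().parse(new File("activity-v1.avsc"));
GenericRecord record = new GenericData.Record(schema);
record.put("amount", 42L);                           // assumed field

BinaryMessageEncoder<GenericRecord> encoder =
    new BinaryMessageEncoder<>(GenericData.get(), schema);
ByteBuffer payload = encoder.encode(record);
// payload = 0xC3 0x01 marker + 8-byte CRC-64-AVRO fingerprint + Avro binary body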
With a registry you can also authenticate producers
(only trusted producers can upload new schemas). And you eliminate a lot of
trouble when deploying consumers, since they don't need the current
set of all schemas bundled with their deploy payload; they just need HTTP
access to the registry.

(I'm not trying to push you to a registry, at all, but I have only good
experiences)

> (II) quote: There's a concept of "canonicalization" in Avro, so if you have
> "the same" schema, but with fields out of order, or adding new fields, the
> canonical format is very good at keeping like-with-like.
>
> I'm sorry, like-with-like is totally insufficient for me. I need 100%
> same-with-same, otherwise there can be huge problems and financial loss.
> Like-with-like is not acceptable. There must be a 100% guarantee that
> evolution will do just what it should do. You asked what my evolution would
> be: typically it will be just adding field(s), or renaming fields; let's say
> that we will follow the recommendations of the confluent platform [2]; in more
> detail it is documented in the original avro [3], which I believe the confluent
> platform just delegates to, but [3] is really harder to read and understand
> in context.
>

I'm not trying to sell you anything, and if Avro is insufficient, then
please, take whatever tech you need. You can't demand magical schema
resolution behaviour while also demanding absolute control.

If you register {namespace: "Foo", name: "Money", ...} then the combination
`Foo.Money` will be assigned an ID. The schema will be canonicalized
(which is deterministic and spelled out very completely in the spec) and
assigned a version. If you add/remove fields on `Foo.Money` you will
increment the version.
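If you want to see exactly what gets canonicalized, plain Avro exposes it
directly; a quick sketch (file name made up):

// org.apache.avro.SchemaNormalization
Schema schema = new Schema.Parser().parse(new File("Money.avsc"));
String canonical   = SchemaNormalization.toParsingForm(schema);         // docs, aliases, defaults stripped
long   fingerprint = SchemaNormalization.parsingFingerprint64(schema);  // CRC-64-AVRO of that form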

Producers will emit data with a `[s1, v1]` header (just an example), and
consumers with a registry-supporting library will ALWAYS get the correct
writer schema.

I'll confess I don't know what happens if you add a field (version+1), and
then remove the same field: whether that's technically a new version or the
prior version. I might experiment with it sometime.


> (III) schema registry to the rescue(?) — ok, so what does the schema registry do?
> IIUC, it's just a trivial app which holds all known schema instances and
> provides a REST API to fetch schema data by the 5B ID which the deserializer
> obtained from the message data. Right?
>

Correct.


> But in (I) I already explained, that `message format` is almost the same
> as `single object encoding`,
>

SOE doesn't say how you should find the right schema given a fingerprint,
so SOE and the schema registry approach are not equivalent, and if you go
with SOE you'll need to find or build something like a registry.
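The closest thing to a ready-made "local registry" in plain Avro that I'm
aware of is the SchemaStore that pairs with the SOE decoder; a sketch
(schema file names made up, `payloadBytes` is the SOE message from the
topic):

Schema readerSchema = new Schema.Parser().parse(new File("activity-v2.avsc")); // what you want back
Schema oldWriter    = new Schema.Parser().parse(new File("activity-v1.avsc")); // what producers used

SchemaStore.Cache store = new SchemaStore.Cache();
store.addSchema(oldWriter); // keyed internally by the 8-byte fingerprint

BinaryMessageDecoder<GenericRecord> decoder =
    new BinaryMessageDecoder<>(GenericData.get(), readerSchema, store);
GenericRecord record = decoder.decode(payloadBytes);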


> actually I think that the avro team just stole the confluent serializer and
> retrofitted it back into avro, fixing the `message format` problems along the way.
> So in principle if you build a map ID->Schema locally and keep it updated
> somehow, you will end up having the same functionality, just without REST calls.
>

YSK that the registry clients often do a pre-fetch on initialization, and
only actually make HTTP calls on an unknown incoming schema ID/version,
which should be infrequent enough that you don't need to plan latency/time
for HTTP requests.

> Because confluent SchemaRegistry is just that — an external map id -> schema,
> which you can read and write. Right? (maybe with a potential DoS, when the kafka
> topic contains too many messages with schema IDs which the schema registry does
> not know about, and it will be flooded by rest requests; off topic). So
> based on that, I don't really understand how the schema registry could change
> the landscape for me, since it's essentially the same; at most it might do
> some things a little differently under the hood, which allows it to side-step
> AVRO gotchas. And that is what I'm searching for: a pure AVRO solution for schema
> evolution, and the recommended way to do that. The problem I have with the pure
> avro solution is that the avro schema parser (`
> org.apache.avro.Schema.Parser`) will NOT read 2 schemas with the same "name"
> (`org.apache.avro.Schema.Name`), because "name" equality
> (`org.apache.avro.Schema.Name#equals`) is defined using the fully
> qualified type name, ie. if you have a schema like:
>




> The problem with schema identity is general: if you have a project with two
> avsc files with the same "name", the avro-maven-plugin will fail during
> compilation. That's the second reason why I asked about the naming scheme: it
> seems it is generally unsupported to have 2 versions of the same schema with
> the same namespace.name in 1 project. Maybe this is the reason to have a
> version ID somewhere in the namespace? If your app needs, for whatever reason,
> to send data in 2 different versions?
>

Not familiar enough with Java/Maven to know anything about this, sorry.

However the usual approach is to always have the NEWEST schema in the
producer, and then have all the old schemas available for the consumers.
Use a registry, or build something yourself to extend SOE.


> (IV) the part where I lost you. Let me try to explain it then.
>
> I really don't know how this works/should work, as there are close to no
> complete actual examples and the documentation does not help much. For example
> if an avro schema evolves from v1 to v2,
>
> *ok so you have a schema, let's call it `v1`, and you add a field respecting
> [2]/[3], and you have a second schema v2.*
>
> and the type names and name schema aren't the same, how will the pairing
> between fields be made?? Completely puzzling.
>
> *ok, so it's not that puzzling, it's explained in [3]. But as explained
> above, schemas v1 and v2 won't be able to be parsed using the same Schema.Parser
> because of the AVRO implementation.*
>

Sounds like a problem that exists because the interface expects you to
define the schema in code *beforehand*. Any code using a schema registry just
needs a registry client and a binary payload, and you get back a
decoded object.
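To make that concrete, the whole consumer-side read path is roughly the
sketch below. The /schemas/ids/{id} REST path and the 1+4 byte header are
how I understand the confluent-style registry (the hortonworks one differs
in details, same shape); `messageBytes` and `readerSchema` are placeholders
for the raw record value and your app's current schema:

// java.net.URI, java.net.http.*, com.fasterxml.jackson.databind.ObjectMapper,
// org.apache.avro.*, org.apache.avro.generic.*, org.apache.avro.io.*
ByteBuffer buf = ByteBuffer.wrap(messageBytes);
buf.get();                                    // magic byte
int schemaId = buf.getInt();                  // 4-byte schema ID

HttpResponse<String> resp = HttpClient.newHttpClient().send(
    HttpRequest.newBuilder(URI.create("http://registry:8081/schemas/ids/" + schemaId)).build(),
    HttpResponse.BodyHandlers.ofString());
String writerJson = new ObjectMapper().readTree(resp.body()).get("schema").asText();
Schema writerSchema = new Schema.Parser().parse(writerJson);

DatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema, readerSchema);
Decoder dec = DecoderFactory.get().binaryDecoder(
    messageBytes, buf.position(), messageBytes.length - buf.position(), null);
GenericRecord record = reader.read(null, dec);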


> I need no less then schema evolution with backward and forward
> compatibility
>
> *schema evolution is clear I suppose, ie changing the original schema to
> something different, and compatibility, backward and forward, is achieved
> by AVRO itself. The deserializer needs to somehow identify the writer schema
> (trivially in strategies 1 or 5), then the data are deserialized using the
> writer schema and evolved to the desired reader schema. Not sure where/if this
> is documented, but you can check the sources:
> org.apache.avro.specific.SpecificDatumReader#SpecificDatumReader(org.apache.avro.Schema,
> org.apache.avro.Schema)
>
> /** Construct given writer's and reader's schema. */
> public SpecificDatumReader(Schema writer, Schema reader) {
>   this(writer, reader, SpecificData.get());
> }
>
>
> with schema reuse (ie. no hacks with a top-level union, but reusing schemas
> via schema imports).
>
> *about "schema reuse": this is my term, as it's not documented
> sufficiently. Sometimes you want to define type, which is referenced
> multiple times from other schema, potentially from different files. Typical
> and superbly ugly and problematic hack (recomended by ... some guys) is to
> define avro schema to have top-level union[4] instead of record, and cram
> everyting into 1 big file. But that is completely wrong. The correct way is
> to define that in separate files, and parse them in correct order or use `*
> <imports>*` in avro-maven-plugin.*
>

Well, there's something else that you should maybe be aware of: the
"superbly ugly hack" is SUPER common. In Kafka it (to my knowledge) is only
really possible to define one message type per topic, so the solution
everyone uses is to define an "empty" type, with just one giant union of
all the other types that can exist on the topic.
FWIW this has worked flawlessly for years at my company with dozens of
schema changes per month across the company common libraries/schemas. But I
agree, it's a shitty hack.
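On the "parse them in correct order" part of your mail: one Schema.Parser
instance remembers every named type it has seen, so shared types parsed
first can be referenced by full name later (file names made up):

Schema.Parser parser = new Schema.Parser();
Schema money   = parser.parse(new File("Money.avsc"));    // shared record, parsed first
Schema payment = parser.parse(new File("Payment.avsc"));  // can reference the Money type by full name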



> I think I can hack my way through by using one parser per set of 1 schema
> of a given version and all needed imports, which will make everything work
> (well, I don't yet know about anything which will fail), but it completely
> does not feel right. And I would like to know what is the correct avro way.
> And I suppose it should be possible without the confluent schema registry, just
> with single object encoding, as I cannot see any difference between them,
> but please correct me if I'm wrong.
>
> *I cannot see anything non-avro being written here. I cannot see anything
> java-specific here; these are all pure avro constructs.*
>
> (V) ad: Maybe it'd help to know what "evolution" you plan, and what type
> names and name schemas you plan to be changing? The "schema evolution" is
> mostly meant to make it easier to add and remove fields from the schemas
> without having to coordinate deploys and juggle iron-clad contract
> interchange formats. It's not meant for wild rewrites of the contract IDLs
> on active running services!
>
> about our usage: we have N running services which currently communicate
> using avro. We need to be able to redeploy services independently; that's
> the reason for backward and forward compatibility: service X upgrades and
> starts sending data using the upgraded schema, but old services must be able
> to consume it! And after service X is upgraded, there are myriads of records
> produced while it was down for a while, which must be processed. So yes,
> it's just adding/removing a column, mostly. This should be working and
> possible using just avro, well, as it's sold on their website. I
> understand that maybe with the confluent bonus-track code it might be working
> correctly, but some of our services cannot use that, so we are stuck with
> plain avro, but that should not be a problem; single-object-encoding and
> the schema registry should do the same thing, and avro should be working even
> without the confluent platform.
>

This makes perfect sense, so let me try and close this email out with the
following points that I believe are important:

- a schema is identified by namespace+name for the rest of these bullet points
- if you rename the schema {'namespace': 'money.v1', name: "activity"}, to
{'namespace': 'money.v2', name: "activity"} this is a new schema, no
evolution of anything will help you
- typically producers (writers in Avro parlance) will always have the
newest schema
- for SOE or registry approaches (anything that doesn't send the whole
schema in a prefix header) the consumers (readers in avro parlance) will
need access to prior schemas; I guess if you use SOE you need to find a way
to compile prior schemas in, and then have multiple instances of your Java
parser class and put them all in a big map keyed by their computed
fingerprint, so you can find the right one
- with the registry the producer would put the newest schema in the
registry the first time they produce; consumers will use the ID+version to
look up the correct schema in the registry
- consumers don't need to be compiled with schemas from files
- schema resolution is DETERMINISTIC (see the sketch below), but you still
need to define what happens when a consumer receives an unknown field from a
"newer" schema that maybe doesn't have a default (or adopt a policy in your
coding workflow of "all evolved fields must have defaults" to simplify things)

I'd also say that you should _maybe_ avoid the confluent docs. I've never
read them; I avoid anything confluent like the plague after being burned by
some of their shitty practices in a previous job. The Hortonworks registry
is compatible with every lib I've ever tried (node.js, ruby, go, rust) and
I assume the same is true for Java. There's virtually no documentation here
because what it's doing is TRIVIAL.

Finally, if there were a registry+SOE solution (and maybe there is) that
relies on the 10B header and fingerprint, rather than the canonical-form
JSON and some "IDs", I agree that would be preferable, but it's still
essentially going to be an HTTP/REST "shared service" between producers and
consumers.

Your use case makes sense; it's the same thing the rest of us are doing
with schema evolution. At this point I'd recommend that you just
experiment for a few hours one afternoon until you feel comfortable.

This really isn't terribly complicated, and your heightened concerns
about finances etc. are commendable, but unwarranted: a lot of our
multi-million EUR turnover in my current role is handled with Avro, and
we've never skipped a beat.

People in our team occasionally wish for something simpler, and we look at
protobuf, or msgpack, and invariably arrive back at "ohh, right, avro makes
the most sense".

Not sure how much more I can help; I seem to be advocating for a solution
you have decided to avoid, but I hope at least the specifics in this mail
about SOE and "message format" prefixes, and how they influence
producer/consumer design and deployment, are useful.

Regards,
