Hi guys, thanks for the answers, I really appreciate it. There were a lot of letters typed just for me, thanks!
Preword: there were some misunderstandings in my text, so I'll try to be more specific and reference documentation/sources. And I have to apologize beforehand: sometimes I have too confrontational a "tone"; please don't take it badly if you don't like the way I'm saying something. Let's go.

———

I think we're essentially talking about the same stuff in principle, but on different levels, mine being more low-level avro, while I think you talk about the confluent "expansion pack". I do not talk about the confluent platform, and actually cannot use it at the moment, but that's not important, since all of that is possible in plain AVRO.

(I) Let's begin with message encoding. You mentioned 3:

1) message format (5 byte prefix) — *my comment: I believe this is not AVRO, but a nonstandardized expansion pack. The 5B prefix is a schema identifier (a magic byte plus a 4B schema ID assigned by the registry, not a hash of the schema itself); works with the schema registry. OK.*

2) "stream format" — *my comment: again, not avro but a confluent expansion pack. I know about its existence, however I did not see it anywhere and don't know how it is actually produced.*

3) "send the schema before every message" — *my comment: not sure what that is, but yes, I've heard about strategies of sending 2 kinds of messages, where one kind is "broadcasting" new schemas. Not sure if this has trivial support in pure AVRO; in principle it should be possible, but ...*

OK, so now let's add 2 strategies from pure avro:

4) sending just the bytes, schema does not change — well, this one is obvious

5) SINGLE OBJECT ENCODING (see [1] at the bottom for the link) — *comment: this is what I've been talking about in the previous mail, and in principle it's the AVRO way of doing variant 1). The first 2 bytes are a header identifying the header version (yep, variant 1, with just a bare schema ID and no header version, is incorrect to be honest), followed by the schema fingerprint relevant for the given header version, which is 8B, so 10B in total. (Currently there is just one header version, but there is room for future development, while variant 1 does not have this possibility.) It's worth noting that variants 1 and 5 are essentially the same, just 1 incorrect by design, while 5 is correct by design. In principle both are a binary identifier of the schema.*
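To make strategy 5 concrete, here is a minimal sketch using only classes shipped with plain avro since 1.8.2 (`org.apache.avro.message`). The `Money` record is the example schema from (III) below; I haven't run this exact snippet, so take it as an illustration, not a reference:

    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.message.BinaryMessageDecoder;
    import org.apache.avro.message.BinaryMessageEncoder;

    public class SingleObjectRoundTrip {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
                + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]}]}");

            GenericRecord money = new GenericData.Record(schema);
            money.put("value", "9.99");

            // encode() prepends the 2B header version marker (0xC3 0x01) and the
            // 8B CRC-64-AVRO fingerprint of the writer schema: the 10B header
            ByteBuffer bytes =
                new BinaryMessageEncoder<GenericRecord>(GenericData.get(), schema)
                    .encode(money);

            // the decoder reads the fingerprint from the header and looks the
            // writer schema up in its internal fingerprint->schema map
            BinaryMessageDecoder<GenericRecord> decoder =
                new BinaryMessageDecoder<>(GenericData.get(), schema); // reader schema
            decoder.addSchema(schema); // register every known writer version here
            GenericRecord back = decoder.decode(bytes);
            System.out.println(back);
        }
    }

The `addSchema()` call is exactly the local fingerprint->schema map I talk about in (III) below; the confluent registry just externalizes that map behind REST.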
(II) quote: "There's a concept of "canonicalization" in Avro, so if you have "the same" schema, but with fields out of order, or adding new fields, the canonical format is very good at keeping like-with-like."

I'm sorry, like-with-like is totally insufficient for me. I need 100% same-with-same, otherwise there can be huge problems and financial loss. Like-with-like is not acceptable. There must be a 100% guarantee that evolution will do just what it should do. You asked what my evolution would be: typically it will be just adding field(s), or renaming fields; let's say that we will follow the recommendations of the confluent platform [2]. In more detail it is documented in original avro [3], which I believe the confluent platform just delegates to, but [3] is really harder to read and understand in context.

(III) schema registry to the rescue(?) — ok, so what does the schema registry do? IIUC, it's just a trivial app which holds all known schema instances and provides a REST api to fetch schema data by the 5B ID which the deserializer obtained from the message data. Right? But in (I) I already explained that `message format` is almost the same as `single object encoding`; actually I think the avro team just took the confluent serializer and retrofitted it back into avro, fixing the `message format` problems along the way.

So in principle, if you build a map ID->Schema locally and keep it updated somehow, you will end up with the same functionality, just without the REST calls. Because the confluent SchemaRegistry is just that — an external map ID->schema, which you can read and write. Right? (Maybe with a potential DoS, when a kafka topic contains too many messages with schema IDs which the schema registry does not know about, and it gets flooded by REST requests; off topic.) So based on that, I don't really understand how the schema registry could change the landscape for me, since it's essentially the same; at most it might do some things a little bit differently in depth, which allows it to side-step AVRO gotchas. And that is what I'm searching for: a pure AVRO solution for schema evolution, and the recommended way to do that.

The problem I have with the pure avro solution is that the avro schema parser (`org.apache.avro.Schema.Parser`) will NOT read 2 schemas with the same "name" (`org.apache.avro.Schema.Name`), because "name" equality (`org.apache.avro.Schema.Name#equals`) is defined using the fully qualified type name. Ie. if you have a schema like:

    {
      "namespace": "avroTest",
      "type": "record",
      "name": "Money",
      "fields": [
        { "name": "value", "type": [ "null", "string" ] }
      ]
    }

the fully qualified name would be "avroTest.Money". And if you create a new version and add a new field, but do not change the namespace or name, the fully qualified name stays the same, and 1 instance of `org.apache.avro.Schema.Parser` will not be able to parse both of these versions, producing:

    throw new SchemaParseException("Can't redefine: "+name);

This is why I asked about a recommended naming scheme: because this limitation exists. So then you can only have 1 parser per schema, or different "names". The problem with schema identity is general: if you have a project with two avsc files with the same "name", the avro-maven-plugin will fail during compilation. That's the second reason why I asked about a naming scheme: it seems it is generally unsupported to have 2 versions of the same schema sharing the same namespace.name in 1 project. Maybe this is the reason to have a version ID somewhere in the namespace? If your app needs, for whichever reason, to send data in 2 different versions?
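To show the "Can't redefine" limitation concretely, together with the per-version-parser workaround, a minimal sketch; v2 is a hypothetical evolution of the Money schema that just adds one field with a default, per [2]/[3]:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaParseException;

    public class CantRedefineDemo {
        static final String V1 =
            "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
            + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]}]}";
        // v2 adds a field WITH a default, so it stays backward and forward
        // compatible; the fully qualified name avroTest.Money is unchanged
        static final String V2 =
            "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
            + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]},"
            + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"EUR\"}]}";

        public static void main(String[] args) {
            Schema.Parser parser = new Schema.Parser();
            Schema v1 = parser.parse(V1);
            try {
                parser.parse(V2); // same parser, same avroTest.Money
            } catch (SchemaParseException e) {
                System.out.println(e.getMessage()); // Can't redefine: avroTest.Money
            }
            // the workaround I mean: a fresh Parser instance per schema version
            Schema v2 = new Schema.Parser().parse(V2);
        }
    }

So one Parser per version (plus its imports) is the only way I see to hold both versions at once.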
(IV) the part where I lost you. Let me try to explain it, then. (Quoting my previous mail, with *comments* inline.)

"I really don't know how this works/should work, as there are close to no complete actual examples, and the documentation does not help much. For example if avro schema evolves from v1 to v2" — *ok, so you have a schema, let's call it `v1`, and you add a field respecting [2]/[3], and you have a second schema, v2.*

"and the type names and naming scheme aren't the same, how will the pairing between fields be made?? Completely puzzling." — *ok, so it's not that puzzling; it's explained in [3]. But as explained above, schemas v1 and v2 cannot be parsed using the same Schema.Parser because of the AVRO implementation.*

"I need no less than schema evolution with backward and forward compatibility" — *schema evolution is clear, I suppose, ie. changing the original schema to something different; and compatibility, backward and forward, is achieved by AVRO itself. The deserializer needs to somehow identify the writer schema (trivially in strategies 1 or 5), then the data are deserialized using the writer schema and evolved to the desired reader schema. Not sure where/if this is documented, but you can check the sources (org.apache.avro.specific.SpecificDatumReader):

    /** Construct given writer's and reader's schema. */
    public SpecificDatumReader(Schema writer, Schema reader) {
        this(writer, reader, SpecificData.get());
    }

*

"with schema reuse (ie. no hacks with top level union, but schema reuse using schema imports)" — *about "schema reuse": this is my term, as it's not documented sufficiently. Sometimes you want to define a type which is referenced multiple times from other schemas, potentially from different files. The typical, superbly ugly and problematic, hack (recommended by ... some guys) is to define the avro schema to have a top-level union [4] instead of a record, and cram everything into 1 big file. But that is completely wrong. The correct way is to define that type in separate files, and either parse them in the correct order or use `<imports>` in the avro-maven-plugin; see the sketch just below.*

I think I can hack my way through by using one parser per set of {1 schema of a given version + all needed imports}, which will make everything work (well, I don't yet know about anything which will fail), but it completely does not feel right. And I would like to know what the correct avro way is. And I suppose it should be possible without the confluent schema registry, just with single object encoding, as I cannot see any difference between them, but please correct me if I'm wrong. *I cannot see anything non-avro written here, and nothing java-specific either; these are all pure avro constructs.*
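And to illustrate what I mean by reuse without the union hack, a minimal sketch with the schemas inlined as strings instead of separate avsc files (the `Invoice` type is a made-up example): one Parser, fed in dependency order, so the second schema can reference the first by fully qualified name:

    import org.apache.avro.Schema;

    public class SchemaReuseDemo {
        public static void main(String[] args) {
            Schema.Parser parser = new Schema.Parser();
            // parse the shared type first; the parser remembers avroTest.Money
            Schema money = parser.parse(
                "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
                + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]}]}");
            // a schema from another file can then reference it by full name,
            // no top-level union needed
            Schema invoice = parser.parse(
                "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Invoice\","
                + "\"fields\":[{\"name\":\"total\",\"type\":\"avroTest.Money\"}]}");
            System.out.println(invoice.toString(true));
        }
    }

With real files this is "parse in the correct order", or `<imports>` in the avro-maven-plugin. But note the same parser can then hold only ONE version of avroTest.Money, which is exactly the friction between reuse and versioning described above.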
(V) ad: "Maybe it'd help to know what "evolution" you plan, and what type names and name schemas you plan to be changing? The "schema evolution" is mostly meant to make it easier to add and remove fields from the schemas without having to coordinate deploys and juggle iron-clad contract interchange formats. It's not meant for wild rewrites of the contract IDLs on active running services!"

About our usage: we have N running services which currently communicate using avro. We need to be able to redeploy services independently; that's the reason for backward and forward compatibility: service X upgrades and starts sending data using the upgraded schema, but the old services must be able to consume it! And after service X is upgraded, there are myriads of records produced while it was down for a while, which must be processed. So yes, it's mostly just adding/removing a column. This should be working and possible using just avro — well, that's how it's sold on their website. I understand that maybe with the confluent bonus-track code it might work correctly, but some of our services cannot use that, so we are stuck with plain avro. That should not be a problem, though; single-object-encoding and the schema registry should do the same thing, and avro should work even without the confluent platform.

M.

Links:
[1] https://avro.apache.org/docs/current/spec.html#single_object_encoding
[2] https://docs.confluent.io/current/schema-registry/avro.html
[3] https://avro.apache.org/docs/current/spec.html#Schema+Resolution
[4] https://avro.apache.org/docs/current/spec.html#Unions

On Tue, 31 Dec 2019 at 20:49, Lee Hambley <lee.hamb...@gmail.com> wrote:

> Hi Martin,
>
> Vance already said it all, but let me see if I can elaborate a bit.
>
>> I don't understand avro sufficiently and don't know schema registry at
>> all, actually. So maybe the following questions will be dumb.
>>
>> a) how is schema registry with 5B header different from single object
>> encoding with 10B header?
>>
>
> Not sure what this 10B header is. Broadly speaking there's three ways to
> send header info with avro, plus the secret 4th way (don't: the schema
> doesn't change, reader and writer both have it).
>
> 1. Message format (5 byte prefix, with a schema registry; the header
> carries just the schema/version lookup info for the registry)
> 2. Stream format (?) (naming is for sure wrong; this sends the schema
> before sending any records, then an arbitrary (unlimited?) number of
> records; useful for archiving homogeneous data)
> 3. Send the schema before every message (might be flexible, but could
> negate any bandwidth savings)
>
> All three of these have names, and they're all recommended in certain
> circumstances; even without going deep, and with my weak executive
> summaries, I believe you could already imagine how they might be useful.
>
>> b) will schema registry somehow relieve me from having to parse individual
>> schemas? What if I want to/have to send 2 different versions of a certain
>> schema?
>>
>
> There's a concept of "canonicalization" in Avro, so if you have "the same"
> schema, but with fields out of order, or adding new fields, the canonical
> format is very good at keeping like-with-like. Libraries for the registry
> will absolve you of doing any parsing.
>
> Usually you configure a writer (producer, whatever) with a registry URL,
> a payload in a map/hash, and the "current" schemas; the library you use
> will canonicalize the schema, make sure it exists in the registry, and
> emit a binary avro payload referencing the schema.
>
> The reader needs no local schema files; it will receive a message with a
> 5B prefix and will look up that schema at that version in the registry,
> and will give you back a hash/map with the data. If you added a field to
> the producer before adding it to the consumer, you may have an extra
> member in the map that you don't know how to handle yet, or you might have
> an empty value that you don't know how to deal with if the consumer
> "knows" more fields than the producer.
>
> You solve this "problem" with the regular approach you would use in any
> code with untrusted data.
>
>> c) actually what I have here is (seemingly) a pretty similar setup (and
>> btw, which was recommended here as an alternative to confluent schema
>> registry): it's a registry without an extra service. A trivial map keyed
>> by the single object encoding schema fingerprint (a long), pairing
>> fingerprint to schema. So when the bytes "arrive" I can easily read the
>> header, find out the fingerprint, get hold of the schema and decode the
>> data. Trivial. But the snag is that a single Schema.Names instance can
>> contain just one Name of a given "identity", and equality is based on the
>> fully qualified type, ie. namespace and name. Thus if you have a schema in
>> 2 versions which share the same namespace and name, they cannot be parsed
>> using the same Parser. Does the schema registry (from confluent platform,
>> right?) work differently than this? Does this "use it for decoding"
>> process bypass avro's new Schema.Parser().parse and everything beneath it?
>>
>
> It's not idiomatic to put "v2" or anything in the schema namespace, unless
> someone is coaching you to avoid the schema registry approach (which, as a
> few of us have mentioned, is one principal reason to use avro). I've been
> in a company who have a v2 namespace in avro, but it's the last "v" we'll
> ever have.
> In v1 we didn't use a schema registry, in v2 we do, and the
> registry ensures readers and writers can always talk, and we just need to
> be mindful of
>
> FWIW we have one schema registry in each of our environments (prod,
> staging, qa); in retrospect we think this might have been a mistake, as
> e.g. the QA env doesn't keep any history, so we often fail to test older
> payloads in our test environments. But tbh it hasn't caused any _real_
> problems yet; still, it's something I would consider approaching with a
> global registry (fed by my CI system?) in the future.
>
>> I really don't know how this works/should work, as there are close to no
>> complete actual examples and the documentation does not help much. For
>> example if avro schema evolves from v1 to v2, and the type names and
>> naming scheme aren't the same, how will the pairing between fields be
>> made?? Completely puzzling. I need no less than schema evolution with
>> backward and forward compatibility with schema reuse (ie. no hacks with
>> top level union, but schema reuse using schema imports). I think I can
>> hack my way through, by using one parser per set of 1 schema of a given
>> version and all needed imports, which will make everything work (well, I
>> don't yet know about anything which will fail), but it completely does
>> not feel right. And I would like to know what the correct avro way is.
>> And I suppose it should be possible without the confluent schema
>> registry, just with single object encoding, as I cannot see any
>> difference between them, but please correct me if I'm wrong.
>>
>
> You lost me here; I think you're maybe crossing some vocabulary from your
> language stack, not from Avro per se, but I'm coming at Avro from Ruby and
> Node (yikes.) and have never used any JVM language integration, so assume
> this is ignorance on my part.
>
> Maybe it'd help to know what "evolution" you plan, and what type names and
> naming schemes you plan to be changing? The "schema evolution" is mostly
> meant to make it easier to add and remove fields from the schemas without
> having to coordinate deploys and juggle iron-clad contract interchange
> formats. It's not meant for wild rewrites of the contract IDLs on active
> running services!
>
> All the best for 2020, anyone else who happens to be reading mailing list
> emails this NYE!
>
>> thanks,
>> Mar.
>>
>> On Mon, 30 Dec 2019 at 20:32, Lee Hambley <lee.hamb...@gmail.com> wrote:
>>
>>> Hi Martin,
>>>
>>> I believe the answer is "just use the schema registry". When you then
>>> encode for the network, your library should give you a binary package
>>> with a 5 byte header that includes the schema version and name from the
>>> registry. The reader will then go to the registry, find that schema at
>>> that version, and use it for decoding.
>>>
>>> In my experience the naming etc. doesn't matter; only things like
>>> defaults in enums need to be given a thought, but you'll see that for
>>> yourself with experience.
>>>
>>> HTH, Regards,
>>>
>>> Lee Hambley
>>> http://lee.hambley.name/
>>> +49 (0) 170 298 5667
>>>
>>>
>>> On Mon, 30 Dec 2019 at 17:26, Martin Mucha <alfon...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I'm relatively new to avro, and I'm still struggling with getting
>>>> schema evolution and related issues right. But today it should be a
>>>> simple question.
>>>>
>>>> What is the recommended naming of types if we want to use schema
>>>> evolution? Should the namespace contain some information about the
>>>> version of the schema? Or should it be in the type itself?
>>>> Or neither? What is the best practice? Is evolution even possible if
>>>> the namespace/type name is different?
>>>>
>>>> I thought that "neither" is the case, and built the app so that the
>>>> version ID is nowhere except in the directory structure; only the
>>>> latest version is compiled to java classes using the maven plugin, and
>>>> all other avsc files are parsed in code (to be able to build some sort
>>>> of schema registry, identify the used writer schema using single
>>>> object encoding, and use schema evolution). However, I used a separate
>>>> Parser instance to parse each schema. But if one would like to use
>>>> schema imports, one cannot have a separate parser for every schema,
>>>> and having a global one in this setup is also not possible, as each
>>>> type can be registered just once in org.apache.avro.Schema.Names. Btw.
>>>> I favored this variant (ie. no ID in name/namespace) because in this
>>>> setup, after I introduce a new schema version, I do not have to change
>>>> imports in the whole project, but just one line in pom.xml saying
>>>> which directory should be compiled into java files.
>>>>
>>>> So what could be the suggestion for a correct naming-versioning
>>>> scheme?
>>>> thanks,
>>>> M.
>>>