Hi guys, thanks for the answers, I really appreciate it. There were a lot of letters typed just for me, thanks!
Preword: there were some misunderstandings in my text, so I'll try to be more specific and reference documentation/sources. And I have to apologize beforehand: sometimes I have too confrontational a "tone"; please don't take it badly if you don't like the way I'm saying something. Let's go.

———

I think we're essentially talking about the same stuff in principle, but on different levels, mine being more low-level avro, while I think you talk about the confluent "expansion pack". I do not talk about the confluent platform, and actually cannot use it at the moment, but that's not important, since all of that is possible in plain AVRO.

(I) Let's begin with message encoding. You mentioned 3:

1) message format (5 byte prefix) — *my comment: I believe this is not AVRO, but a nonstandardized expansion pack. The 5B prefix is a schema identifier (a magic byte plus a 4B schema ID assigned by the registry, not a hash of the schema itself); works with the schema registry. OK.*

2) "stream format" — *my comment: again, not avro but a confluent expansion pack. I know about its existence, however I did not see it anywhere and don't know how it is actually produced.*

3) "send the schema before every message" — *my comment: not sure what that is, but yes, I've heard about strategies of sending 2 kinds of messages, where one kind is "broadcasting" new schemas. Not sure if this has trivial support in pure AVRO; in principle it should be possible, but ...*

OK, so now let's add 2 strategies from pure avro:

4) sending just the bytes, schema does not change — well, this one is obvious

5) SINGLE OBJECT ENCODING (see [1] at the bottom for the link) — *comment: this is what I've been talking about in the previous mail, and in principle it's the AVRO way of doing variant 1). The first 2 bytes are a header identifying the header version (yep, variant 1, with just a bare schema ID and no header version, is incorrect to be honest), followed by the schema fingerprint relevant for the given header version, which is 8B, so 10B in total. (Currently there is just one header version, but there is room for future development, while variant 1 does not have this possibility.) It's worth noting that variants 1 and 5 are essentially the same, just 1 incorrect by design, while 5 is correct by design. In principle both are a binary identifier of the schema.*
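To make strategy 5 concrete, here is a minimal sketch using only classes shipped with plain avro since 1.8.2 (`org.apache.avro.message`). The `Money` record is the example schema from (III) below; I haven't run this exact snippet, so take it as an illustration, not a reference:

    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.message.BinaryMessageDecoder;
    import org.apache.avro.message.BinaryMessageEncoder;

    public class SingleObjectRoundTrip {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
                + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]}]}");

            GenericRecord money = new GenericData.Record(schema);
            money.put("value", "9.99");

            // encode() prepends the 2B header version marker (0xC3 0x01) and the
            // 8B CRC-64-AVRO fingerprint of the writer schema: the 10B header
            ByteBuffer bytes =
                new BinaryMessageEncoder<GenericRecord>(GenericData.get(), schema)
                    .encode(money);

            // the decoder reads the fingerprint from the header and looks the
            // writer schema up in its internal fingerprint->schema map
            BinaryMessageDecoder<GenericRecord> decoder =
                new BinaryMessageDecoder<>(GenericData.get(), schema); // reader schema
            decoder.addSchema(schema); // register every known writer version here
            GenericRecord back = decoder.decode(bytes);
            System.out.println(back);
        }
    }

The `addSchema()` call is exactly the local fingerprint->schema map I talk about in (III) below; the confluent registry just externalizes that map behind REST.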
(II) quote: "There's a concept of "canonicalization" in Avro, so if you have "the same" schema, but with fields out of order, or adding new fields, the canonical format is very good at keeping like-with-like."

I'm sorry, like-with-like is totally insufficient for me. I need 100% same-with-same, otherwise there can be huge problems and financial loss. Like-with-like is not acceptable. There must be a 100% guarantee that evolution will do just what it should do. You asked what my evolution would be: typically it will be just adding field(s), or renaming fields; let's say that we will follow the recommendations of the confluent platform [2]. In more detail it is documented in original avro [3], which I believe the confluent platform just delegates to, but [3] is really harder to read and understand in context.

(III) schema registry to the rescue(?) — ok, so what does the schema registry do? IIUC, it's just a trivial app which holds all known schema instances and provides a REST api to fetch schema data by the 5B ID which the deserializer obtained from the message data. Right? But in (I) I already explained that `message format` is almost the same as `single object encoding`; actually I think the avro team just took the confluent serializer and retrofitted it back into avro, fixing the `message format` problems along the way.

So in principle, if you build a map ID->Schema locally and keep it updated somehow, you will end up with the same functionality, just without the REST calls. Because the confluent SchemaRegistry is just that — an external map ID->schema, which you can read and write. Right? (Maybe with a potential DoS, when a kafka topic contains too many messages with schema IDs which the schema registry does not know about, and it gets flooded by REST requests; off topic.) So based on that, I don't really understand how the schema registry could change the landscape for me, since it's essentially the same; at most it might do some things a little bit differently in depth, which allows it to side-step AVRO gotchas. And that is what I'm searching for: a pure AVRO solution for schema evolution, and the recommended way to do that.

The problem I have with the pure avro solution is that the avro schema parser (`org.apache.avro.Schema.Parser`) will NOT read 2 schemas with the same "name" (`org.apache.avro.Schema.Name`), because "name" equality (`org.apache.avro.Schema.Name#equals`) is defined using the fully qualified type name. Ie. if you have a schema like:

    {
      "namespace": "avroTest",
      "type": "record",
      "name": "Money",
      "fields": [
        { "name": "value", "type": [ "null", "string" ] }
      ]
    }

the fully qualified name would be "avroTest.Money". And if you create a new version and add a new field, but do not change the namespace or name, the fully qualified name stays the same, and 1 instance of `org.apache.avro.Schema.Parser` will not be able to parse both of these versions, producing:

    throw new SchemaParseException("Can't redefine: "+name);

This is why I asked about a recommended naming scheme: because this limitation exists. So then you can only have 1 parser per schema, or different "names". The problem with schema identity is general: if you have a project with two avsc files with the same "name", the avro-maven-plugin will fail during compilation. That's the second reason why I asked about a naming scheme: it seems it is generally unsupported to have 2 versions of the same schema sharing the same namespace.name in 1 project. Maybe this is the reason to have a version ID somewhere in the namespace? If your app needs, for whichever reason, to send data in 2 different versions?
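To show the "Can't redefine" limitation concretely, together with the per-version-parser workaround, a minimal sketch; v2 is a hypothetical evolution of the Money schema that just adds one field with a default, per [2]/[3]:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaParseException;

    public class CantRedefineDemo {
        static final String V1 =
            "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
            + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]}]}";
        // v2 adds a field WITH a default, so it stays backward and forward
        // compatible; the fully qualified name avroTest.Money is unchanged
        static final String V2 =
            "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
            + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]},"
            + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"EUR\"}]}";

        public static void main(String[] args) {
            Schema.Parser parser = new Schema.Parser();
            Schema v1 = parser.parse(V1);
            try {
                parser.parse(V2); // same parser, same avroTest.Money
            } catch (SchemaParseException e) {
                System.out.println(e.getMessage()); // Can't redefine: avroTest.Money
            }
            // the workaround I mean: a fresh Parser instance per schema version
            Schema v2 = new Schema.Parser().parse(V2);
        }
    }

So one Parser per version (plus its imports) is the only way I see to hold both versions at once.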
(IV) the part where I lost you. Let me try to explain it, then. (Quoting my previous mail, with *comments* inline.)

"I really don't know how this works/should work, as there are close to no complete actual examples, and the documentation does not help much. For example if avro schema evolves from v1 to v2" — *ok, so you have a schema, let's call it `v1`, and you add a field respecting [2]/[3], and you have a second schema, v2.*

"and the type names and naming scheme aren't the same, how will the pairing between fields be made?? Completely puzzling." — *ok, so it's not that puzzling; it's explained in [3]. But as explained above, schemas v1 and v2 cannot be parsed using the same Schema.Parser because of the AVRO implementation.*

"I need no less than schema evolution with backward and forward compatibility" — *schema evolution is clear, I suppose, ie. changing the original schema to something different; and compatibility, backward and forward, is achieved by AVRO itself. The deserializer needs to somehow identify the writer schema (trivially in strategies 1 or 5), then the data are deserialized using the writer schema and evolved to the desired reader schema. Not sure where/if this is documented, but you can check the sources (org.apache.avro.specific.SpecificDatumReader):

    /** Construct given writer's and reader's schema. */
    public SpecificDatumReader(Schema writer, Schema reader) {
        this(writer, reader, SpecificData.get());
    }

*

"with schema reuse (ie. no hacks with top level union, but schema reuse using schema imports)" — *about "schema reuse": this is my term, as it's not documented sufficiently. Sometimes you want to define a type which is referenced multiple times from other schemas, potentially from different files. The typical, superbly ugly and problematic, hack (recommended by ... some guys) is to define the avro schema to have a top-level union [4] instead of a record, and cram everything into 1 big file. But that is completely wrong. The correct way is to define that type in separate files, and either parse them in the correct order or use `<imports>` in the avro-maven-plugin; see the sketch just below.*

I think I can hack my way through by using one parser per set of {1 schema of a given version + all needed imports}, which will make everything work (well, I don't yet know about anything which will fail), but it completely does not feel right. And I would like to know what the correct avro way is. And I suppose it should be possible without the confluent schema registry, just with single object encoding, as I cannot see any difference between them, but please correct me if I'm wrong. *I cannot see anything non-avro written here, and nothing java-specific either; these are all pure avro constructs.*
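And to illustrate what I mean by reuse without the union hack, a minimal sketch with the schemas inlined as strings instead of separate avsc files (the `Invoice` type is a made-up example): one Parser, fed in dependency order, so the second schema can reference the first by fully qualified name:

    import org.apache.avro.Schema;

    public class SchemaReuseDemo {
        public static void main(String[] args) {
            Schema.Parser parser = new Schema.Parser();
            // parse the shared type first; the parser remembers avroTest.Money
            Schema money = parser.parse(
                "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Money\","
                + "\"fields\":[{\"name\":\"value\",\"type\":[\"null\",\"string\"]}]}");
            // a schema from another file can then reference it by full name,
            // no top-level union needed
            Schema invoice = parser.parse(
                "{\"namespace\":\"avroTest\",\"type\":\"record\",\"name\":\"Invoice\","
                + "\"fields\":[{\"name\":\"total\",\"type\":\"avroTest.Money\"}]}");
            System.out.println(invoice.toString(true));
        }
    }

With real files this is "parse in the correct order", or `<imports>` in the avro-maven-plugin. But note the same parser can then hold only ONE version of avroTest.Money, which is exactly the friction between reuse and versioning described above.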
(V) ad: "Maybe it'd help to know what "evolution" you plan, and what type names and name schemas you plan to be changing? The "schema evolution" is mostly meant to make it easier to add and remove fields from the schemas without having to coordinate deploys and juggle iron-clad contract interchange formats. It's not meant for wild rewrites of the contract IDLs on active running services!"

About our usage: we have N running services which currently communicate using avro. We need to be able to redeploy services independently; that's the reason for backward and forward compatibility: service X upgrades and starts sending data using the upgraded schema, but the old services must be able to consume it! And after service X is upgraded, there are myriads of records produced while it was down for a while, which must be processed. So yes, it's mostly just adding/removing a column. This should be working and possible using just avro — well, that's how it's sold on their website. I understand that maybe with the confluent bonus-track code it might work correctly, but some of our services cannot use that, so we are stuck with plain avro. That should not be a problem, though; single-object-encoding and the schema registry should do the same thing, and avro should work even without the confluent platform.

M.

Links:
[1] https://avro.apache.org/docs/current/spec.html#single_object_encoding
[2] https://docs.confluent.io/current/schema-registry/avro.html
[3] https://avro.apache.org/docs/current/spec.html#Schema+Resolution
[4] https://avro.apache.org/docs/current/spec.html#Unions

On Tue, 31 Dec 2019 at 20:49, Lee Hambley <lee.hamb...@gmail.com> wrote:

> Hi Martin,
>
> Vance already said it all, but let me see if I can elaborate a bit.
>
>> I don't understand avro sufficiently and don't know schema registry at
>> all, actually. So maybe the following questions will be dumb.
>>
>> a) how is schema registry with 5B header different from single object
>> encoding with 10B header?
>>
>
> Not sure what this 10B header is. Broadly speaking there's three ways to
> send header info with avro, plus the secret 4th way (don't: the schema
> doesn't change, reader and writer both have it).
>
> 1. Message format (5 byte prefix, with a schema registry; the header
> carries just the schema/version lookup info for the registry)
> 2. Stream format (?) (naming is for sure wrong; this sends the schema
> before sending any records, then an arbitrary (unlimited?) number of
> records; useful for archiving homogeneous data)
> 3. Send the schema before every message (might be flexible, but could
> negate any bandwidth savings)
>
> All three of these have names, and they're all recommended in certain
> circumstances; even without going deep, and with my weak executive
> summaries, I believe you could already imagine how they might be useful.
>
>> b) will schema registry somehow relieve me from having to parse individual
>> schemas? What if I want to/have to send 2 different versions of a certain
>> schema?
>>
>
> There's a concept of "canonicalization" in Avro, so if you have "the same"
> schema, but with fields out of order, or adding new fields, the canonical
> format is very good at keeping like-with-like. Libraries for the registry
> will absolve you of doing any parsing.
>
> Usually you configure a writer (producer, whatever) with a registry URL,
> a payload in a map/hash, and the "current" schemas; the library you use
> will canonicalize the schema, make sure it exists in the registry, and
> emit a binary avro payload referencing the schema.
>
> The reader needs no local schema files; it will receive a message with a
> 5B prefix and will look up that schema at that version in the registry,
> and will give you back a hash/map with the data. If you added a field to
> the producer before adding it to the consumer, you may have an extra
> member in the map that you don't know how to handle yet, or you might have
> an empty value that you don't know how to deal with if the consumer
> "knows" more fields than the producer.
>
> You solve this "problem" with the regular approach you would use in any
> code with untrusted data.
>
>> c) actually what I have here is (seemingly) a pretty similar setup (and
>> btw, which was recommended here as an alternative to confluent schema
>> registry): it's a registry without an extra service. A trivial map keyed
>> by the single object encoding schema fingerprint (a long), pairing
>> fingerprint to schema. So when the bytes "arrive" I can easily read the
>> header, find out the fingerprint, get hold of the schema and decode the
>> data. Trivial. But the snag is that a single Schema.Names instance can
>> contain just one Name of a given "identity", and equality is based on the
>> fully qualified type, ie. namespace and name. Thus if you have a schema in
>> 2 versions which share the same namespace and name, they cannot be parsed
>> using the same Parser. Does the schema registry (from confluent platform,
>> right?) work differently than this? Does this "use it for decoding"
>> process bypass avro's new Schema.Parser().parse and everything beneath it?
>>
>
> It's not idiomatic to put "v2" or anything in the schema namespace, unless
> someone is coaching you to avoid the schema registry approach (which, as a
> few of us have mentioned, is one principal reason to use avro). I've been
> in a company who have a v2 namespace in avro, but it's the last "v" we'll
> ever have.
> In v1 we didn't use a schema registry, in v2 we do, and the
> registry ensures readers and writers can always talk, and we just need to
> be mindful of
>
> FWIW we have one schema registry in each of our environments (prod,
> staging, qa); in retrospect we think this might have been a mistake, as
> e.g. the QA env doesn't keep any history, so we often fail to test older
> payloads in our test environments. But tbh it hasn't caused any _real_
> problems yet; still, it's something I would consider approaching with a
> global registry (fed by my CI system?) in the future.
>
>> I really don't know how this works/should work, as there are close to no
>> complete actual examples and the documentation does not help much. For
>> example if avro schema evolves from v1 to v2, and the type names and
>> naming scheme aren't the same, how will the pairing between fields be
>> made?? Completely puzzling. I need no less than schema evolution with
>> backward and forward compatibility with schema reuse (ie. no hacks with
>> top level union, but schema reuse using schema imports). I think I can
>> hack my way through, by using one parser per set of 1 schema of a given
>> version and all needed imports, which will make everything work (well, I
>> don't yet know about anything which will fail), but it completely does
>> not feel right. And I would like to know what the correct avro way is.
>> And I suppose it should be possible without the confluent schema
>> registry, just with single object encoding, as I cannot see any
>> difference between them, but please correct me if I'm wrong.
>>
>
> You lost me here; I think you're maybe crossing some vocabulary from your
> language stack, not from Avro per se, but I'm coming at Avro from Ruby and
> Node (yikes.) and have never used any JVM language integration, so assume
> this is ignorance on my part.
>
> Maybe it'd help to know what "evolution" you plan, and what type names and
> naming schemes you plan to be changing? The "schema evolution" is mostly
> meant to make it easier to add and remove fields from the schemas without
> having to coordinate deploys and juggle iron-clad contract interchange
> formats. It's not meant for wild rewrites of the contract IDLs on active
> running services!
>
> All the best for 2020, anyone else who happens to be reading mailing list
> emails this NYE!
>
>> thanks,
>> Mar.
>>
>> On Mon, 30 Dec 2019 at 20:32, Lee Hambley <lee.hamb...@gmail.com> wrote:
>>
>>> Hi Martin,
>>>
>>> I believe the answer is "just use the schema registry". When you then
>>> encode for the network, your library should give you a binary package
>>> with a 5 byte header that includes the schema version and name from the
>>> registry. The reader will then go to the registry, find that schema at
>>> that version, and use it for decoding.
>>>
>>> In my experience the naming etc. doesn't matter; only things like
>>> defaults in enums need to be given a thought, but you'll see that for
>>> yourself with experience.
>>>
>>> HTH, Regards,
>>>
>>> Lee Hambley
>>> http://lee.hambley.name/
>>> +49 (0) 170 298 5667
>>>
>>>
>>> On Mon, 30 Dec 2019 at 17:26, Martin Mucha <alfon...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I'm relatively new to avro, and I'm still struggling with getting
>>>> schema evolution and related issues right. But today it should be a
>>>> simple question.
>>>>
>>>> What is the recommended naming of types if we want to use schema
>>>> evolution? Should the namespace contain some information about the
>>>> version of the schema? Or should it be in the type itself?
>>>> Or neither? What is the best practice? Is evolution even possible if
>>>> the namespace/type name is different?
>>>>
>>>> I thought that "neither" is the case, and built the app so that the
>>>> version ID is nowhere except in the directory structure; only the
>>>> latest version is compiled to java classes using the maven plugin, and
>>>> all other avsc files are parsed in code (to be able to build some sort
>>>> of schema registry, identify the used writer schema using single
>>>> object encoding, and use schema evolution). However, I used a separate
>>>> Parser instance to parse each schema. But if one would like to use
>>>> schema imports, one cannot have a separate parser for every schema,
>>>> and having a global one in this setup is also not possible, as each
>>>> type can be registered just once in org.apache.avro.Schema.Names. Btw.
>>>> I favored this variant (ie. no ID in name/namespace) because in this
>>>> setup, after I introduce a new schema version, I do not have to change
>>>> imports in the whole project, but just one line in pom.xml saying
>>>> which directory should be compiled into java files.
>>>>
>>>> So what could be the suggestion for a correct naming-versioning
>>>> scheme?
>>>> thanks,
>>>> M.
>>>