Re: Recomended naming of types to support for schema evolution

Lee Hambley Tue, 31 Dec 2019 11:49:32 -0800

Hi Martin,

Vance already said it all, but let me see if I can elaborate a bit.

> I don't understand avro sufficiently and don't know schema registry at all,
> actually. So maybe following questions will be dumb.
>
> a) how is schema registry with 5B header different from single object
> encoding with 10B header?
>

Not sure what this 10B header is. Broadly speaking there are three ways to
send header info with avro, plus the secret 4th way (don't; the schema
doesn't change, and reader and writer both have it).

1. Message format (5 byte prefix, with a schema registry; the header carries
just the schema/version lookup info for the registry)
2. Stream format (?) (naming is for sure wrong; this sends the schema before
sending any records, then an arbitrary (unlimited?) number of records,
useful for archiving homogeneous data)
3. Send the schema before every message (might be flexible, but could
negate any bandwidth savings)

All three of these have names, and they're all recommended in certain
circumstances. Even without going deep, and with my weak executive
summaries, I believe you can already imagine how they might be useful.
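
For reference, the two header sizes from the question can be sketched in a
few lines. This is only an illustration, and it assumes that the 5-byte
prefix is the Confluent-style framing (one zero "magic" byte plus a 4-byte
big-endian schema ID) and that the 10-byte header is Avro's single-object
encoding (the marker bytes C3 01 plus an 8-byte little-endian schema
fingerprint). The function names are made up for the example, and the body
is treated as opaque bytes standing in for a binary-encoded record.

```python
import struct

CONFLUENT_MAGIC = 0                  # assumed: one zero byte opens the 5-byte prefix
SINGLE_OBJECT_MARKER = b"\xc3\x01"   # assumed: two-byte marker of single-object encoding


def frame_with_registry_id(schema_id: int, body: bytes) -> bytes:
    """Prefix the body with 1 magic byte + 4-byte big-endian schema ID (5 bytes)."""
    return struct.pack(">BI", CONFLUENT_MAGIC, schema_id) + body


def frame_single_object(fingerprint: int, body: bytes) -> bytes:
    """Prefix the body with the 2-byte marker + 8-byte little-endian fingerprint (10 bytes)."""
    return SINGLE_OBJECT_MARKER + struct.pack("<Q", fingerprint) + body


def parse_header(message: bytes):
    """Return (kind, schema key, body) for either framing (hypothetical helper)."""
    if len(message) >= 10 and message[:2] == SINGLE_OBJECT_MARKER:
        (fingerprint,) = struct.unpack("<Q", message[2:10])
        return "single-object", fingerprint, message[10:]
    if len(message) >= 5 and message[0] == CONFLUENT_MAGIC:
        (schema_id,) = struct.unpack(">I", message[1:5])
        return "registry", schema_id, message[5:]
    raise ValueError("unrecognised header")
```

In both cases the reader does the same thing: strip the header, use the key
(registry ID or fingerprint) to find the writer's schema, then decode the
remaining bytes with it. The difference is mostly where the lookup goes, a
registry service in one case and whatever fingerprint-to-schema map you
maintain yourself in the other.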

> b) will schema registry somehow relieve me from having to parse individual
> schemas? What if I want to/have to send 2 different version of certain
> schema?
>

There's a concept of "canonicalization" in Avro, so if you have "the same"
schema, but with fields out of order, or adding new fields, the canonical
format is very good at keeping like-with-like. Libraries for the registry
will absolve you of doing any parsing.

Usually you configure a writer (producer, whatever) with a registry URL
and the "current" schemas, and hand it a payload in a map/hash. The library
you use will canonicalize the schema, make sure it exists in the registry,
and emit a binary avro payload referencing the schema.

The reader needs no local schema files; it will receive a message with a
5-byte prefix, look up that schema at that version in the registry, and
give you back a hash/map with the data. If you added a field to the
producer before adding it to the consumer, you may have an extra member in
the map that you don't know how to handle yet; if the consumer "knows" more
fields than the producer, you might have an empty value that you don't know
how to deal with.

You solve this "problem" with the same approach you would use in any code
that handles untrusted data.
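
As a rough illustration of that approach, a consumer can treat the decoded
map like any other external input: pick out the fields it knows, fall back
to a default for optional ones that are missing, and set aside anything it
does not recognise. The field names, defaults, and helper name below are
invented for the example; nothing here is specific to Avro or to any
registry client.

```python
REQUIRED = object()  # sentinel: the field has no fallback value

# Hypothetical view of what this consumer currently understands.
EXPECTED_FIELDS = {
    "id": REQUIRED,
    "email": REQUIRED,
    "nickname": "",   # optional, falls back to an empty string
}


def read_defensively(decoded: dict):
    """Split a decoded record into (known fields, unrecognised extras)."""
    known = {}
    for name, default in EXPECTED_FIELDS.items():
        value = decoded.get(name)
        if value is None:
            # Missing or empty: use the fallback, or refuse if there is none.
            if default is REQUIRED:
                raise ValueError(f"missing required field: {name}")
            value = default
        known[name] = value
    # Fields a newer producer added that this consumer does not know yet.
    unknown = {k: v for k, v in decoded.items() if k not in EXPECTED_FIELDS}
    return known, unknown
```

The extras can be logged or ignored until the consumer is updated; the
point is only that neither an unexpected member nor an empty value should
crash the reader.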

> c) actually what I have here is (seemingly) pretty similar setup (and btw,
> which was recommended here as an alternative to confluent schema registry):
> it's a registry without an extra service. Trivial map mapping single object
> encoding long[data type] schema fingerprint, pairing schema fingerprint to
> schema. So when the bytes "arrive" I can easily read header, find out
> fingerprint, get hold onto schema and decode it. Trivial. But the snag is,
> that single Schema.Names instance can contain just one Name of given
> "identity", and equality is based on fully qualified type, ie. namespace
> and name. Thus if you have schema in 2 versions, which does have same
> namespace and name, they cannot be parsed using same Parser. Does schema
> registry (from confluent platform, right?) work differently than this? Does
> this "use it for decoding" process bypasses avros new Schema.Parser().parse
> and everything beneath it?
>

It's not idiomatic to put "v2" or anything in the schema namespace, unless
someone is coaching you to avoid the schema registry approach (which, as a
few of us have mentioned, is one principal reason to use avro). I've been
at a company that has a v2 namespace in avro, but it's the last "v" we'll
ever have. In v1 we didn't use a schema registry, in v2 we do, and the
registry ensures readers and writers can always talk, and we just need to
be mindful of

FWIW we have one schema registry in each of our environments (prod,
staging, qa). In retrospect we think this might have been a mistake, as
e.g. the QA env doesn't keep any history, so we often fail to test older
payloads in our test environments. Tbh it hasn't caused any _real_
problems yet, but it's something I would consider approaching with a global
registry (fed by my CI system?) in the future.

> ~ I really don't know how this work/should work, as there are close to no
> complete actual examples and documentation does not help much. For example
> if avro schema evolves from v1 to v2, and the type names and nameschema
> aren't the same, how will be the pairing between fields made ?? Completely
> puzzling. I need no less than schema evolution with backward and forward
> compatibility with schema reuse (ie. no hacks with top level union, but
> schema reusing using schema imports). I think I can hack my way through, by
> using one parser per set of 1 schema of given version and all needed
> imports, which will make everything working (well I don't yet know about
> anything which will fail), but it completely does not feel right. And I
> would like to know, what is the correct avro way. And I suppose it should be
> possible without confluent schema registry, just with single object
> encoding as I cannot see any difference between them, but please correct me
> if I'm wrong.
>

You lost me here, I think you're maybe crossing some vocabulary from your
language stack, not from Avro per-se, but I'm coming at Avro from Ruby and
Node (yikes.) and have never used any JVM language integration, so assume
this is ignorance on my part.

Maybe it'd help to know what "evolution" you plan, and what type names and
name schemas you plan to be changing? The "schema evolution" is mostly
meant to make it easier to add and remove fields from the schemas without
having to coordinate deploys and juggle iron-clad contract interchange
formats. It's not meant for wild rewrites of the contract IDLs on active
running services!
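
On the question of how fields get paired between versions: the Avro
specification's schema resolution rules for records match fields by name.
Writer fields the reader's schema does not list are ignored, and reader
fields the writer did not send take the reader's default, with an error if
there is none. Below is a toy sketch of just that rule, working on already
decoded maps and ignoring aliases, type promotion, and nested records. The
schemas and the helper name are made up for the example; a real Avro
library does this during decoding.

```python
def resolve_record(writer_datum: dict, reader_schema: dict) -> dict:
    """Project data written with one schema version onto the reader's schema."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_datum:
            # Same field name on both sides: take the written value.
            result[name] = writer_datum[name]
        elif "default" in field:
            # Reader knows a field the writer did not send: use the default.
            result[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field: {name}")
    # Writer fields absent from the reader schema are simply dropped.
    return result


# Hypothetical "v2" reader schema: adds nickname (with default), drops legacy_flag.
reader_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        {"name": "nickname", "type": "string", "default": ""},
    ],
}

written_by_v1 = {"id": 7, "email": "m@example.com", "legacy_flag": True}
resolved = resolve_record(written_by_v1, reader_v2)
```

This is why adding a field with a default, or removing one, can be rolled
out without coordinating deploys, while adding a field without a default
breaks readers of older data.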

All the best for 2020 to anyone else who happens to be reading mailing list
emails this NYE!

> thanks,
> Mar.
>
> On Mon, 30 Dec 2019 at 20:32, Lee Hambley <lee.hamb...@gmail.com> wrote:
>
>> Hi Martin,
>>
>> I believe the answer is "just use the schema registry". When you then
>> encode for the network your library should give you a binary package with a
>> 5 byte header that includes the schema version and name from the registry.
>> The reader will then go to the registry and find that schema at that
>> version and use it for decoding.
>>
>> In my experience the naming/etc doesn't matter, only things like defaults
>> in enums and things need to be given a thought, but you'll see that for
>> yourself with experience.
>>
>> HTH, Regards,
>>
>> Lee Hambley
>> http://lee.hambley.name/
>> +49 (0) 170 298 5667
>>
>>
>> On Mon, 30 Dec 2019 at 17:26, Martin Mucha <alfon...@gmail.com> wrote:
>>
>>> Hi,
>>> I'm relatively new to avro, and I'm still struggling with getting schema
>>> evolution and related issues. But today it should be simple question.
>>>
>>> What is recommended naming of types if we want to use schema evolution?
>>> Should namespace contain some information about version of schema? Or
>>> should it be in type itself? Or neither? What is the best practice? Is
>>> evolution even possible if namespace/type name is different?
>>>
>>> I thought that "neither" is the case, built the app so that version ID
>>> is nowhere except for the directory structure, only latest version is
>>> compiled to java classes using maven plugin, and parsed all other avsc
>>> files in code (to be able to build some sort of schema registry, identify
>>> used writer schema using single object encoding and use schema evolution).
>>> However I used separate Parser instance to parse each schema. But if one
>>> would like to use schema imports, he cannot have separate parser for every
>>> schema, and having global one in this setup is also not possible, as each
>>> type can be registered just once in org.apache.avro.Schema.Names. Btw. I
>>> favored this variant(ie. no ID in name/namespace) because in this setup,
>>> after I introduce new schema version, I do not have to change imports in
>>> whole project, but just one line in pom.xml saying which directory should
>>> be compiled into java files.
>>>
>>> so what could be the suggestion to correct naming-versioning scheme?
>>> thanks,
>>> M.
>>>
>>
