Re: Avro union compatibility mode enhancement proposal

Zoltan Farkas Sun, 17 Apr 2016 04:50:58 -0700

+1 , it would resolve AVRO-1340 as well!

--z




> On Apr 16, 2016, at 10:26 PM, Ryan Blue <[email protected]> wrote:
> 
> +1
> 
> On Sat, Apr 16, 2016 at 5:54 PM, Matthieu Monsch <[email protected]>
> wrote:
> 
>>> I think it may make sense for
>>> enums to add a special value for it.
>> 
>> That would work great. To give slightly more flexibility to users, we
>> could even allow the reader’s schema to explicitly specify which symbol to
>> use when it reads an unknown symbol. If not specified, resolution would
>> fail (consistent with the current behavior).
>> 
>> For example (assuming the corresponding key is called “onUnknown”):
>> 
>>> {
>>>  “type”: “enum”,
>>>  “name”: “Platform”,
>>>  “symbols”: [“UNSUPPORTED”, “ANDROID”, “IOS”],
>>>  “onUnknown”: “UNSUPPORTED"
>>> }
>> 
>> This `enum` would then be able to resolve schemas with extra symbols (read
>> as `UNSUPPORTED`).
>> 
>> 
>> 
>>>> On Apr 16, 2016, at 1:36 PM, Ryan Blue <[email protected]>
>>> wrote:
>>> 
>>> Matthieu, how would that work with enums? I think it may make sense for
>>> enums to add a special value for it.
>>> 
>>> rb
>>> 
>>> On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch <[email protected]>
>>> wrote:
>>> 
>>>>> I think that substituting different data for
>>>>> unrecognized branches in a union isn't the way to fix the problem, but
>> I
>>>>> have a proposal for a way to fix it that looks more like you'd expect
>> in
>>>>> the OO example above: by adding the Vehicle class to your read schema.
>>>>> 
>>>>> Right now, unions are first resolved by full class name as required by
>>>> the
>>>>> spec. But after that, we have some additional rules to match schemas.
>>>> These
>>>>> rules are how we reconcile name differences from situations like
>> writing
>>>>> with a generic class and reading with a specific class. I'm proposing
>> you
>>>>> use a catch-all class (the superclass) with fields that are in all of
>> the
>>>>> union's branches, and we update schema resolution to allow it.
>>>> 
>>>> 
>>>> That sounds good. The only thing I’d add is to make this catch-all
>>>> behavior explicit in the schema (similar to how field defaults must be
>>>> explicitly added).
>>>> To help fix another common writer evolution issue, we could also add a
>>>> similar catch-all for `enum`s (optional, to be explicitly specified in
>> the
>>>> schema).
>>>> -Matthieu
>>>> 
>>>> 
>>>> 
>>>>>> On Apr 12, 2016, at 2:41 PM, Ryan Blue <[email protected]>
>>>>> wrote:
>>>>> 
>>>>> Yacine,
>>>>> 
>>>>> Thanks for the extra information. Sorry for my delay in replying, I
>>>> wanted
>>>>> to time think about your suggestion.
>>>>> 
>>>>> I see what you mean that you can think of a union as the superclass of
>>>> its
>>>>> options. The reflect object model has test code that does just that,
>>>> where
>>>>> classes B and C inherit from A and the schema for A is created as a
>> union
>>>>> of B and C. But, I don't think that your suggestion aligns with the
>>>>> expectations of object oriented design. Maybe that's an easier way to
>>>>> present my concern:
>>>>> 
>>>>> Say I have a class, Vehicle, with subclasses Car and Truck. I have
>>>>> applications that work with my dataset, the vehicles that my company
>>>> owns,
>>>>> and we buy a bus. If I add a Bus class, what normally happens is that
>> an
>>>>> older application can work with it. A maintenance tracker would call
>>>>> getVehicles and can still get the lastMaintenanceDate for my Bus, even
>>>>> though it doesn't know about busses. But what you suggest is that it is
>>>>> replaced with a default, say null, in cases like this.
>>>>> 
>>>>> I think that the problem is that Avro no equivalent concept of
>>>> inheritance.
>>>>> There's only one way to represent it for what you need right now, like
>>>>> Matthieu suggested. I think that substituting different data for
>>>>> unrecognized branches in a union isn't the way to fix the problem, but
>> I
>>>>> have a proposal for a way to fix it that looks more like you'd expect
>> in
>>>>> the OO example above: by adding the Vehicle class to your read schema.
>>>>> 
>>>>> Right now, unions are first resolved by full class name as required by
>>>> the
>>>>> spec. But after that, we have some additional rules to match schemas.
>>>> These
>>>>> rules are how we reconcile name differences from situations like
>> writing
>>>>> with a generic class and reading with a specific class. I'm proposing
>> you
>>>>> use a catch-all class (the superclass) with fields that are in all of
>> the
>>>>> union's branches, and we update schema resolution to allow it. So you'd
>>>>> have...
>>>>> 
>>>>> record Vehicle {
>>>>> long lastMaintenanceDate;
>>>>> }
>>>>> 
>>>>> record Car {
>>>>> long lastMaintenanceDate;
>>>>> int maxChildSeats;
>>>>> }
>>>>> 
>>>>> record Truck {
>>>>> long lastMainetanceDate;
>>>>> long lastWashedDate;
>>>>> }
>>>>> 
>>>>> Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The
>>>>> catch-all class could provide the behavior you want, without silently
>>>>> giving you incorrect data.
>>>>> 
>>>>> Does that sound like a reasonable solution for your use case?
>>>>> 
>>>>> rb
>>>>> 
>>>>> On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <[email protected]
>>> 
>>>>> wrote:
>>>>> 
>>>>>> Hi Yacine,
>>>>>> 
>>>>>> I believe Ryan was mentioning that if you start from a schema
>> `[“null”,
>>>>>> “Car”]` then rather than add a new bus branch to the union, you could
>>>>>> update the car’s schema to include new bus fields. For example (using
>>>> IDL
>>>>>> notation):
>>>>>> 
>>>>>>> // Initial approach
>>>>>>> // Evolved to union { null, Car, Bus }
>>>>>>> record Car {
>>>>>>> string vin;
>>>>>>> }
>>>>>>> record Bus {
>>>>>>> string vin;
>>>>>>> int capacity;
>>>>>>> }
>>>>>>> 
>>>>>>> // Alternative approach
>>>>>>> // Here evolution wouldn't change the union { null, Vehicle }
>>>>>>> // Note also that this is actually a compatible evolution from the
>>>> first
>>>>>> approach, even under current rules (using aliases for name changes).
>>>>>>> record Vehicle {
>>>>>>> string vin; // (Would be optional if not all branches had it.)
>>>>>>> union { null, int } capacity; // The new field is added here.
>>>>>>> }
>>>>>> 
>>>>>> Pushing this further, you could also directly start upfront with just
>> a
>>>>>> “Vehicle” record schema with an optional `car` field and add new
>>>> optional
>>>>>> fields as necessary (for example a `bus` field). This gives you much
>>>>>> greater flexibility for evolving readers and writers separately (but
>> you
>>>>>> lose the ability to enforce that at most one field is ever present in
>>>> the
>>>>>> record).
>>>>>> 
>>>>>> I agree that Avro’s current evolution rules are more geared towards
>>>> reader
>>>>>> evolution: evolving writers while keeping readers constant is
>> difficult
>>>>>> (namely because they don't support adding branches to unions or
>> symbols
>>>> to
>>>>>> enums). However, adding a completely new and separate "compatibility
>>>> mode”
>>>>>> has a high complexity cost; both for implementors (the separate
>> classes
>>>> you
>>>>>> mention) and users (who must invoke them specifically). It would be
>>>> best to
>>>>>> keep evolution rules universal.
>>>>>> 
>>>>>> Maybe we could extend the logic behind field defaults? Enum schemas
>>>> could
>>>>>> simply contain a default symbol attribute. For unions, it’s a bit
>>>> trickier,
>>>>>> but we should be able to get around with a default branch attribute
>>>> which
>>>>>> would allow evolution if and only if the newly added branches can be
>>>> read
>>>>>> as the default branch. Assuming the attribute names don’t already
>> exist,
>>>>>> this would be compatible with the current behavior.
>>>>>> 
>>>>>> I think this use-case is common enough that it’d be worth exploring
>> (for
>>>>>> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address
>> this
>>>>>> issue; though not completely unfortunately [2]).
>>>>>> 
>>>>>> -Matthieu
>>>>>> 
>>>>>> [1] https://github.com/linkedin/rest.li
>>>>>> [2]
>> https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>> Hi Ryan,
>>>>>>> 
>>>>>>> Just a little up^^. Could you please (or anyone else) give me a
>> little
>>>>>>> feedback ?
>>>>>>> Thanks in advance.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Yacine.
>>>>>>> 
>>>>>>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane <
>>>>>> [email protected]
>>>>>>>> :
>>>>>>> 
>>>>>>>> Hi Ryan,
>>>>>>>> 
>>>>>>>> Thank you for giving feedback. I will try in the following to
>> provide
>>>>>> you
>>>>>>>> some more details about the addressed problem.
>>>>>>>> 
>>>>>>>> But before that, just a brief reminder of the context. Avro has been
>>>>>>>> chosen in this project (and by many other ones for sure) especially
>>>> for
>>>>>> a
>>>>>>>> very important feature: enabling forward and backward compatibility
>>>>>>>> management through schema life-cycle. Our development model involves
>>>>>>>> intensive usage of this feature, and many heavy developments are
>> made
>>>> in
>>>>>>>> parallel streams inside feature teams that share the same schema,
>>>>>> provided
>>>>>>>> the evolution of the latter complies with the stated compatibility
>>>>>> rules.
>>>>>>>> This implies that all the entities supported by the Avro schema must
>>>>>>>> support the two-way compatibility, including unions. However, in the
>>>>>>>> special case of the union, this two-way compatibility is not well
>>>>>> supported
>>>>>>>> by the current rules. Let me explain you the basement of our point
>> of
>>>>>> view,
>>>>>>>> it remains quite simple.
>>>>>>>> 
>>>>>>>> The use case is to have, for example,
>>>>>>>> - a first union version A:
>>>>>>>>   { "name": "Vehicle",
>>>>>>>>     "type": ["null", "Car"] }
>>>>>>>> - a new version of it B:
>>>>>>>>   { "name": "Vehicle",
>>>>>>>>     "type": ["null", "Car", "Bus"] }
>>>>>>>> For being forward compatible, an evolution of the union schema must
>>>>>>>> guarantee that an old reader reading with A can read the data
>> written
>>>>>> with
>>>>>>>> the new schema B. Getting an error just means that the forward
>>>>>>>> compatibility feature is broken. But this is not actually the case
>>>> (and
>>>>>>>> this behavior is not suitable), because the old reader has a correct
>>>>>> schema
>>>>>>>> and this schema has evolved naturally to version B to incorporate a
>>>> new
>>>>>>>> Vehicle type. Not knowing this new type must not produce an error,
>> but
>>>>>> just
>>>>>>>> give the reader a default value, which means: "Either the data is
>> not
>>>>>> there
>>>>>>>> or you do not know how to handle it".
>>>>>>>> 
>>>>>>>> This is thought while keeping in mind that in an object-oriented
>> code
>>>>>>>> modeling, a union field is seen as a class member with the higher
>>>> level
>>>>>>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore, it
>> is
>>>>>>>> natural for a modeler / programmer to implement the ability of not
>>>>>> getting
>>>>>>>> the awaited types and using some default value of known type. To
>> give
>>>> a
>>>>>>>> more complete specification, the new mode of compatibility has to
>>>> impose
>>>>>>>> one rule: the union default value must not change through versions
>> and
>>>>>> the
>>>>>>>> corresponding type must be placed at the top of the types list. This
>>>> is
>>>>>>>> much easier to handle by development streams, because it is
>> addressed
>>>>>> once
>>>>>>>> for all in the very beginning of the schema life-cycle, than the
>> fact
>>>> to
>>>>>>>> oblige a number of teams, among which some are just not in place
>>>>>> anymore,
>>>>>>>> to update the whole code just because another dev team has deployed
>> a
>>>>>> new
>>>>>>>> version of the union in the schema.
>>>>>>>> 
>>>>>>>> Now, for being backward compatible, the reader with B must always be
>>>>>> able
>>>>>>>> to read data written with schema A. Even if the type included in the
>>>>>> data
>>>>>>>> is not known, so it gets the default value and not an error.
>>>>>>>> 
>>>>>>>> I understand that getting an error could make sense when the
>> requested
>>>>>>>> field is not present. However, this behavior:
>>>>>>>> 
>>>>>>>> - is very restrictive, meaning: this obliges the old reader to
>> update
>>>>>>>> its code for integrating the new schema, while he is not managing to
>>>>>> do it
>>>>>>>> for many reasons: development stream of next delivery is not
>>>>>> finished, or
>>>>>>>> not engaged, or not even planned - in the case of old and stable
>> code
>>>>>>>> - breaks the forward compatibility feature: the older reader is not
>>>>>>>> able to read the new version of the union without getting an error
>>>>>>>> - breaks the backward compatibility feature: the new reader is not
>>>>>>>> able to read an old version containing unknown types of the union
>>>>>> without
>>>>>>>> getting an error
>>>>>>>> 
>>>>>>>> By the way, what do you exactly mean by "pushing evolution lower"
>> and
>>>>>>>> "update the record"? Could you please give me an example of the
>> trick
>>>>>> you
>>>>>>>> are talking about?
>>>>>>>> 
>>>>>>>> Just to be a bit more precise, we are not targeting to use a
>> "trick".
>>>>>> This
>>>>>>>> life-cycle management should be included in a standard so to keep
>> the
>>>>>>>> software development clean, production safe and compliant with a
>>>> complex
>>>>>>>> product road-map.
>>>>>>>> 
>>>>>>>> Finally, you seem to be concerned by the "significant change to the
>>>>>>>> current evolution rules". Well, we actually do not change these
>> rules,
>>>>>> they
>>>>>>>> keep just the same. All we are proposing is to introduce a *mode*
>>>> where
>>>>>>>> the rules of union compatibility change. This mode is materialized
>> by
>>>> a
>>>>>>>> minimum and thin impact of the existing classes without any change
>> in
>>>>>> the
>>>>>>>> behavior, all the logic of the new compatibility mode is implemented
>>>> by
>>>>>> new
>>>>>>>> classes that must be invoked specifically. But you would better see
>> it
>>>>>> in
>>>>>>>> the code patch.
>>>>>>>> 
>>>>>>>> Looking forward to reading your feedback and answers.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Yacine.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <[email protected]>:
>>>>>>>> 
>>>>>>>>> Hi Yacine,
>>>>>>>>> 
>>>>>>>>> Thanks for the proposal. It sounds interesting, but I want to make
>>>> sure
>>>>>>>>> there's a clear use case for this because it's a significant change
>>>> to
>>>>>> the
>>>>>>>>> current evolution rules. Right now we guarantee that a reader will
>>>> get
>>>>>> an
>>>>>>>>> error if the data has an unknown union branch rather than getting a
>>>>>>>>> default
>>>>>>>>> value. I think that makes sense: if the reader is requesting a
>> field,
>>>>>> it
>>>>>>>>> should get the actual datum for it rather than a default because it
>>>>>>>>> doesn't
>>>>>>>>> know how to handle it.
>>>>>>>>> 
>>>>>>>>> Could you give us an example use case that requires this new logic?
>>>>>>>>> 
>>>>>>>>> I just want to make sure we can't solve your problem another way.
>> For
>>>>>>>>> example, pushing evolution lower in the schema usually does the
>>>> trick:
>>>>>>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1",
>>>>>>>>> "RecordV2"], it is usually better to update the record so that
>> older
>>>>>>>>> readers can ignore the new fields.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> rb
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> In order to provide a solution to the union schema evolution
>>>> problem,
>>>>>>>>> as it
>>>>>>>>>> was earlier notified in the thread "add a type to a union
>>>>>>>>>> <
>> http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union
>>>>>>>>>>> "
>>>>>>>>>> of the user mailing list, we decided, for the needs of the
>> reactive
>>>>>>>>>> architecture we have implemented for one of our clients, to
>>>> implement
>>>>>> an
>>>>>>>>>> evolution of the compatibility principle of Avro when using
>> Unions.
>>>>>> For
>>>>>>>>>> reminder, the asked question was about the way to handle the case
>>>>>> where
>>>>>>>>> a
>>>>>>>>>> reader, using an old version of a schema that includes a union,
>>>> reads
>>>>>>>>> some
>>>>>>>>>> data written with a new version of the schema where a type has
>> been
>>>>>>>>> added
>>>>>>>>>> to the union.
>>>>>>>>>> 
>>>>>>>>>> As answered by Martin Kleppman in that thread, one way to handle
>>>> this
>>>>>>>>> kind
>>>>>>>>>> of evolution (a new version of the schema adds a new type type in
>> a
>>>>>>>>> union)
>>>>>>>>>> would be to ensure that all the development streams have
>> integrated
>>>>>> the
>>>>>>>>> new
>>>>>>>>>> schema B before deploying it in the IT schema referential.
>>>>>>>>>> However, in big structures involving strongly uncorrelated teams
>> (in
>>>>>> the
>>>>>>>>>> product life-cycle point of view), this approach appears to be
>> quite
>>>>>>>>>> impracticable, causing production stream congestion, blocking
>>>> behavior
>>>>>>>>>> between teams, and a bunch of other
>>>>>>>>>> unwanted-counter-agile-/-reactive-phenomena...
>>>>>>>>>> 
>>>>>>>>>> Therefore, we had to implement a new *compatibility* *mode* for
>> the
>>>>>>>>> unions,
>>>>>>>>>> while taking care to comply with the following rules:
>>>>>>>>>> 
>>>>>>>>>> 1. Clear rules of compatibility are stated and integrated for this
>>>>>>>>>> compatibility mode
>>>>>>>>>> 2. The standard Avro behavior must be kept intact
>>>>>>>>>> 3. All the evolution implementation must be done without
>>>> introducing
>>>>>>>>> any
>>>>>>>>>> regression (all existing tests of the Avro stack must succeed)
>>>>>>>>>> 4. The code impact on Avro stack must be minimized
>>>>>>>>>> 
>>>>>>>>>> Just to give you a very brief overview (as I don't know if this is
>>>>>>>>> actually
>>>>>>>>>> the place for a full detailed description), the evolution
>> addresses
>>>>>> the
>>>>>>>>>> typical problem where two development streams use the same schema
>>>> but
>>>>>> in
>>>>>>>>>> different versions, in the case described shortly as follows:
>>>>>>>>>> 
>>>>>>>>>> - The first development stream, called "DevA", uses the version A
>>>> of
>>>>>>>>> a
>>>>>>>>>> schema which integrates a union referencing two types, say "null"
>>>>>> and
>>>>>>>>>> "string". The default value is set to null.
>>>>>>>>>> - The second development team, called "DevB", uses the version B,
>>>>>>>>> which
>>>>>>>>>> is an evolution of the version A, as it adds a reference to a new
>>>>>>>>> type
>>>>>>>>>> in
>>>>>>>>>> the former union, say "long" (which makes it "null", string" and
>>>>>>>>> "long")
>>>>>>>>>> - When the schema B is deployed on the schema referential (in our
>>>>>>>>> case,
>>>>>>>>>> the IO Confluent Schema Registry) subsequently to the version A
>>>>>>>>>>   - The stream "DevA" must be able to read with schema A, even if
>>>>>>>>> the
>>>>>>>>>>   data has been written using the schema B with the type "long"
>> in
>>>>>>>>>> the union.
>>>>>>>>>>   In the latter case, the read value is the union default value
>>>>>>>>>>   - The stream "DevB" must be able to read/write with schema B,
>>>>>>>>> even if
>>>>>>>>>>   it writes the data using the type "long" in the union
>>>>>>>>>> 
>>>>>>>>>> The evolution that we implemented for this mode includes some
>> rules
>>>>>> that
>>>>>>>>>> are based on the principles stated in the Avro documentation. It
>> is
>>>>>> even
>>>>>>>>>> more powerful than showed in the few lines above, as it enables
>> the
>>>>>>>>> readers
>>>>>>>>>> to get the default value of the union if the schema used for
>> reading
>>>>>>>>> does
>>>>>>>>>> not contain the type used by the writer in the union. This
>> achieves
>>>> a
>>>>>>>>> new
>>>>>>>>>> mode of forward / backward compatibility. This evolution is for
>> now
>>>>>>>>> working
>>>>>>>>>> perfectly, and should be on production in the few coming weeks. We
>>>>>> have
>>>>>>>>>> also made an evolution of the IO Confluent Schema Registry stack
>> to
>>>>>>>>> support
>>>>>>>>>> it, again in a transparent manner (we also intend to contribute to
>>>>>> this
>>>>>>>>>> stack in a second / parallel step).
>>>>>>>>>> 
>>>>>>>>>> In the objective of contributing to the Avro stack with this new
>>>>>>>>>> compatibility mode for unions, I have some questions about the
>>>>>>>>> procedure:
>>>>>>>>>> 
>>>>>>>>>> 1. How can I achieve the contribution proposal? Should I directly
>>>>>>>>>> provide a patch in JIRA and dive into the details right there?
>>>>>>>>>> 2. The base version of this evolution is 1.7.7, is it eligible to
>>>>>>>>>> contribution evaluation anyway?
>>>>>>>>>> 
>>>>>>>>>> Thanks in advance, looking forward to hearing from you and giving
>>>> you
>>>>>>>>> more
>>>>>>>>>> details.
>>>>>>>>>> 
>>>>>>>>>> Kind Regards,
>>>>>>>>>> --
>>>>>>>>>> *Yacine Benabderrahmane*
>>>>>>>>>> Architect
>>>>>>>>>> *OCTO Technology*
>>>>>>>>>> <http://www.octo.com>
>>>>>>>>>> -----------------------------------------------
>>>>>>>>>> Tel : +33 6 10 88 25 98
>>>>>>>>>> 50 avenue des Champs Elysées
>>>>>>>>>> 75008 PARIS
>>>>>>>>>> www.octo.com
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>> 
>>>>> 
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix

Re: Avro union compatibility mode enhancement proposal

Reply via email to