+1 , it would resolve AVRO-1340 as well! --z
> On Apr 16, 2016, at 10:26 PM, Ryan Blue <[email protected]> wrote: > > +1 > > On Sat, Apr 16, 2016 at 5:54 PM, Matthieu Monsch <[email protected]> > wrote: > >>> I think it may make sense for >>> enums to add a special value for it. >> >> That would work great. To give slightly more flexibility to users, we >> could even allow the reader’s schema to explicitly specify which symbol to >> use when it reads an unknown symbol. If not specified, resolution would >> fail (consistent with the current behavior). >> >> For example (assuming the corresponding key is called “onUnknown”): >> >>> { >>> “type”: “enum”, >>> “name”: “Platform”, >>> “symbols”: [“UNSUPPORTED”, “ANDROID”, “IOS”], >>> “onUnknown”: “UNSUPPORTED" >>> } >> >> This `enum` would then be able to resolve schemas with extra symbols (read >> as `UNSUPPORTED`). >> >> >> >>>> On Apr 16, 2016, at 1:36 PM, Ryan Blue <[email protected]> >>> wrote: >>> >>> Matthieu, how would that work with enums? I think it may make sense for >>> enums to add a special value for it. >>> >>> rb >>> >>> On Sat, Apr 16, 2016 at 1:23 PM, Matthieu Monsch <[email protected]> >>> wrote: >>> >>>>> I think that substituting different data for >>>>> unrecognized branches in a union isn't the way to fix the problem, but >> I >>>>> have a proposal for a way to fix it that looks more like you'd expect >> in >>>>> the OO example above: by adding the Vehicle class to your read schema. >>>>> >>>>> Right now, unions are first resolved by full class name as required by >>>> the >>>>> spec. But after that, we have some additional rules to match schemas. >>>> These >>>>> rules are how we reconcile name differences from situations like >> writing >>>>> with a generic class and reading with a specific class. I'm proposing >> you >>>>> use a catch-all class (the superclass) with fields that are in all of >> the >>>>> union's branches, and we update schema resolution to allow it. >>>> >>>> >>>> That sounds good. The only thing I’d add is to make this catch-all >>>> behavior explicit in the schema (similar to how field defaults must be >>>> explicitly added). >>>> To help fix another common writer evolution issue, we could also add a >>>> similar catch-all for `enum`s (optional, to be explicitly specified in >> the >>>> schema). >>>> -Matthieu >>>> >>>> >>>> >>>>>> On Apr 12, 2016, at 2:41 PM, Ryan Blue <[email protected]> >>>>> wrote: >>>>> >>>>> Yacine, >>>>> >>>>> Thanks for the extra information. Sorry for my delay in replying, I >>>> wanted >>>>> to time think about your suggestion. >>>>> >>>>> I see what you mean that you can think of a union as the superclass of >>>> its >>>>> options. The reflect object model has test code that does just that, >>>> where >>>>> classes B and C inherit from A and the schema for A is created as a >> union >>>>> of B and C. But, I don't think that your suggestion aligns with the >>>>> expectations of object oriented design. Maybe that's an easier way to >>>>> present my concern: >>>>> >>>>> Say I have a class, Vehicle, with subclasses Car and Truck. I have >>>>> applications that work with my dataset, the vehicles that my company >>>> owns, >>>>> and we buy a bus. If I add a Bus class, what normally happens is that >> an >>>>> older application can work with it. A maintenance tracker would call >>>>> getVehicles and can still get the lastMaintenanceDate for my Bus, even >>>>> though it doesn't know about busses. But what you suggest is that it is >>>>> replaced with a default, say null, in cases like this. >>>>> >>>>> I think that the problem is that Avro no equivalent concept of >>>> inheritance. >>>>> There's only one way to represent it for what you need right now, like >>>>> Matthieu suggested. I think that substituting different data for >>>>> unrecognized branches in a union isn't the way to fix the problem, but >> I >>>>> have a proposal for a way to fix it that looks more like you'd expect >> in >>>>> the OO example above: by adding the Vehicle class to your read schema. >>>>> >>>>> Right now, unions are first resolved by full class name as required by >>>> the >>>>> spec. But after that, we have some additional rules to match schemas. >>>> These >>>>> rules are how we reconcile name differences from situations like >> writing >>>>> with a generic class and reading with a specific class. I'm proposing >> you >>>>> use a catch-all class (the superclass) with fields that are in all of >> the >>>>> union's branches, and we update schema resolution to allow it. So you'd >>>>> have... >>>>> >>>>> record Vehicle { >>>>> long lastMaintenanceDate; >>>>> } >>>>> >>>>> record Car { >>>>> long lastMaintenanceDate; >>>>> int maxChildSeats; >>>>> } >>>>> >>>>> record Truck { >>>>> long lastMainetanceDate; >>>>> long lastWashedDate; >>>>> } >>>>> >>>>> Write schema: [null, Car, Truck] Read schema: [null, Car, Vehicle]. The >>>>> catch-all class could provide the behavior you want, without silently >>>>> giving you incorrect data. >>>>> >>>>> Does that sound like a reasonable solution for your use case? >>>>> >>>>> rb >>>>> >>>>> On Tue, Mar 29, 2016 at 10:03 PM, Matthieu Monsch <[email protected] >>> >>>>> wrote: >>>>> >>>>>> Hi Yacine, >>>>>> >>>>>> I believe Ryan was mentioning that if you start from a schema >> `[“null”, >>>>>> “Car”]` then rather than add a new bus branch to the union, you could >>>>>> update the car’s schema to include new bus fields. For example (using >>>> IDL >>>>>> notation): >>>>>> >>>>>>> // Initial approach >>>>>>> // Evolved to union { null, Car, Bus } >>>>>>> record Car { >>>>>>> string vin; >>>>>>> } >>>>>>> record Bus { >>>>>>> string vin; >>>>>>> int capacity; >>>>>>> } >>>>>>> >>>>>>> // Alternative approach >>>>>>> // Here evolution wouldn't change the union { null, Vehicle } >>>>>>> // Note also that this is actually a compatible evolution from the >>>> first >>>>>> approach, even under current rules (using aliases for name changes). >>>>>>> record Vehicle { >>>>>>> string vin; // (Would be optional if not all branches had it.) >>>>>>> union { null, int } capacity; // The new field is added here. >>>>>>> } >>>>>> >>>>>> Pushing this further, you could also directly start upfront with just >> a >>>>>> “Vehicle” record schema with an optional `car` field and add new >>>> optional >>>>>> fields as necessary (for example a `bus` field). This gives you much >>>>>> greater flexibility for evolving readers and writers separately (but >> you >>>>>> lose the ability to enforce that at most one field is ever present in >>>> the >>>>>> record). >>>>>> >>>>>> I agree that Avro’s current evolution rules are more geared towards >>>> reader >>>>>> evolution: evolving writers while keeping readers constant is >> difficult >>>>>> (namely because they don't support adding branches to unions or >> symbols >>>> to >>>>>> enums). However, adding a completely new and separate "compatibility >>>> mode” >>>>>> has a high complexity cost; both for implementors (the separate >> classes >>>> you >>>>>> mention) and users (who must invoke them specifically). It would be >>>> best to >>>>>> keep evolution rules universal. >>>>>> >>>>>> Maybe we could extend the logic behind field defaults? Enum schemas >>>> could >>>>>> simply contain a default symbol attribute. For unions, it’s a bit >>>> trickier, >>>>>> but we should be able to get around with a default branch attribute >>>> which >>>>>> would allow evolution if and only if the newly added branches can be >>>> read >>>>>> as the default branch. Assuming the attribute names don’t already >> exist, >>>>>> this would be compatible with the current behavior. >>>>>> >>>>>> I think this use-case is common enough that it’d be worth exploring >> (for >>>>>> example, Rest.li [1] adds an `$UNKNOWN` symbol to somewhat address >> this >>>>>> issue; though not completely unfortunately [2]). >>>>>> >>>>>> -Matthieu >>>>>> >>>>>> [1] https://github.com/linkedin/rest.li >>>>>> [2] >> https://github.com/linkedin/rest.li/wiki/Snapshots-and-Resource-Compatibility-Checking#why-is-adding-to-an-enum-considered-backwards-incompatible >>>>>> >>>>>> >>>>>> >>>>>>>> On Mar 29, 2016, at 2:40 AM, Yacine Benabderrahmane < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>> Hi Ryan, >>>>>>> >>>>>>> Just a little up^^. Could you please (or anyone else) give me a >> little >>>>>>> feedback ? >>>>>>> Thanks in advance. >>>>>>> >>>>>>> Regards, >>>>>>> Yacine. >>>>>>> >>>>>>> 2016-03-21 17:36 GMT+01:00 Yacine Benabderrahmane < >>>>>> [email protected] >>>>>>>> : >>>>>>> >>>>>>>> Hi Ryan, >>>>>>>> >>>>>>>> Thank you for giving feedback. I will try in the following to >> provide >>>>>> you >>>>>>>> some more details about the addressed problem. >>>>>>>> >>>>>>>> But before that, just a brief reminder of the context. Avro has been >>>>>>>> chosen in this project (and by many other ones for sure) especially >>>> for >>>>>> a >>>>>>>> very important feature: enabling forward and backward compatibility >>>>>>>> management through schema life-cycle. Our development model involves >>>>>>>> intensive usage of this feature, and many heavy developments are >> made >>>> in >>>>>>>> parallel streams inside feature teams that share the same schema, >>>>>> provided >>>>>>>> the evolution of the latter complies with the stated compatibility >>>>>> rules. >>>>>>>> This implies that all the entities supported by the Avro schema must >>>>>>>> support the two-way compatibility, including unions. However, in the >>>>>>>> special case of the union, this two-way compatibility is not well >>>>>> supported >>>>>>>> by the current rules. Let me explain you the basement of our point >> of >>>>>> view, >>>>>>>> it remains quite simple. >>>>>>>> >>>>>>>> The use case is to have, for example, >>>>>>>> - a first union version A: >>>>>>>> { "name": "Vehicle", >>>>>>>> "type": ["null", "Car"] } >>>>>>>> - a new version of it B: >>>>>>>> { "name": "Vehicle", >>>>>>>> "type": ["null", "Car", "Bus"] } >>>>>>>> For being forward compatible, an evolution of the union schema must >>>>>>>> guarantee that an old reader reading with A can read the data >> written >>>>>> with >>>>>>>> the new schema B. Getting an error just means that the forward >>>>>>>> compatibility feature is broken. But this is not actually the case >>>> (and >>>>>>>> this behavior is not suitable), because the old reader has a correct >>>>>> schema >>>>>>>> and this schema has evolved naturally to version B to incorporate a >>>> new >>>>>>>> Vehicle type. Not knowing this new type must not produce an error, >> but >>>>>> just >>>>>>>> give the reader a default value, which means: "Either the data is >> not >>>>>> there >>>>>>>> or you do not know how to handle it". >>>>>>>> >>>>>>>> This is thought while keeping in mind that in an object-oriented >> code >>>>>>>> modeling, a union field is seen as a class member with the higher >>>> level >>>>>>>> generic type ("Any" (scala) or "Object" (java5+)...). Therefore, it >> is >>>>>>>> natural for a modeler / programmer to implement the ability of not >>>>>> getting >>>>>>>> the awaited types and using some default value of known type. To >> give >>>> a >>>>>>>> more complete specification, the new mode of compatibility has to >>>> impose >>>>>>>> one rule: the union default value must not change through versions >> and >>>>>> the >>>>>>>> corresponding type must be placed at the top of the types list. This >>>> is >>>>>>>> much easier to handle by development streams, because it is >> addressed >>>>>> once >>>>>>>> for all in the very beginning of the schema life-cycle, than the >> fact >>>> to >>>>>>>> oblige a number of teams, among which some are just not in place >>>>>> anymore, >>>>>>>> to update the whole code just because another dev team has deployed >> a >>>>>> new >>>>>>>> version of the union in the schema. >>>>>>>> >>>>>>>> Now, for being backward compatible, the reader with B must always be >>>>>> able >>>>>>>> to read data written with schema A. Even if the type included in the >>>>>> data >>>>>>>> is not known, so it gets the default value and not an error. >>>>>>>> >>>>>>>> I understand that getting an error could make sense when the >> requested >>>>>>>> field is not present. However, this behavior: >>>>>>>> >>>>>>>> - is very restrictive, meaning: this obliges the old reader to >> update >>>>>>>> its code for integrating the new schema, while he is not managing to >>>>>> do it >>>>>>>> for many reasons: development stream of next delivery is not >>>>>> finished, or >>>>>>>> not engaged, or not even planned - in the case of old and stable >> code >>>>>>>> - breaks the forward compatibility feature: the older reader is not >>>>>>>> able to read the new version of the union without getting an error >>>>>>>> - breaks the backward compatibility feature: the new reader is not >>>>>>>> able to read an old version containing unknown types of the union >>>>>> without >>>>>>>> getting an error >>>>>>>> >>>>>>>> By the way, what do you exactly mean by "pushing evolution lower" >> and >>>>>>>> "update the record"? Could you please give me an example of the >> trick >>>>>> you >>>>>>>> are talking about? >>>>>>>> >>>>>>>> Just to be a bit more precise, we are not targeting to use a >> "trick". >>>>>> This >>>>>>>> life-cycle management should be included in a standard so to keep >> the >>>>>>>> software development clean, production safe and compliant with a >>>> complex >>>>>>>> product road-map. >>>>>>>> >>>>>>>> Finally, you seem to be concerned by the "significant change to the >>>>>>>> current evolution rules". Well, we actually do not change these >> rules, >>>>>> they >>>>>>>> keep just the same. All we are proposing is to introduce a *mode* >>>> where >>>>>>>> the rules of union compatibility change. This mode is materialized >> by >>>> a >>>>>>>> minimum and thin impact of the existing classes without any change >> in >>>>>> the >>>>>>>> behavior, all the logic of the new compatibility mode is implemented >>>> by >>>>>> new >>>>>>>> classes that must be invoked specifically. But you would better see >> it >>>>>> in >>>>>>>> the code patch. >>>>>>>> >>>>>>>> Looking forward to reading your feedback and answers. >>>>>>>> >>>>>>>> Regards, >>>>>>>> Yacine. >>>>>>>> >>>>>>>> >>>>>>>> 2016-03-17 19:00 GMT+01:00 Ryan Blue <[email protected]>: >>>>>>>> >>>>>>>>> Hi Yacine, >>>>>>>>> >>>>>>>>> Thanks for the proposal. It sounds interesting, but I want to make >>>> sure >>>>>>>>> there's a clear use case for this because it's a significant change >>>> to >>>>>> the >>>>>>>>> current evolution rules. Right now we guarantee that a reader will >>>> get >>>>>> an >>>>>>>>> error if the data has an unknown union branch rather than getting a >>>>>>>>> default >>>>>>>>> value. I think that makes sense: if the reader is requesting a >> field, >>>>>> it >>>>>>>>> should get the actual datum for it rather than a default because it >>>>>>>>> doesn't >>>>>>>>> know how to handle it. >>>>>>>>> >>>>>>>>> Could you give us an example use case that requires this new logic? >>>>>>>>> >>>>>>>>> I just want to make sure we can't solve your problem another way. >> For >>>>>>>>> example, pushing evolution lower in the schema usually does the >>>> trick: >>>>>>>>> rather than having ["null", "RecordV1"] => ["null", "RecordV1", >>>>>>>>> "RecordV2"], it is usually better to update the record so that >> older >>>>>>>>> readers can ignore the new fields. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> rb >>>>>>>>> >>>>>>>>> On Mon, Mar 14, 2016 at 7:30 AM, Yacine Benabderrahmane < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> In order to provide a solution to the union schema evolution >>>> problem, >>>>>>>>> as it >>>>>>>>>> was earlier notified in the thread "add a type to a union >>>>>>>>>> < >> http://search-hadoop.com/m/F2svI1IXrQS1bIFgU1/union+evolution&subj=add+a+type+to+a+union >>>>>>>>>>> " >>>>>>>>>> of the user mailing list, we decided, for the needs of the >> reactive >>>>>>>>>> architecture we have implemented for one of our clients, to >>>> implement >>>>>> an >>>>>>>>>> evolution of the compatibility principle of Avro when using >> Unions. >>>>>> For >>>>>>>>>> reminder, the asked question was about the way to handle the case >>>>>> where >>>>>>>>> a >>>>>>>>>> reader, using an old version of a schema that includes a union, >>>> reads >>>>>>>>> some >>>>>>>>>> data written with a new version of the schema where a type has >> been >>>>>>>>> added >>>>>>>>>> to the union. >>>>>>>>>> >>>>>>>>>> As answered by Martin Kleppman in that thread, one way to handle >>>> this >>>>>>>>> kind >>>>>>>>>> of evolution (a new version of the schema adds a new type type in >> a >>>>>>>>> union) >>>>>>>>>> would be to ensure that all the development streams have >> integrated >>>>>> the >>>>>>>>> new >>>>>>>>>> schema B before deploying it in the IT schema referential. >>>>>>>>>> However, in big structures involving strongly uncorrelated teams >> (in >>>>>> the >>>>>>>>>> product life-cycle point of view), this approach appears to be >> quite >>>>>>>>>> impracticable, causing production stream congestion, blocking >>>> behavior >>>>>>>>>> between teams, and a bunch of other >>>>>>>>>> unwanted-counter-agile-/-reactive-phenomena... >>>>>>>>>> >>>>>>>>>> Therefore, we had to implement a new *compatibility* *mode* for >> the >>>>>>>>> unions, >>>>>>>>>> while taking care to comply with the following rules: >>>>>>>>>> >>>>>>>>>> 1. Clear rules of compatibility are stated and integrated for this >>>>>>>>>> compatibility mode >>>>>>>>>> 2. The standard Avro behavior must be kept intact >>>>>>>>>> 3. All the evolution implementation must be done without >>>> introducing >>>>>>>>> any >>>>>>>>>> regression (all existing tests of the Avro stack must succeed) >>>>>>>>>> 4. The code impact on Avro stack must be minimized >>>>>>>>>> >>>>>>>>>> Just to give you a very brief overview (as I don't know if this is >>>>>>>>> actually >>>>>>>>>> the place for a full detailed description), the evolution >> addresses >>>>>> the >>>>>>>>>> typical problem where two development streams use the same schema >>>> but >>>>>> in >>>>>>>>>> different versions, in the case described shortly as follows: >>>>>>>>>> >>>>>>>>>> - The first development stream, called "DevA", uses the version A >>>> of >>>>>>>>> a >>>>>>>>>> schema which integrates a union referencing two types, say "null" >>>>>> and >>>>>>>>>> "string". The default value is set to null. >>>>>>>>>> - The second development team, called "DevB", uses the version B, >>>>>>>>> which >>>>>>>>>> is an evolution of the version A, as it adds a reference to a new >>>>>>>>> type >>>>>>>>>> in >>>>>>>>>> the former union, say "long" (which makes it "null", string" and >>>>>>>>> "long") >>>>>>>>>> - When the schema B is deployed on the schema referential (in our >>>>>>>>> case, >>>>>>>>>> the IO Confluent Schema Registry) subsequently to the version A >>>>>>>>>> - The stream "DevA" must be able to read with schema A, even if >>>>>>>>> the >>>>>>>>>> data has been written using the schema B with the type "long" >> in >>>>>>>>>> the union. >>>>>>>>>> In the latter case, the read value is the union default value >>>>>>>>>> - The stream "DevB" must be able to read/write with schema B, >>>>>>>>> even if >>>>>>>>>> it writes the data using the type "long" in the union >>>>>>>>>> >>>>>>>>>> The evolution that we implemented for this mode includes some >> rules >>>>>> that >>>>>>>>>> are based on the principles stated in the Avro documentation. It >> is >>>>>> even >>>>>>>>>> more powerful than showed in the few lines above, as it enables >> the >>>>>>>>> readers >>>>>>>>>> to get the default value of the union if the schema used for >> reading >>>>>>>>> does >>>>>>>>>> not contain the type used by the writer in the union. This >> achieves >>>> a >>>>>>>>> new >>>>>>>>>> mode of forward / backward compatibility. This evolution is for >> now >>>>>>>>> working >>>>>>>>>> perfectly, and should be on production in the few coming weeks. We >>>>>> have >>>>>>>>>> also made an evolution of the IO Confluent Schema Registry stack >> to >>>>>>>>> support >>>>>>>>>> it, again in a transparent manner (we also intend to contribute to >>>>>> this >>>>>>>>>> stack in a second / parallel step). >>>>>>>>>> >>>>>>>>>> In the objective of contributing to the Avro stack with this new >>>>>>>>>> compatibility mode for unions, I have some questions about the >>>>>>>>> procedure: >>>>>>>>>> >>>>>>>>>> 1. How can I achieve the contribution proposal? Should I directly >>>>>>>>>> provide a patch in JIRA and dive into the details right there? >>>>>>>>>> 2. The base version of this evolution is 1.7.7, is it eligible to >>>>>>>>>> contribution evaluation anyway? >>>>>>>>>> >>>>>>>>>> Thanks in advance, looking forward to hearing from you and giving >>>> you >>>>>>>>> more >>>>>>>>>> details. >>>>>>>>>> >>>>>>>>>> Kind Regards, >>>>>>>>>> -- >>>>>>>>>> *Yacine Benabderrahmane* >>>>>>>>>> Architect >>>>>>>>>> *OCTO Technology* >>>>>>>>>> <http://www.octo.com> >>>>>>>>>> ----------------------------------------------- >>>>>>>>>> Tel : +33 6 10 88 25 98 >>>>>>>>>> 50 avenue des Champs Elysées >>>>>>>>>> 75008 PARIS >>>>>>>>>> www.octo.com >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Ryan Blue >>>>>>>>> Software Engineer >>>>>>>>> Netflix >>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Software Engineer >>>>> Netflix >>> >>> >>> -- >>> Ryan Blue >>> Software Engineer >>> Netflix > > > -- > Ryan Blue > Software Engineer > Netflix
