Thanks for the review Konstantine! I think the terminology suggestion definitely makes things clearer - I will update the documentation based on your suggestion (e.g. s/Consumer/Sink Converter/g and s/Producer/Source Converter/g).
Cheers, Almog On Wed, Aug 14, 2019 at 8:13 AM Konstantine Karantasis < konstant...@confluent.io> wrote: > Thanks Almog for preparing this KIP! > I think it will improve usability and troubleshooting with JSON data a lot. > > The finalized plan seems quite concrete now. I also liked that some > implementation specific implications (such as setting the ObjectMapper to > deserialize floating point as BigDecimal) are highlighted in the KIP. > > Still, as I was reading the KIP, the main obstacle I encountered was around > terminology. I couldn't get used to reading "producer" and "consumer" and > not thinking in terms of Kafka producers and consumers - which are not > relevant to what this KIP proposes. Thus, I'd suggest replacing > "Producer(s)" with "Source Converter(s)" and "Consumer(s)" with "Sink > Converter(s)" (even if "Converter used by Source Connector" or "Converter > used by Sink Connector" would be even more accurate - maybe this could be > an explanation in a footnote). Terminology around converters has been > tricky in the past and adding producers/consumers in the mix might add to > the confusion. > > Another example where I'd apply this different terminology would be to a > phrase such as the following: > "Because of this, users must take care to first ensure that all consumers > have upgraded to the new code before upgrading producers to make use of the > NUMERIC serialization format." > which I'd write > "Because of this, users must take care to first ensure that all sink > connectors have upgraded to the new converter code before upgrading source > connectors to make use of the NUMERIC serialization format in > JsonConverter." > > Let me know if you think this suggestion makes the KIP easier to follow. > Otherwise I think it's a solid proposal. > > I'm concluding with a couple of nits: > > - "Upgraded Producer with BASE64 serialization, Legacy Consumer: this > scenario is okay as the upgraded ~producer~ consumer will be able to read > binary as today" (again according to my suggestion above, it could be as > the upgraded source converter ...) > > - "consumers cannot consumer NUMERIC data. " -> "consumers cannot read > NUMERIC data" > > Best, > Konstantine > > On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io> wrote: > > > Good catches! Fixed :) > > > > On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <arjun.sat...@gmail.com> > > wrote: > > > > > Cool! > > > > > > Couple of nits: > > > > > > - In public interfaces, typo: *json.decimal.serialization.fromat* > > > - In public interfaces, you use the term "HEX" instead of "BASE64". > > > > > > > > > > > > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io> wrote: > > > > > > > EDIT: everywhere I've been using "HEX" I meant to be using "BASE64". > I > > > will > > > > update the KIP to reflect this. > > > > > > > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io> > wrote: > > > > > > > > > Thanks for the feedback Arjun! I'm happy changing the default > config > > to > > > > > HEX instead of BINARY, no strong feelings there. > > > > > > > > > > I'll also clarify the example in the KIP to be clearer: > > > > > > > > > > - serialize the decimal field "foo" with value "10.2345" with the > HEX > > > > > setting: {"foo": "D3J5"} > > > > > - serialize the decimal field "foo" with value "10.2345" with the > > > NUMERIC > > > > > setting: {"foo": 10.2345} > > > > > > > > > > With regards to the precision issue, that was my original concern > as > > > well > > > > > (and why I originally suggested a TEXT format). Many JSON > > deserializers > > > > > (e.g. Jackson with > > DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS), > > > > > however, have the ability to deserialize decimals correctly so I > will > > > > > configure that as the default for Connect's JsonDeserializer. It's > > > > probably > > > > > a good idea to call out that using other deserializers must be done > > > with > > > > > care - I will add that documentation to the serialization config. > > > > > > > > > > Note that there would not be an issue on the _serialization_ side > of > > > > > things as Jackson respects BigDecimal. > > > > > > > > > > Almog > > > > > > > > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish < > arjun.sat...@gmail.com > > > > > > > > wrote: > > > > > > > > > >> hey Almog, nice work! couple of thoughts (hope I'm not late since > > you > > > > >> started the voting thread already): > > > > >> > > > > >> can you please add some examples to show the changes that you are > > > > >> proposing. makes me think that for a given decimal number, we will > > > have > > > > >> two > > > > >> encodings: “asHex” and “asNumber”. > > > > >> > > > > >> should we call the default config value “HEX” instead of “BINARY”? > > > > >> > > > > >> Should we call out the fact that JS systems might be susceptible > to > > > > double > > > > >> precision round offs with the new numeric format? here are some > > people > > > > >> discussing a similar problem > > > > >> https://github.com/EventStore/EventStore/issues/1541 > > > > >> > > > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io> > > > wrote: > > > > >> > > > > >> > Hello Everyone, > > > > >> > > > > > >> > Summarizing an in-person discussion with Randall (this is copied > > > from > > > > >> the > > > > >> > KIP): > > > > >> > > > > > >> > The original KIP suggested supporting an additional > > representation - > > > > >> base10 > > > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues > > > because > > > > >> it > > > > >> > is impossible to disambiguate between TEXT and BINARY without an > > > > >> additional > > > > >> > config - furthermore, this makes the migration from one to the > > other > > > > >> nearly > > > > >> > impossible because it would require that all consumers stop > > > consuming > > > > >> and > > > > >> > producers stop producing and atomically updating the config on > all > > > of > > > > >> them > > > > >> > after deploying the new code, or waiting for the full retention > > > period > > > > >> to > > > > >> > pass - neither option is viable. The suggestion in the KIP is > > > strictly > > > > >> an > > > > >> > improvement over the existing behavior, even if it doesn't > support > > > all > > > > >> > combinations. > > > > >> > > > > > >> > It seems that since most real-world use cases actually use the > > > numeric > > > > >> > representation (not string) we can consider this an improvement. > > > With > > > > >> the > > > > >> > new suggestion, we don't need a deserialization configuration > > (only > > > a > > > > >> > serialization option) and the consumers will be able to always > > > > >> > automatically determine the serialization format. > > > > >> > > > > > >> > Based on this, I'll be opening up the simplified version of the > > KIP > > > > to a > > > > >> > vote. > > > > >> > > > > > >> > Almog > > > > >> > > > > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io > > > > > > wrote: > > > > >> > > > > > >> > > I'm mostly happy with your current suggestion (two configs, > one > > > for > > > > >> > > serialization and one for deserialization) and your > > implementation > > > > >> > > suggestion. One thing to note: > > > > >> > > > > > > >> > > > We should _always_ be able to deserialize a standard JSON > > > > >> > > > number as a decimal > > > > >> > > > > > > >> > > I was doing some research into decimals and JSON, and I can > > > imagine > > > > a > > > > >> > > compelling reason to require string representations to avoid > > > losing > > > > >> > > precision and to be certain that whomever is sending the data > > > isn't > > > > >> > losing > > > > >> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040 > ). > > > > >> > > > > > > >> > > I'm okay with always allowing numerics, but thought it's worth > > > > raising > > > > >> > the > > > > >> > > thought. > > > > >> > > > > > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates < > a...@confluent.io> > > > > >> wrote: > > > > >> > > > > > > >> > >> The way I see it, we need to control two seperate things: > > > > >> > >> > > > > >> > >> 1. How do we _deserialize_ a decimal type if we encounter a > > text > > > > >> node in > > > > >> > >> the JSON? (We should _always_ be able to deserialize a > > > standard > > > > >> JSON > > > > >> > >> number as a decimal). > > > > >> > >> 2. How do we chose how we want decimals to be _serialized_. > > > > >> > >> > > > > >> > >> This looks to fits well with your second suggestion of > slightly > > > > >> > different > > > > >> > >> configs names for serialization vs deserialization. > > > > >> > >> a, For deserialization we only care about how to handle text > > > > nodes: ` > > > > >> > >> deserialization.decimal.*text*.format`, which should only > have > > > two > > > > >> valid > > > > >> > >> values BINARY | TEXT. > > > > >> > >> b. For serialization we need all three: > > > > >> `serialization.decimal.format`, > > > > >> > >> which should support all three options: BINARY | TEXT | > > NUMERIC. > > > > >> > >> > > > > >> > >> Implementation wise, I think these should be two separate > > enums, > > > > >> rather > > > > >> > >> than one shared enum and throwing an error if the > deserializer > > is > > > > >> set to > > > > >> > >> NUMERIC. Mainly as this means the enums reflect the options > > > > >> available, > > > > >> > >> rather than this being hidden in config checking code. But > > > that's > > > > a > > > > >> > minor > > > > >> > >> implementation detail. > > > > >> > >> > > > > >> > >> Personally, I'd be tempted to have the BINARY value named > > > something > > > > >> like > > > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to > > move > > > > >> away > > > > >> > >> from > > > > >> > >> it. > > > > >> > >> > > > > >> > >> It's a real shame that both of these settings require a > default > > > of > > > > >> > BINARY > > > > >> > >> for backwards compatibility, but I agree that discussions / > > plans > > > > >> around > > > > >> > >> switching the defaults should not block this KIP. > > > > >> > >> > > > > >> > >> Andy > > > > >> > >> > > > > >> > >> > > > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra < > al...@confluent.io> > > > > >> wrote: > > > > >> > >> > > > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)! > > > > >> > >> > > > > > >> > >> > > Is the text decimal a base16 encoded number, or is it > > base16 > > > > >> encoded > > > > >> > >> > binary > > > > >> > >> > > form of the number? > > > > >> > >> > > > > > >> > >> > The conversion happens as > > decimal.unscaledValue().toByteArray() > > > > and > > > > >> > then > > > > >> > >> > the byte array is converted to a hex string, so it's > > definitely > > > > the > > > > >> > >> binary > > > > >> > >> > form of the number converted to base16. Whether or not > that's > > > the > > > > >> same > > > > >> > >> as > > > > >> > >> > the base16 encoded number is a good question (toByteArray > > > > returns a > > > > >> > byte > > > > >> > >> > array containing a signed, big-endian, two's complement > > > > >> representation > > > > >> > >> of > > > > >> > >> > the big integer). > > > > >> > >> > > > > > >> > >> > > One suggestion I have is to change the proposed new > config > > to > > > > >> only > > > > >> > >> affect > > > > >> > >> > > decimals stored as text, i.e. to switch between the > current > > > > >> base16 > > > > >> > and > > > > >> > >> > the > > > > >> > >> > > more common base10. Then add another config to the > > > serializer > > > > >> only > > > > >> > >> that > > > > >> > >> > > controls if decimals should be serialized as text or > > numeric. > > > > >> > >> > > > > > >> > >> > I think we need to be able to handle all mappings from > > > > >> serialization > > > > >> > >> format > > > > >> > >> > to deserialization format (e.g. read in BINARY and output > > > TEXT), > > > > >> > which I > > > > >> > >> > think would be impossible with the alternative suggestion. > I > > > > agree > > > > >> > that > > > > >> > >> > automatically deserializing numerics is valuable. I see two > > > other > > > > >> ways > > > > >> > >> to > > > > >> > >> > get this, both keeping the serialization.format config the > > > same: > > > > >> > >> > > > > > >> > >> > - have json.decimal.deserialization.format accept all three > > > > >> formats. > > > > >> > if > > > > >> > >> set > > > > >> > >> > to BINARY/TEXT, numerics would be automatically supported. > If > > > set > > > > >> to > > > > >> > >> > NUMERIC, then any string coming in would result in > > > > deserialization > > > > >> > error > > > > >> > >> > (defaults to BINARY for backwards compatibility) > > > > >> > >> > - change json.decimal.deserialization.format to > > > > >> > >> > json.decimal.deserialization.string.format which accepts > only > > > > >> > >> BINARY/TEXT > > > > >> > >> > (defaults to BINARY for backwards compatibility) > > > > >> > >> > > > > > >> > >> > > would be a breaking change in that things that previously > > > > failed > > > > >> > would > > > > >> > >> > > suddenly start deserializing. This is a price I'm > willing > > to > > > > >> pay. > > > > >> > >> > > > > > >> > >> > I agree. I'm willing to pay this price too. > > > > >> > >> > > > > > >> > >> > > IMHO, we should then plan to switch the default of > decimal > > > > >> > >> serialization > > > > >> > >> > to > > > > >> > >> > > numeric, and text serialization to base 10 in the next > > major > > > > >> > release. > > > > >> > >> > > > > > >> > >> > I think that can be a separate discussion, I don't want to > > > block > > > > >> this > > > > >> > >> KIP > > > > >> > >> > on it. > > > > >> > >> > > > > > >> > >> > Thoughts? > > > > >> > >> > > > > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto < > > > o...@wikimedia.org> > > > > >> > wrote: > > > > >> > >> > > > > > >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I > use > > > > >> > >> JSONSchemas to > > > > >> > >> > > indicate whether a JSON number should be deserialized as > an > > > > >> integer > > > > >> > >> or a > > > > >> > >> > > decimal > > > > >> > >> > > < > > > > >> > >> > > > > > > >> > >> > > > > > >> > >> > > > > >> > > > > > >> > > > > > > > > > > https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261 > > > > >> > >> > > >. > > > > >> > >> > > Not everyone is going to have JSONSchemas available when > > > > >> converting, > > > > >> > >> but > > > > >> > >> > if > > > > >> > >> > > you do, it is an easy way to support JSON numbers as > > > decimals. > > > > >> > >> > > > > > > >> > >> > > Carry on! :) > > > > >> > >> > > > > > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates < > > > a...@confluent.io > > > > > > > > > >> > >> wrote: > > > > >> > >> > > > > > > >> > >> > > > Hi Almog, > > > > >> > >> > > > > > > > >> > >> > > > Like the KIP - I think being able to support decimals > in > > > JSON > > > > >> in > > > > >> > the > > > > >> > >> > same > > > > >> > >> > > > way most other systems do is a great improvement. > > > > >> > >> > > > > > > > >> > >> > > > It's not 100% clear to me from the KIP what the current > > > > format > > > > >> is. > > > > >> > >> Is > > > > >> > >> > > the > > > > >> > >> > > > text decimal a base16 encoded number, or is it base16 > > > encoded > > > > >> > binary > > > > >> > >> > form > > > > >> > >> > > > of the number? (I've not tried to get my head around if > > > these > > > > >> two > > > > >> > >> are > > > > >> > >> > > even > > > > >> > >> > > > different!) > > > > >> > >> > > > > > > > >> > >> > > > One suggestion I have is to change the proposed new > > config > > > to > > > > >> only > > > > >> > >> > affect > > > > >> > >> > > > decimals stored as text, i.e. to switch between the > > current > > > > >> base16 > > > > >> > >> and > > > > >> > >> > > the > > > > >> > >> > > > more common base10. Then add another config to the > > > > serialzier > > > > >> > only > > > > >> > >> > that > > > > >> > >> > > > controls if decimals should be serialized as text or > > > numeric. > > > > >> The > > > > >> > >> > > benefit > > > > >> > >> > > > of this approach is it allows us to enhance the > > > deserializer > > > > to > > > > >> > >> > > > automatically handle numeric decimals even without any > > > config > > > > >> > >> having to > > > > >> > >> > > be > > > > >> > >> > > > set, i.e. default config in the deserializer would be > > able > > > to > > > > >> > handle > > > > >> > >> > > > numeric decimals. Of course, this is a two edged > sword: > > > this > > > > >> > would > > > > >> > >> > make > > > > >> > >> > > > the deserializer work out of the box with numeric > > decimals, > > > > >> > (yay!), > > > > >> > >> but > > > > >> > >> > > > would be a breaking change in that things that > previously > > > > >> failed > > > > >> > >> would > > > > >> > >> > > > suddenly start deserializing. This is a price I'm > > willing > > > to > > > > >> pay. > > > > >> > >> > > > > > > > >> > >> > > > IMHO, we should then plan to switch the default of > > decimal > > > > >> > >> > serialization > > > > >> > >> > > to > > > > >> > >> > > > numeric, and text serialization to base 10 in the next > > > major > > > > >> > >> release. > > > > >> > >> > > > (With upgrade notes to match). Though I know this is > more > > > > >> > >> contentious, > > > > >> > >> > I > > > > >> > >> > > > think it moves us forward in a much more standard way > > that > > > > the > > > > >> > >> current > > > > >> > >> > > > encoding of decimals. > > > > >> > >> > > > > > > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra < > > > > al...@confluent.io> > > > > >> > >> wrote: > > > > >> > >> > > > > > > > >> > >> > > > > Hi Everyone! > > > > >> > >> > > > > > > > > >> > >> > > > > Kicking off discussion for a new KIP: > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > >> > >> > > > > > > >> > >> > > > > > >> > >> > > > > >> > > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON > > > > >> > >> > > > > > > > > >> > >> > > > > For those who are interested, I have a prototype > > > > >> implementation > > > > >> > >> that > > > > >> > >> > > > helped > > > > >> > >> > > > > guide my design: > > https://github.com/agavra/kafka/pull/1 > > > > >> > >> > > > > > > > > >> > >> > > > > Cheers, > > > > >> > >> > > > > Almog > > > > >> > >> > > > > > > > > >> > >> > > > > > > > >> > >> > > > > > > >> > >> > > > > > >> > >> > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > > > > > >