On Tue, Jun 16, 2020, at 07:42, Tom Bentley wrote: > Hi Colin, > > Thanks for taking a look at it. Replies inline... > > On Mon, Jun 15, 2020 at 6:22 PM Colin McCabe <cmcc...@apache.org> wrote: > > > Hi Tom, > > > > It's an interesting idea. Obviously protocol buffers does this for all > > numeric fields. > > > > I have to admit I have some mixed feelings, since this is another thing > > that makes encoding more complex. And it's not a clear win in all cases, > > although it is in some. > > > > Yeah, both these points are valid. Obviously I chose MetadataResponse > because the benefits are more apparent, but there are plenty of RPCs where > it wouldn't make so much difference. I provided the benchmark numbers to > try to make it a little easier, but of course they only illuminate the > MetadataResponse case. It wouldn't be so very much effort to generate > benchmarks for each RPC (driven off a JSON representation of a "typical" > message) which would make it easier to judge on a case-by-case basis (both > this and other protocol changes). > > I assume that the performance numbers here are for the old Struct-based > > encoding / decoding system. Do we know what the numbers are when using the > > generated read and write functions? > > > > I've added them to the proposal. > > > > I don't think it makes sense for type to be version-dependent. Type > > ultimately translates into what Java type we should use to represent the > > field when it's in POJO form. We can't have two types there. > > > > I've removed that section from the proposal. > > > > Making encoding version-dependent is reasonable. I do sort of wonder if > > we should just have something like "packedVersions" : "9+" rather than the > > "encoding" thing, though. The latter is more conceptually elegant but it > > seems like it would be a pain to use. For example, for a new field you > > would have to type "encoding": { "0+" : "packed" } which is kind of ugly. > > > > Yeah, I wondered about having this on the message level. >
Hi Tom, To be clear, I wasn't proposing having it on the message level, but on the level of individual fields, similar to nullableVersions. Although there may be some advantages to making it a per-message field, like flexibleVersions.... > > Having a message-level "packedVersions" wouldn't help with evolving types > from 32 to 64 bits. This is something I briefly alluded to in the initial > version of the KIP, but I've now elaborated on it. TBH I'm still not sure > how useful that would be (the review process seems to do a good job of > considering field sizes, so perhaps there are no fields where, in > hindsight, using a wider field would have been beneficial). So I've added > this for now, and am including the number of bits in the encoding name, but > I can easily drop this if it's irrelevant. > > To alleviate the burden of using encoding for new fields I've allowed > "encoding" to support string values as well as JSON objects, e.g. > "encoding": "packed", as a shortcut when the same encoding was used for all > versions. > I have mixed feelings about this, since it makes the JSON parsing a bit more complex. I guess it's probably worth it, given the gain in expressiveness. It's not hard to support from Jackson but consuming the JSON from other places might get harder. > > In any case, I'd be happy to implement support for this via a message-level > "packedVersions" is that's what the consensus of this discussion is. > > > > Another thing here is that if we are going to go all this way for > > optimization, we should certainly give people the choice of whether to use > > zigzag encoding or not. For fields that can never be negative, zigzag > > encoding is a waste. So you would then have the option of unpacked, signed > > packed, and unsigned packed. > > > > That is what's proposed in the KIP, though maybe I explained it poorly. > > > > Finally, the Kafka protocol has a lot of fields which can never be > > negative, except for some special cases where they are -1. But no other > > negative numbers are allowed. So we should consider making the "unsigned > > packed" option actually encode num + 1 just to support this usage. That's > > what we did for string and bytes prefix length encoding and it worked well. > > > > I considered this for error codes, but there the -1 case > (UNKNOWN_SERVER_ERROR) would be rare enough that it didn't feel like it was > worth the effort. But I overlooked other cases where -1 is used in a > similar way. I think the hardest thing here would be to come up with a > sensible name for this encoding, since it's not plain unsigned packed. I > suppose the name could include the offset, i.e. upacked32+1 for example. > Hmm. I think it would be reasonable to just have unsigned packed always work this way. There's not any major disadvantage. > > > Before we do all this, though, one simpler improvement would be making all > > the "error" fields into tagged fields. Most of them remain at 0 most of > > the time, so this could very well provide a big savings without any big > > encoding changes. > > > > Using a tag for error codes and messages is a good idea. > I guess I was suggesting doing this first, since it's easy? :) best, Colin > > Kind regards, > > Tom >