Hi Colin,

Thanks for taking a look at it. Replies inline...

On Mon, Jun 15, 2020 at 6:22 PM Colin McCabe <cmcc...@apache.org> wrote:

> Hi Tom,
>
> It's an interesting idea.  Obviously protocol buffers does this for all
> numeric fields.
>
> I have to admit I have some mixed feelings, since this is another thing
> that makes encoding more complex.  And it's not a clear win in all cases,
> although it is in some.
>

Yeah, both these points are valid. Obviously I chose MetadataResponse
because the benefits are more apparent, but there are plenty of RPCs where
it wouldn't make so much difference. I provided the benchmark numbers to
try to make it a little easier, but of course they only illuminate the
MetadataResponse case. It wouldn't be so very much effort to generate
benchmarks for each RPC (driven off a JSON representation of a "typical"
message) which would make it easier to judge on a case-by-case basis (both
this and other protocol changes).

I assume that the performance numbers here are for the old Struct-based
> encoding / decoding system.  Do we know what the numbers are when using the
> generated read and write functions?
>

I've added them to the proposal.


> I don't think it makes sense for type to be version-dependent.  Type
> ultimately translates into what Java type we should use to represent the
> field when it's in POJO form.  We can't have two types there.
>

I've removed that section from the proposal.


> Making encoding version-dependent is reasonable.  I do sort of wonder if
> we should just have something like "packedVersions" : "9+" rather than the
> "encoding" thing, though.  The latter is more conceptually elegant but it
> seems like it would be a pain to use.  For example, for a new field you
> would have to type "encoding": { "0+" : "packed" } which is kind of ugly.
>

Yeah, I wondered about having this on the message level.

Having a message-level "packedVersions" wouldn't help with evolving types
from 32 to 64 bits. This is something I briefly alluded to in the initial
version of the KIP, but I've now elaborated on it. TBH I'm still not sure
how useful that would be (the review process seems to do a good job of
considering field sizes, so perhaps there are no fields where, in
hindsight, using a wider field would have been beneficial). So I've added
this for now, and am including the number of bits in the encoding name, but
I can easily drop this if it's irrelevant.

To alleviate the burden of using encoding for new fields I've allowed
"encoding" to support string values as well as JSON objects, e.g.
"encoding": "packed", as a shortcut when the same encoding was used for all
versions.

In any case, I'd be happy to implement support for this via a message-level
"packedVersions" is that's what the consensus of this discussion is.


> Another thing here is that if we are going to go all this way for
> optimization, we should certainly give people the choice of whether to use
> zigzag encoding or not.  For fields that can never be negative, zigzag
> encoding is a waste.  So you would then have the option of unpacked, signed
> packed, and unsigned packed.
>

That is what's proposed in the KIP, though maybe I explained it poorly.


> Finally, the Kafka protocol has a lot of fields which can never be
> negative, except for some special cases where they are -1.  But no other
> negative numbers are allowed.  So we should consider making the "unsigned
> packed" option actually encode num + 1 just to support this usage.  That's
> what we did for string and bytes prefix length encoding and it worked well.
>

I considered this for error codes, but there the -1 case
(UNKNOWN_SERVER_ERROR) would be rare enough that it didn't feel like it was
worth the effort. But I overlooked other cases where -1 is used in a
similar way. I think the hardest thing here would be to come up with a
sensible name for this encoding, since it's not plain unsigned packed. I
suppose the name could include the offset, i.e. upacked32+1 for example.


> Before we do all this, though, one simpler improvement would be making all
> the "error" fields into tagged fields.  Most of them remain at 0 most of
> the time, so this could very well provide a big savings without any big
> encoding changes.
>

Using a tag for error codes and messages is a good idea.

Kind regards,

Tom

Reply via email to