Re: [DISCUSS] KIP-82 - Add Record Headers

Colin McCabe Thu, 16 Feb 2017 10:29:55 -0800

+1 for varints here-- it would save quite a bit of space.  They are
pretty quick to implement as well.


I think it makes sense for values to be byte arrays.  Users might want
to attach arbitrary payloads; they shouldn't be forced to serialize
everything to Java strings.

best,
Colin


On Thu, Feb 16, 2017, at 09:52, Jason Gustafson wrote:
> Hey Michael,
> 
> Hmm, I guess the point of representing it as bytes is to allow the broker
> to pass it through opaquely? Is the cost of parsing them a concern, or
> are
> we simply trying to ensure that the broker stays agnostic to the format?
> 
> On varints, I think adding support for them makes less sense for an
> isolated use case, but as part of a more holistic change (such as what we
> have proposed in KIP-98), I think they are justifiable. If we add them,
> then the need to use attributes becomes quite a bit weaker, right? The
> other thing I find slightly odd is the fact that null headers has no
> actual
> semantic meaning for the message (unlike null keys and values). It is
> just
> a space optimization. It seems a bit better to always use size 0 to
> indicate having no headers.
> 
> Overall, the main point is ensuring that the message schema remains
> consistent, either within the larger protocol, or at a minimum within the
> message itself.
> 
> -Jason
> 
> On Thu, Feb 16, 2017 at 6:39 AM, Michael Pearce <[email protected]>
> wrote:
> 
> > Hi Jason,
> >
> > On point 1) in the message protocol the headers are simply a byte array,
> > as like the key or value, this is to clearly demarcate the header in the
> > core message. Then the header byte array in the core message is an array of
> > key, value pairs. This is what it is denoting.
> >
> > Then this would be I guess in the given notation:
> >
> > Headers => [KeyLength, Key, ValueLength, Value]
> >     KeyLength => int32 <-----------------NEW size of the byte[] of the
> > serialised key value
> >     Key => bytes <---------------------- NEW serialised string (UTF8)
> > bytes of the header key
> >     ValueLength => int32 <-------------- NEW size of the byte[] of the
> > serialised header value
> >     Value => bytes <-------------------- NEW serialised form of the header
> > value
> >
> > The key length and value length is matching the way the protocol is
> > defined in the core message currently.
> >
> >
> >
> >
> > On point 2)
> > Var sized ints, this was discussed much earlier on, in fact I had
> > suggested it myself (with Hadoop references), the complexity of this
> > compared to having a simpler protocol was argued and agreed it wasn’t worth
> > the complexity as all other clients in other languages would need to ensure
> > theyre using the right var size algorithm, as there is a few.
> >
> > On point 3)
> > We did the attributes, optional approach as originally there was marked
> > concern that headers would cause a message size overhead for others, who
> > don’t want them. As such this is the clean solution to achieve that. If
> > that no longer holds, and we don’t care that we add 4bytes overhead, then
> > im happy to remove.
> >
> > I’m personally in favour of keeping the message as small as possible so
> > people don’t get shocks in perf and throughputs dues to message size,
> > unless they actively use the feature, as such I do prefer the attribute bit
> > wise feature flag approach myself.
> >
> >
> >
> >
> > On 16/02/2017, 05:40, "Jason Gustafson" <[email protected]> wrote:
> >
> >     We have proposed a few significant changes to the message format in
> > KIP-98
> >     which now seems likely to pass (perhaps with some iterations on
> >     implementation details). It would be good to try and coordinate the
> > changes
> >     in both of the proposals to make sure they are consistent and
> > compatible.
> >
> >     I think using the attributes to indicate null headers is a reasonable
> >     approach. We have proposed to do the same thing for the message key and
> >     value. That said, I sympathize with Jay's argument. Having multiple
> > ways to
> >     specify a null value increases the overall complexity of the protocol.
> > You
> >     can see this just from the fact that you need the extra verbiage in the
> >     protocol specification in this KIP and in KIP-98 to describe the
> > dependence
> >     between the fields and the attributes. It seems like a slippery slope
> > if
> >     you start allowing different request types to implement the protocol
> >     specification differently.
> >
> >     You can also argue that the messages already are and are likely to
> > remain a
> >     special case. For example, there is currently no generality in how
> >     compressed message sets are represented that would be applicable for
> > other
> >     request types. Some might see this divergence as an unfortunate
> > protocol
> >     deficiency which should be fixed; others might see it as sort of the
> >     inevitability of needing to optimize where it counts most. I'm probably
> >     somewhere in between, but I think we probably all share the intuition
> > that
> >     the protocol should be kept as consistent as possible. With that in
> > mind,
> >     here are a few comments:
> >
> >     1. One thing I found a little odd when reading the current proposal is
> > that
> >     the headers are both represented as an array of bytes and as an array
> > of
> >     key/value pairs. I'd probably suggest something like this:
> >
> >     Headers => [HeaderKey HeaderValue]
> >      HeaderKey => String
> >      HeaderValue => Bytes
> >
> >     An array in the Kafka protocol is represented as a 4-byte integer
> >     indicating the number of elements in the array followed by the
> >     serialization of the elements. Unless I'm misunderstanding, what you
> > have
> >     instead is the total size of the headers in bytes followed by the
> > elements.
> >     I'm not sure I see any reason for this inconsistency.
> >
> >     2. In KIP-98, we've introduced variable-length integer fields.
> > Effectively,
> >     we've enriched (or "complicated" as Jay might say ;) the protocol
> >     specification to include the following types: VarInt, VarLong,
> >     UnsignedVarInt and UnsignedVarLong.
> >
> >     Along with these primitives, we could introduce the following types:
> >
> >     VarSizeArray => NumberOfItems Item1 Item2 .. ItemN
> >       NumberOfItems => UnsignedVarInt
> >
> >     VarSizeNullableArray => NumberOfItemsOrNull Item1 Item2 .. ItemN
> >       NumberOfItemsOrNull => VarInt (-1 means null)
> >
> >     And similarly for the `String` and `Bytes` types. These types can save
> > a
> >     considerable amount of space in this proposal because they can be used
> > for
> >     both the number of headers included in the message and the lengths of
> > the
> >     header keys and values. We could do this instead:
> >
> >     Headers => VarSizeArray[HeaderKey HeaderValue]
> >       HeaderKey => VarSizeString
> >       HeaderValue => VarSizeBytes
> >
> >     Combining the savings from the use of variable length fields, the
> > benefit
> >     of using the attributes to represent null seems pretty small.
> >
> >     3. Whichever way we go (whether we use the attributes or not), we
> > should at
> >     least be consistent between this KIP and KIP-98. It would be very
> > strange
> >     to have two ways to represent null values in the same schema. Either
> > way is
> >     OK with me. I think some message-level optimizations are justifiable,
> > but
> >     the savings here seem minimal (a few bytes per message), so maybe it's
> > not
> >     worth the cost of letting the message diverge even further from the
> > rest of
> >     the protocol.
> >
> >     -Jason
> >
> >
> >     On Wed, Feb 15, 2017 at 8:52 AM, radai <[email protected]>
> > wrote:
> >
> >     > I've trimmed the inline contents as this mail is getting too big for
> > the
> >     > apache mailing list software to deliver :-(
> >     >
> >     > 1. the important thing for interoperability is for different
> > "interested
> >     > parties" (plugins, infra layers/wrappers, user-code) to be able to
> > stick
> >     > pieces of metadata onto msgs without getting in each other's way. a
> > common
> >     > key scheme (Strings, as of the time of this writing?) is all thats
> > required
> >     > for that. it is assumed that the other end interested in any such
> > piece of
> >     > metadata knows the encoding, and byte[] provides for the most
> > flexibility.
> >     > i believe this is the same logic behind core kafka being
> > byte[]/byte[] -
> >     > Strings are more "usable" but bytes are flexible and so were chosen.
> >     > Also - core kafka doesnt even do that good of a job on usability of
> > the
> >     > payload (example - i have to specify the nop byte[] "decoders"
> > explicitly
> >     > in conf), and again sacrificies usability for the sake of
> > performance (no
> >     > convenient single-record processing as poll is a batch, lots of
> > obscure
> >     > little config details exposing internals of the batching mechanism,
> > etc)
> >     >
> >     > this is also why i really dislike the idea of a "type system" for
> > header
> >     > values, it further degrades the usability, adds complexity and will
> >     > eventually get in people's way, also, it would be the 2nd/3rd
> > home-group
> >     > serialization mechanism in core kafka (counting 2 iterations of the
> > "type
> >     > definition DSL")
> >     >
> >     > 2. this is an implementation detail, and not even a very "user
> > facing" one?
> >     > to the best of my understanding the vote process is on proposed
> >     > API/behaviour. also - since we're willing to go with strings just
> > serialize
> >     > a 0-sized header blob and IIUC you dont need any optionals anymore.
> >     >
> >     > 3. yes, we can :-)
> >     >
> >     > On Tue, Feb 14, 2017 at 11:56 PM, Michael Pearce <
> > [email protected]>
> >     > wrote:
> >     >
> >     > > Hi Jay,
> >     > >
> >     > > 1) There was some initial debate on the value part, as youll note
> > String,
> >     > > String headers were discounted early on. The reason for this is
> >     > flexibility
> >     > > and keeping in line with the flexibility of key, value of the
> > message
> >     > > object itself. I don’t think it takes away from an ecosystem as
> > each
> >     > plugin
> >     > > will care for their own key, this way ints, booleans , exotic
> > custom
> >     > binary
> >     > > can all be catered for=.
> >     > > a. If you really wanted to push for a typed value interface, I
> > wouldn’t
> >     > > want just String values supported, but the the primatives plus
> > string and
> >     > > also still keeping the ability to have a binary for custom
> > binaries that
> >     > > some organisations may have.
> >     > > i. I have written this slight alternative here,
> >     > https://cwiki.apache.org/
> >     > > confluence/display/KAFKA/KIP-82+-+Add+Record+Headers+-+Typed
> >     > > ii. Essentially the value bytes, has a leading byte overhead.
> >     > > 1.  This tells you what type the value is, before reading the rest
> > of the
> >     > > bytes, allowing serialisation/deserialization to and from the
> > primitives,
> >     > > string and byte[]. This is akin to some other messaging systems.
> >     > > 2) We are making it optional, so that for those not wanting
> > headers have
> >     > 0
> >     > > bytes overhead (think of it as a feature flag), I don’t think this
> > is
> >     > > complex, especially if comparing to changes proposed in other kips
> > like
> >     > > kip-98.
> >     > > a. If you really really don’t like this, we can drop it, but it
> > would
> >     > mean
> >     > > buying into 4 bytes extra overhead for users who do not want to use
> >     > headers.
> >     > > 3) In the summary yes, it is at a higher level, but I think this
> > is well
> >     > > documented in the proposed changes section.
> >     > > a. Added getHeaders method to Producer/Consumer record (that is it)
> >     > > b. We’ve also detailed the new Headers class that this method
> > returns
> >     > that
> >     > > encapsulates the headers protocol and logic.
> >     > >
> >     > > Best,
> >     > > Mike
> >     > >
> >     > > ==Original questions from the vote thread from Jay.==
> >     > >
> >     > > Couple of things I think we still need to work out:
> >     > >
> >     > >    1. I think we agree about the key, but I think we haven't
> > talked about
> >     > >    the value yet. I think if our goal is an open ecosystem of these
> >     > header
> >     > >    spread across many plugins from many systems we should consider
> > making
> >     > > this
> >     > >    a string as well so it can be printed, set via a UI, set in
> > config,
> >     > etc.
> >     > >    Basically encouraging pluggable serialization formats here will
> > lead
> >     > to
> >     > > a
> >     > >    bit of a tower of babel.
> >     > >    2. This proposal still includes a pretty big change to our
> >     > serialization
> >     > >    and protocol definition layer. Essentially it is introducing an
> >     > optional
> >     > >    type, where the format is data dependent. I think this is
> > actually a
> >     > big
> >     > >    change though it doesn't seem like it. It means you can no
> > longer
> >     > > specify
> >     > >    this type with our type definition DSL, and likewise it requires
> >     > custom
> >     > >    handling in client libs. This isn't a huge thing, since the
> > Record
> >     > >    definition is custom anyway, but I think this kind of protocol
> >     > >    inconsistency is very non-desirable and ties you to hand-coding
> >     > things.
> >     > > I
> >     > >    think the type should instead by [Key Value] in our BNF, where
> > key and
> >     > >    value are both short strings as used elsewhere. This brings it
> > in line
> >     > > with
> >     > >    the rest of the protocol.
> >     > >    3. Could we get more specific about the exact Java API change to
> >     > >    ProducerRecord, ConsumerRecord, Record, etc?
> >     > >
> >     > > -Jay
> >     > >
> >     >
> >
> >
> > The information contained in this email is strictly confidential and for
> > the use of the addressee only, unless otherwise indicated. If you are not
> > the intended recipient, please do not read, copy, use or disclose to others
> > this message or any attachment. Please also notify the sender by replying
> > to this email or by telephone (+44(020 7896 0011) and then delete the email
> > and any copies of it. Opinions, conclusion (etc) that do not relate to the
> > official business of this company shall be understood as neither given nor
> > endorsed by it. IG is a trading name of IG Markets Limited (a company
> > registered in England and Wales, company number 04008957) and IG Index
> > Limited (a company registered in England and Wales, company number
> > 01190902). Registered address at Cannon Bridge House, 25 Dowgate Hill,
> > London EC4R 2YA. Both IG Markets Limited (register number 195355) and IG
> > Index Limited (register number 114059) are authorised and regulated by the
> > Financial Conduct Authority.
> >

Re: [DISCUSS] KIP-82 - Add Record Headers

Reply via email to