I think it's well known I've been pushing for ints (and I could switch to
16 bit shorts if pressed).

- efficient (space)
- efficient (processing)
- easily partitionable


However, if the only thing keeping us from adopting headers is the use of
strings vs ints as keys, then I would cave in and accept strings. If we do
so, I would like to limit string keys to 128 bytes in length.  This way
1) I could use a 3-letter string if I wanted (effectively using 4 total
bytes: a 1-byte length plus 3 bytes of key), and 2) we limit the overall
impact of possible keys (I don't really want people to send a 16K header
string key).
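
To make the 4-byte figure concrete, here is a minimal sketch of a
length-prefixed string key with the 128-byte cap. The function names, the
UTF-8 encoding, and the rule that a key must be at least 1 byte are my own
assumptions for illustration, not something the KIP specifies.

```python
from typing import Tuple

MAX_KEY_BYTES = 128  # cap proposed above; fits in a single length byte


def encode_key(key: str) -> bytes:
    """Encode a header key as a 1-byte length followed by its UTF-8 bytes."""
    raw = key.encode("utf-8")  # assumption: keys are UTF-8 text
    if not 1 <= len(raw) <= MAX_KEY_BYTES:  # assumption: empty keys rejected
        raise ValueError(f"key must be 1..{MAX_KEY_BYTES} bytes, got {len(raw)}")
    return bytes([len(raw)]) + raw


def decode_key(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    """Return (key, next_offset) for a key written by encode_key."""
    length = buf[offset]
    if not 1 <= length <= MAX_KEY_BYTES:
        raise ValueError(f"invalid key length {length}")
    end = offset + 1 + length
    if end > len(buf):
        raise ValueError("truncated key")
    return buf[offset + 1:end].decode("utf-8"), end


encoded = encode_key("abc")
print(len(encoded))         # 3 key bytes + 1 length byte
print(decode_key(encoded))  # round-trips to the original key
```

A 16K key is rejected at encode time, so the worst-case per-key overhead
stays bounded at 129 bytes.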

Nacho


On Tue, Nov 8, 2016 at 3:35 PM, Gwen Shapira <g...@confluent.io> wrote:

> Forgot to mention: Thank you for quantifying the trade-off - it is
> helpful and important regardless of what we end up deciding.
>
> On Tue, Nov 8, 2016 at 3:12 PM, Sean McCauliff
> <smccaul...@linkedin.com.invalid> wrote:
> > On Tue, Nov 8, 2016 at 2:15 PM, Gwen Shapira <g...@confluent.io> wrote:
> >
> >> Since Kafka specifically targets high-throughput, low-latency
> >> use-cases, I don't think we should trade them off that easily.
> >>
> >
> > I don't find this kind of design goal really helpful unless it is
> > quantified in some way, because it is always possible to argue against
> > something as either not being performant or being just an implementation
> > detail.
> >
> > These are single-threaded benchmarks, so all the measurements are per
> > thread.
> >
> > At 1M messages/s/thread, each message gets 1 microsecond. If header keys
> > are ints and you have even a single header key/value pair, parsing it
> > takes about 2^-2 microseconds, which leaves only another 0.75
> > microseconds to do everything else you want to do with a message. With
> > string header keys there is still 0.5 microseconds to process a message.
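
The per-message budget quoted above can be checked with a few lines of
arithmetic. This is only a sketch: the function name is made up, the 0.25
microsecond int-key cost is the "2^-2" figure from the message, and the 0.5
microsecond string-key cost is inferred from the 0.5 microseconds said to
remain, not taken from the benchmark document itself.

```python
def remaining_budget_us(messages_per_second: float, header_cost_us: float) -> float:
    """Microseconds left per message after paying for header parsing."""
    per_message_us = 1_000_000 / messages_per_second
    return per_message_us - header_cost_us


INT_KEY_COST_US = 0.25    # "about 2^-2 microseconds" for one int-keyed header
STRING_KEY_COST_US = 0.5  # assumption: inferred from the remaining 0.5 us

print(remaining_budget_us(1_000_000, INT_KEY_COST_US))     # int keys
print(remaining_budget_us(1_000_000, STRING_KEY_COST_US))  # string keys
```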
> >
> >
> >
> >> I love strings as much as the next guy (we had them in Flume), but I
> >> was convinced by Magnus/Michael/Radai that strings don't actually have
> >> strong benefits as opposed to ints (you'll need a string registry
> >> anyway - otherwise, how will you know what the "profile_id" header
> >> refers to?) and I want to keep closer to our original design goals
> >> for Kafka.
> >>
> >
> > "confluent.profile_id"
> >
> >
> >>
> >> If someone likes strings in the headers and doesn't do millions of
> >> messages a sec, they probably have lots of other systems they can use
> >> instead.
> >>
> >
> > None of them will scale like Kafka.  Horizontal scaling is still good.
> >
> >
> >>
> >>
> >> On Tue, Nov 8, 2016 at 1:22 PM, Sean McCauliff
> >> <smccaul...@linkedin.com.invalid> wrote:
> >> > +1 for String keys.
> >> >
> >> > I've been doing some benchmarking, and it seems the speedup from using
> >> > integer keys is a factor of about 2-5, depending on the length of the
> >> > strings and which collections are being used.  The overall amount of
> >> > time spent parsing a set of header key/value pairs probably does not
> >> > matter unless you are getting close to 1M messages per second per
> >> > consumer, in which case you probably shouldn't use headers.  There is
> >> > also the option to use very short strings, some of which are even
> >> > shorter than integers.
> >> >
> >> > Partitioning the string key space will be easier than partitioning an
> >> > integer key space. We won't need a global registry.  Kafka can
> >> > internally reserve some prefix like "_" as its namespace.  Everyone
> >> > else can use their company or project name as a namespace prefix and
> >> > life should be good.
> >> >
> >> > Here's the link to some of the benchmarking info:
> >> > https://docs.google.com/document/d/1tfT-6SZdnKOLyWGDH82kS30PnUkmgb7nPLdw6p65pAI/edit?usp=sharing
> >> >
> >> >
> >> >
> >> > --
> >> > Sean McCauliff
> >> > Staff Software Engineer
> >> > Kafka
> >> >
> >> > smccaul...@linkedin.com
> >> > linkedin.com/in/sean-mccauliff-b563192
> >> >
> >> > On Mon, Nov 7, 2016 at 11:51 PM, Michael Pearce <michael.pea...@ig.com> wrote:
> >> >
> >> >> +1 on this slimmer version of our proposal.
> >> >>
> >> >> I definitely think we can reduce the Id space from the proposed int32
> >> >> (4 bytes) down to int16 (2 bytes). It saves space, and since these are
> >> >> headers we wouldn't expect the number of headers in use concurrently
> >> >> to be that high.
> >> >>
> >> >> I wondered whether we should still make the value byte array length
> >> >> int32, as this is the standard max array length in Java. That said,
> >> >> it is a header, and I guess limiting the size is sensible and would
> >> >> work for all the use cases we have in mind, so I'm happy with limiting
> >> >> this.
> >> >>
> >> >> Do people generally concur on Magnus's slimmer version? Does anyone see
> >> >> any issues if we move from int32 to int16?
> >> >>
> >> >> Re configurable ids per plugin instead of a global registry: that would
> >> >> also work for us.  If it has better consensus than the proposed global
> >> >> registry, I'd be happy to change that.
> >> >>
> >> >> I was already sold on ints over strings for keys ;)
> >> >>
> >> >> Cheers
> >> >> Mike
> >> >>
> >> >> ________________________________________
> >> >> From: Magnus Edenhill <mag...@edenhill.se>
> >> >> Sent: Monday, November 7, 2016 10:10:21 PM
> >> >> To: dev@kafka.apache.org
> >> >> Subject: Re: [DISCUSS] KIP-82 - Add Record Headers
> >> >>
> >> >> Hi,
> >> >>
> >> >> I'm +1 for adding generic message headers, but I do share the concerns
> >> >> previously aired on this thread and during the KIP meeting.
> >> >>
> >> >> So let me propose a slimmer alternative that does not require any sort
> >> >> of global header registry, does not affect broker performance or
> >> >> operations, and adds as little overhead as possible.
> >> >>
> >> >>
> >> >> Message
> >> >> ------------
> >> >> The protocol Message type is extended with a Headers array consisting
> >> >> of Tags, where a Tag is defined as:
> >> >>    int16 Id
> >> >>    int16 Len              // binary_data length
> >> >>    binary_data[Len]  // opaque binary data
> >> >>
> >> >>
> >> >> Ids
> >> >> ---
> >> >> The Id space is not centrally managed, so whenever an application
> >> >> needs to add headers, or use an eco-system plugin that does, its Id
> >> >> allocation will need to be manually configured.
> >> >> This moves the allocation concern from the global space down to
> >> >> organization level and avoids the risk of id conflicts.
> >> >> Example pseudo-config for some app:
> >> >>     sometrackerplugin.tag.sourcev3.id=1000
> >> >>     dbthing.tag.tablename.id=1001
> >> >>     myschemareg.tag.schemaname.id=1002
> >> >>     myschemareg.tag.schemaversion.id=1003
> >> >>
> >> >>
> >> >> Each header-writing or header-reading plugin must provide means
> >> >> (typically through configuration) to specify the tag for each header
> >> >> it uses. Defaults should be avoided.
> >> >> A consumer silently ignores tags it does not have a mapping for (since
> >> >> the binary_data can't be parsed without knowing what it is).
> >> >>
> >> >> Id range 0..999 is reserved for future use by the broker and must not
> >> >> be used by plugins.
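
As a rough illustration of the Tag layout and Id rules quoted above, the
following sketch packs and reads tags. It assumes big-endian (network byte
order) fields and simply concatenates tags back to back; the real Headers
array would likely carry its own framing, and every function name and the
id-to-name mapping format here are hypothetical.

```python
import struct
from typing import Dict, List, Tuple

RESERVED_MAX_ID = 999  # ids 0..999 are reserved for the broker in the proposal


def check_plugin_id(tag_id: int) -> int:
    """Reject ids that fall in the range reserved for the broker."""
    if tag_id <= RESERVED_MAX_ID:
        raise ValueError(f"id {tag_id} is in the reserved range 0..{RESERVED_MAX_ID}")
    return tag_id


def encode_tag(tag_id: int, data: bytes) -> bytes:
    """Pack one Tag: int16 Id, int16 Len, then Len bytes of opaque data."""
    return struct.pack(">hh", tag_id, len(data)) + data


def decode_tags(buf: bytes) -> List[Tuple[int, bytes]]:
    """Split a buffer of back-to-back Tags into (id, data) pairs."""
    tags, offset = [], 0
    while offset < len(buf):
        tag_id, length = struct.unpack_from(">hh", buf, offset)
        offset += 4
        if length < 0 or offset + length > len(buf):
            raise ValueError("malformed tag")
        tags.append((tag_id, buf[offset:offset + length]))
        offset += length
    return tags


def read_headers(buf: bytes, id_to_name: Dict[int, str]) -> Dict[str, bytes]:
    """Keep tags with a configured mapping; silently ignore all others."""
    return {id_to_name[i]: d for i, d in decode_tags(buf) if i in id_to_name}


# Ids mirror the pseudo-config above; this consumer has no mapping for 1000.
config = {1001: "dbthing.tablename", 1002: "myschemareg.schemaname"}
wire = (encode_tag(1000, b"\x01")
        + encode_tag(check_plugin_id(1001), b"orders")
        + encode_tag(check_plugin_id(1002), b"user"))
headers = read_headers(wire, config)
print(headers)
```

Each tag costs a fixed 4 bytes of overhead on top of its data, and the
consumer never needs to understand the payload of a tag it has no mapping
for.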
> >> >>
> >> >>
> >> >>
> >> >> Broker
> >> >> ---------
> >> >> The broker does not process the tags (other than the standard protocol
> >> >> syntax verification); it simply stores and forwards them as opaque
> >> >> data.
> >> >>
> >> >> Standard message translation (removal of Headers) kicks in for older
> >> >> clients.
> >> >>
> >> >>
> >> >> Why not string ids?
> >> >> -------------------------
> >> >> String ids might seem like a good idea, but they:
> >> >>  * do not really solve uniqueness
> >> >>  * consume a lot of space (2 byte string length + string, per header)
> >> >>    to be meaningful
> >> >>  * don't really say anything about how to parse the tag's data, so
> >> >>    they are in effect useless on their own.
> >> >>
> >> >>
> >> >> Regards,
> >> >> Magnus
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> 2016-11-07 18:32 GMT+01:00 Michael Pearce <michael.pea...@ig.com>:
> >> >>
> >> >> > Hi Roger,
> >> >> >
> >> >> > Thanks for the support.
> >> >> >
> >> >> > I think the key thing is to have a common key space to make an
> >> >> > ecosystem; there does have to be some level of contract for people
> >> >> > to play nicely.
> >> >> >
> >> >> > Having map<String, byte[]>, or, as currently proposed in the KIP, a
> >> >> > numerical key space of map<int, byte[]>, is a level of contract that
> >> >> > most people would expect.
> >> >> >
> >> >> > A good example is the one someone made in a previous comment, linking
> >> >> > to the AWS blog and the implemented API: originally they didn't have
> >> >> > a header space but now they do, where keys are uniform but the value
> >> >> > can be a string, an int, anything.
> >> >> >
> >> >> > Having a custom MetadataSerializer is something we had played with,
> >> >> > but we discounted the idea: if you want everyone in the ecosystem to
> >> >> > work the same way, having this also be customizable makes it a bit
> >> >> > harder. Think about making the whole message record custom
> >> >> > serializable; it would be fairly tricky (though not impossible) to
> >> >> > make that work nicely. Having the value customizable is, we thought,
> >> >> > a reasonable tradeoff of flexibility against a contract of
> >> >> > interaction between different parties.
> >> >> >
> >> >> > Is there a particular case or benefit of having serialization
> >> >> > customizable that you have in mind?
> >> >> >
> >> >> > That said, it is obviously something that could be implemented if
> >> >> > there is a need. If we did go down this avenue, I think a default
> >> >> > serializer implementation should exist so that, per the 80:20 rule,
> >> >> > people can just have the broker and clients get default behavior.
> >> >> >
> >> >> > Cheers
> >> >> > Mike
> >> >> >
> >> >> > On 11/6/16, 5:25 PM, "radai" <radai.rosenbl...@gmail.com> wrote:
> >> >> >
> >> >> >     making header _key_ serialization configurable potentially
> >> >> >     undermines the broad usefulness of the feature (any point along
> >> >> >     the path must be able to read the header keys. the values may be
> >> >> >     whatever and require more intimate knowledge of the code that
> >> >> >     produced specific headers, but keys should be universally
> >> >> >     readable).
> >> >> >
> >> >> >     it would also make it hard to write really portable plugins - say
> >> >> >     i wrote a large message splitter/combiner - if i rely on key
> >> >> >     "largeMessage" and values of the form "1/20" someone who uses
> >> >> >     (contrived example) Map<Byte[], Double> wouldn't be able to re-use
> >> >> >     my code.
> >> >> >
> >> >> >     not the end of the world within an organization, but problematic
> >> >> >     if you want to enable an ecosystem
> >> >> >
> >> >> >     On Thu, Nov 3, 2016 at 2:04 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> >> >> >
> >> >> >     > As others have laid out, I see strong reasons for a common
> >> >> >     > message metadata structure for the Kafka ecosystem.  In
> >> >> >     > particular, I've seen that even within a single organization,
> >> >> >     > infrastructure teams often own the message metadata while
> >> >> >     > application teams own the application-level data format.
> >> >> >     > Allowing metadata and content to have different structures and
> >> >> >     > evolve separately is very helpful for this.  Also, I think
> >> >> >     > there's a lot of value in having a common metadata structure
> >> >> >     > shared across the Kafka ecosystem so that tools which leverage
> >> >> >     > metadata can more easily be shared across organizations and
> >> >> >     > integrated together.
> >> >> >     >
> >> >> >     > The question is, where does the metadata structure belong?
> >> >> >     > Here's my take:
> >> >> >     >
> >> >> >     > We change the Kafka wire and on-disk format from a (key, value)
> >> >> >     > model to a (key, metadata, value) model where all three are byte
> >> >> >     > arrays from the broker's point of view.  The primary reason for
> >> >> >     > this is that it provides a backward compatible migration path
> >> >> >     > forward.  Producers can start populating metadata fields before
> >> >> >     > all consumers understand the metadata structure.  People who
> >> >> >     > already have custom envelope structures can populate both their
> >> >> >     > existing structure and the new structure for a while as they
> >> >> >     > make the transition.
> >> >> >     >
> >> >> >     > We could stop there and let the clients plug in a KeySerializer,
> >> >> >     > MetadataSerializer, and ValueSerializer, but I think it would
> >> >> >     > also be useful to have a default MetadataSerializer that
> >> >> >     > implements a key-value model similar to AMQP or HTTP headers.
> >> >> >     > Or we could go even further and prescribe a Map<String, byte[]>
> >> >> >     > or Map<String, String> data model for headers in the clients
> >> >> >     > (while still allowing custom serialization of the header data
> >> >> >     > model).
> >> >> >     >
> >> >> >     > I think this would address Radai's concerns:
> >> >> >     > 1. Client code would not need to be updated to know about the
> >> >> >     >    container.
> >> >> >     > 2. Middleware friendly clients would have a standard header data
> >> >> >     >    model to work with.
> >> >> >     > 3. A KIP is required both because of broker changes and because
> >> >> >     >    of client API changes.
> >> >> >     >
> >> >> >     > Cheers,
> >> >> >     >
> >> >> >     > Roger
> >> >> >     >
> >> >> >     >
> >> >> >     > On Wed, Nov 2, 2016 at 4:38 PM, radai <radai.rosenbl...@gmail.com> wrote:
> >> >> >     >
> >> >> >     > > my biggest issues with a "standard" wrapper format:
> >> >> >     > >
> >> >> >     > > 1. _ALL_ client _CODE_ (as opposed to kafka lib version) must
> >> >> >     > >    be updated to know about the container, because any old
> >> >> >     > >    naive code trying to directly deserialize its own payload
> >> >> >     > >    would keel over and die (it needs to know to deserialize a
> >> >> >     > >    container, and then dig in there for its payload).
> >> >> >     > > 2. in order to write middleware-friendly clients that utilize
> >> >> >     > >    such a container one would basically have to write their own
> >> >> >     > >    producer/consumer API on top of the open source kafka one.
> >> >> >     > > 3. if you were going to go with a wrapper format you really
> >> >> >     > >    don't need to bother with a kip (just open source your own
> >> >> >     > >    client stack from #2 above so others could stop re-inventing
> >> >> >     > >    it)
> >> >> >     > >
> >> >> >     > > On Wed, Nov 2, 2016 at 4:25 PM, James Cheng <wushuja...@gmail.com> wrote:
> >> >> >     > >
> >> >> >     > > > How exactly would this work? Or maybe that's out of scope for
> >> >> >     > > > this email.
> >> >> >     > >
> >> >> >     >
> >> >> >
> >> >> >
> >> >> > The information contained in this email is strictly confidential
> and
> >> for
> >> >> > the use of the addressee only, unless otherwise indicated. If you
> are
> >> not
> >> >> > the intended recipient, please do not read, copy, use or disclose
> to
> >> >> others
> >> >> > this message or any attachment. Please also notify the sender by
> >> replying
> >> >> > to this email or by telephone (+44(020 7896 0011) and then delete
> the
> >> >> email
> >> >> > and any copies of it. Opinions, conclusion (etc) that do not
> relate to
> >> >> the
> >> >> > official business of this company shall be understood as neither
> given
> >> >> nor
> >> >> > endorsed by it. IG is a trading name of IG Markets Limited (a
> company
> >> >> > registered in England and Wales, company number 04008957) and IG
> Index
> >> >> > Limited (a company registered in England and Wales, company number
> >> >> > 01190902). Registered address at Cannon Bridge House, 25 Dowgate
> Hill,
> >> >> > London EC4R 2YA. Both IG Markets Limited (register number 195355)
> and
> >> IG
> >> >> > Index Limited (register number 114059) are authorised and
> regulated by
> >> >> the
> >> >> > Financial Conduct Authority.
> >> >> >
> >> >>
> >>
> >>
> >>
> >> --
> >> Gwen Shapira
> >> Product Manager | Confluent
> >> 650.450.2760 | @gwenshap
> >> Follow us: Twitter | blog
> >>
>
>
>
>



-- 
Nacho (Ignacio) Solis
Kafka
nso...@linkedin.com
