I think it's well known I've been pushing for ints (and I could switch to 16 bit shorts if pressed).
- efficient (space)
- efficient (processing)
- easily partitionable

However, if the only thing that is keeping us from adopting headers is the
use of strings vs ints as keys, then I would cave in and accept strings. If
we do so, I would like to limit string keys to 128 bytes in length. This
way 1) I could use a 3 letter string if I wanted (effectively using 4 total
bytes), and 2) we limit the overall impact of possible keys (I don't really
want people to send a 16K header string key).

Nacho

On Tue, Nov 8, 2016 at 3:35 PM, Gwen Shapira <g...@confluent.io> wrote:

> Forgot to mention: Thank you for quantifying the trade-off - it is
> helpful and important regardless of what we end up deciding.
>
> On Tue, Nov 8, 2016 at 3:12 PM, Sean McCauliff
> <smccaul...@linkedin.com.invalid> wrote:
> > On Tue, Nov 8, 2016 at 2:15 PM, Gwen Shapira <g...@confluent.io> wrote:
> >
> >> Since Kafka specifically targets high-throughput, low-latency
> >> use-cases, I don't think we should trade them off that easily.
> >>
> >
> > I find these kinds of design goals not really helpful unless they are
> > quantified in some way, because it's always possible to argue against
> > something as either being not performant or just an implementation
> > detail.
> >
> > This is a single-threaded benchmark, so all the measurements are per
> > thread.
> >
> > For 1M messages/s/thread, if header keys are ints and you have even a
> > single header key, value pair, then parsing still takes about 2^-2
> > microseconds, which means you only have another 0.75 microseconds to do
> > everything else you want to do with a message (1M messages/s means 1
> > microsecond per message). With string header keys there is still 0.5
> > microseconds to process a message.
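A quick check of the arithmetic quoted above, as a sketch only. The 2^-2 microsecond figure for int keys is the one quoted; the 0.5 microsecond cost for string keys is inferred from the "0.5 microseconds left" statement, and the function name `budget_left_us` is made up for illustration. These are not new measurements.

```python
# Per-message time budget at a given per-thread throughput.
# Parse costs are the rough figures from the quoted benchmark discussion.

def budget_left_us(messages_per_second: float, parse_cost_us: float) -> float:
    """Microseconds left per message once header parsing is paid for."""
    per_message_us = 1_000_000 / messages_per_second
    return per_message_us - parse_cost_us

RATE = 1_000_000             # 1M messages/s/thread
INT_KEY_PARSE_US = 2 ** -2   # about 0.25 us with int keys (quoted figure)
STRING_KEY_PARSE_US = 0.5    # inferred from the 0.5 us said to remain

print(budget_left_us(RATE, INT_KEY_PARSE_US))     # 0.75
print(budget_left_us(RATE, STRING_KEY_PARSE_US))  # 0.5
```

At a tenth of that rate the same parse costs are a small fraction of the 10 microsecond budget, which is the point made elsewhere in the thread about consumers well below 1M messages/s.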
> >
> >
> >> I love strings as much as the next guy (we had them in Flume), but I
> >> was convinced by Magnus/Michael/Radai that strings don't actually have
> >> strong benefits as opposed to ints (you'll need a string registry
> >> anyway - otherwise, how will you know what the "profile_id"
> >> header refers to?) and I want to keep closer to our original design
> >> goals for Kafka.
> >>
> >
> > "confluent.profile_id"
> >
> >
> >>
> >> If someone likes strings in the headers and doesn't do millions of
> >> messages a sec, they probably have lots of other systems they can use
> >> instead.
> >>
> >
> > None of them will scale like Kafka. Horizontal scaling is still good.
> >
> >
> >>
> >>
> >> On Tue, Nov 8, 2016 at 1:22 PM, Sean McCauliff
> >> <smccaul...@linkedin.com.invalid> wrote:
> >> > +1 for String keys.
> >> >
> >> > I've been doing some benchmarking and it seems like the speedup for
> >> > using integer keys is about 2-5x, depending on the length of the
> >> > strings and what collections are being used. The overall amount of
> >> > time spent parsing a set of header key, value pairs probably does not
> >> > matter unless you are getting close to 1M messages/s per consumer, in
> >> > which case you probably shouldn't use headers. There is also the
> >> > option to use very short strings; some that are even shorter than
> >> > integers.
> >> >
> >> > Partitioning the string key space will be easier than partitioning an
> >> > integer key space. We won't need a global registry. Kafka internally
> >> > can reserve some prefix like "_" as its namespace. Everyone else can
> >> > use their company or project name as namespace prefix and life should
> >> > be good.
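To make the string-key option concrete, here is a minimal sketch combining the 128-byte limit proposed at the top of this mail with the "_" reserved prefix and the company/project namespace idea quoted above. Everything specific here is an assumption for illustration: UTF-8 encoding, a single length byte in front of the key (which is what makes a 3 letter key cost 4 bytes), "." as the namespace separator (taken from the "confluent.profile_id" example), and the function names themselves.

```python
MAX_KEY_BYTES = 128     # proposed upper bound on string key length
RESERVED_PREFIX = "_"   # prefix suggested for Kafka-internal keys

def encode_key(key: str) -> bytes:
    """Encode a key as one length byte followed by its UTF-8 bytes (assumed layout)."""
    raw = key.encode("utf-8")
    if not 1 <= len(raw) <= MAX_KEY_BYTES:
        raise ValueError(f"key must be 1..{MAX_KEY_BYTES} bytes, got {len(raw)}")
    return bytes([len(raw)]) + raw

def is_reserved(key: str) -> bool:
    """True if the key falls in the namespace reserved for Kafka itself."""
    return key.startswith(RESERVED_PREFIX)

def namespace_of(key: str) -> str:
    """Return the part before the first '.', or '' if the key has no namespace."""
    namespace, separator, _ = key.partition(".")
    return namespace if separator else ""

print(len(encode_key("abc")))                # 4
print(namespace_of("confluent.profile_id"))  # confluent
```

A single length byte is enough because 128 fits in one unsigned byte; a 16K key would be rejected by `encode_key` rather than written.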
> >> >
> >> > Here's the link to some of the benchmarking info:
> >> > https://docs.google.com/document/d/1tfT-6SZdnKOLyWGDH82kS30PnUkmgb7nPLdw6p65pAI/edit?usp=sharing
> >> >
> >> >
> >> >
> >> > --
> >> > Sean McCauliff
> >> > Staff Software Engineer
> >> > Kafka
> >> >
> >> > smccaul...@linkedin.com
> >> > linkedin.com/in/sean-mccauliff-b563192
> >> >
> >> > On Mon, Nov 7, 2016 at 11:51 PM, Michael Pearce <michael.pea...@ig.com>
> >> > wrote:
> >> >
> >> >> +1 on this slimmer version of our proposal
> >> >>
> >> >> I def think we can reduce the Id space from the proposed int32
> >> >> (4 bytes) down to int16 (2 bytes). It saves on space, and we wouldn't
> >> >> expect the number of headers being used concurrently to be that high.
> >> >>
> >> >> I do wonder whether we should still make the value byte array length
> >> >> int32, as this is the standard max array length in Java. That said,
> >> >> it is a header, and I guess limiting the size is sensible and would
> >> >> work for all the use cases we have in mind, so I'm happy with limiting
> >> >> this.
> >> >>
> >> >> Do people generally concur on Magnus's slimmer version? Anyone see any
> >> >> issues if we moved from int32 to int16?
> >> >>
> >> >> Re configurable ids per plugin over a global registry: this would also
> >> >> work for us. As such, if this has better consensus than the proposed
> >> >> global registry, I'd be happy to change that.
> >> >>
> >> >> I was already sold on ints over strings for keys ;)
> >> >>
> >> >> Cheers
> >> >> Mike
> >> >>
> >> >> ________________________________________
> >> >> From: Magnus Edenhill <mag...@edenhill.se>
> >> >> Sent: Monday, November 7, 2016 10:10:21 PM
> >> >> To: dev@kafka.apache.org
> >> >> Subject: Re: [DISCUSS] KIP-82 - Add Record Headers
> >> >>
> >> >> Hi,
> >> >>
> >> >> I'm +1 for adding generic message headers, but I do share the concerns
> >> >> previously aired on this thread and during the KIP meeting.
> >> >>
> >> >> So let me propose a slimmer alternative that does not require any sort
> >> >> of global header registry, does not affect broker performance or
> >> >> operations, and adds as little overhead as possible.
> >> >>
> >> >>
> >> >> Message
> >> >> ------------
> >> >> The protocol Message type is extended with a Headers array consisting
> >> >> of Tags, where a Tag is defined as:
> >> >>    int16 Id
> >> >>    int16 Len          // binary_data length
> >> >>    binary_data[Len]   // opaque binary data
> >> >>
> >> >>
> >> >> Ids
> >> >> ---
> >> >> The Id space is not centrally managed, so whenever an application needs
> >> >> to add headers, or use an eco-system plugin that does, its Id
> >> >> allocation will need to be manually configured.
> >> >> This moves the allocation concern from the global space down to
> >> >> organization level and avoids the risk for id conflicts.
> >> >> Example pseudo-config for some app:
> >> >>     sometrackerplugin.tag.sourcev3.id=1000
> >> >>     dbthing.tag.tablename.id=1001
> >> >>     myschemareg.tag.schemaname.id=1002
> >> >>     myschemareg.tag.schemaversion.id=1003
> >> >>
> >> >>
> >> >> Each header-writing or header-reading plugin must provide means
> >> >> (typically through configuration) to specify the tag for each header
> >> >> it uses. Defaults should be avoided.
> >> >> A consumer silently ignores tags it does not have a mapping for (since
> >> >> the binary_data can't be parsed without knowing what it is).
> >> >>
> >> >> Id range 0..999 is reserved for future use by the broker and must not
> >> >> be used by plugins.
> >> >>
> >> >>
> >> >>
> >> >> Broker
> >> >> ---------
> >> >> The broker does not process the tags (other than the standard protocol
> >> >> syntax verification), it simply stores and forwards them as opaque
> >> >> data.
> >> >>
> >> >> Standard message translation (removal of Headers) kicks in for older
> >> >> clients.
> >> >>
> >> >>
> >> >> Why not string ids?
> >> >> -------------------------
> >> >> String ids might seem like a good idea, but:
> >> >>  * does not really solve uniqueness
> >> >>  * consumes a lot of space (2 byte string length + string, per header)
> >> >>    to be meaningful
> >> >>  * doesn't really say anything about how to parse the tag's data, so it
> >> >>    is in effect useless on its own.
> >> >>
> >> >>
> >> >> Regards,
> >> >> Magnus
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> 2016-11-07 18:32 GMT+01:00 Michael Pearce <michael.pea...@ig.com>:
> >> >>
> >> >> > Hi Roger,
> >> >> >
> >> >> > Thanks for the support.
> >> >> >
> >> >> > I think the key thing is to have a common key space to make an
> >> >> > ecosystem; there does have to be some level of contract for people
> >> >> > to play nicely.
> >> >> >
> >> >> > Having map<String, byte[]> or, as currently proposed in the KIP, a
> >> >> > numerical key space of map<int, byte[]> is a level of contract that
> >> >> > most people would expect.
> >> >> >
> >> >> > I think the example in a previous comment someone else made, linking
> >> >> > to the AWS blog and the implemented API, where originally they didn't
> >> >> > have a header space but now they do, and where keys are uniform but
> >> >> > the value can be string, int, anything, is a good example.
> >> >> >
> >> >> > Having a custom MetadataSerializer is something we had played with,
> >> >> > but we discounted the idea: if you want everyone to work the same
> >> >> > way in the ecosystem, having this also customizable makes it a bit
> >> >> > harder. Think about making the whole message record custom
> >> >> > serializable; this would make it fairly tricky (though not
> >> >> > impossible) to make things work nicely. Having the value customizable
> >> >> > is, we thought, a reasonable tradeoff here of flexibility over
> >> >> > contract of interaction between different parties.
> >> >> >
> >> >> > Is there a particular case or benefit of having serialization
> >> >> > customizable that you have in mind?
> >> >> >
> >> >> > That said, it is obviously something that could be implemented if
> >> >> > there is a need. If we did go down this avenue, I think a default
> >> >> > serializer implementation should exist so that, per the 80:20 rule,
> >> >> > people can just have the broker and clients get default behavior.
> >> >> >
> >> >> > Cheers
> >> >> > Mike
> >> >> >
> >> >> > On 11/6/16, 5:25 PM, "radai" <radai.rosenbl...@gmail.com> wrote:
> >> >> >
> >> >> >     making header _key_ serialization configurable potentially
> >> >> >     undermines the broad usefulness of the feature (any point along
> >> >> >     the path must be able to read the header keys. the values may be
> >> >> >     whatever and require more intimate knowledge of the code that
> >> >> >     produced specific headers, but keys should be universally
> >> >> >     readable).
> >> >> >
> >> >> >     it would also make it hard to write really portable plugins - say
> >> >> >     i wrote a large message splitter/combiner - if i rely on key
> >> >> >     "largeMessage" and values of the form "1/20" someone who uses
> >> >> >     (contrived example) Map<Byte[], Double> wouldnt be able to re-use
> >> >> >     my code.
> >> >> >
> >> >> >     not the end of the world within an organization, but problematic
> >> >> >     if you want to enable an ecosystem
> >> >> >
> >> >> >     On Thu, Nov 3, 2016 at 2:04 PM, Roger Hoover
> >> >> >     <roger.hoo...@gmail.com> wrote:
> >> >> >
> >> >> >     > As others have laid out, I see strong reasons for a common
> >> >> >     > message metadata structure for the Kafka ecosystem.
> >> >> >     > In particular, I've seen that even within a single
> >> >> >     > organization, infrastructure teams often own the message
> >> >> >     > metadata while application teams own the application-level data
> >> >> >     > format. Allowing metadata and content to have different
> >> >> >     > structure and evolve separately is very helpful for this. Also,
> >> >> >     > I think there's a lot of value to having a common metadata
> >> >> >     > structure shared across the Kafka ecosystem so that tools which
> >> >> >     > leverage metadata can more easily be shared across
> >> >> >     > organizations and integrated together.
> >> >> >     >
> >> >> >     > The question is, where does the metadata structure belong?
> >> >> >     > Here's my take:
> >> >> >     >
> >> >> >     > We change the Kafka wire and on-disk format from a (key, value)
> >> >> >     > model to a (key, metadata, value) model where all three are
> >> >> >     > byte arrays from the broker's point of view. The primary reason
> >> >> >     > for this is that it provides a backward compatible migration
> >> >> >     > path forward. Producers can start populating metadata fields
> >> >> >     > before all consumers understand the metadata structure. For
> >> >> >     > people who already have custom envelope structures, they can
> >> >> >     > populate their existing structure and the new structure for a
> >> >> >     > while as they make the transition.
> >> >> >     >
> >> >> >     > We could stop there and let the clients plug in a
> >> >> >     > KeySerializer, MetadataSerializer, and ValueSerializer but I
> >> >> >     > think it would also be useful to have a default
> >> >> >     > MetadataSerializer that implements a key-value model similar to
> >> >> >     > AMQP or HTTP headers.
> >> >> >     > Or we could go even further and prescribe a Map<String,
> >> >> >     > byte[]> or Map<String, String> data model for headers in the
> >> >> >     > clients (while still allowing custom serialization of the
> >> >> >     > header data model).
> >> >> >     >
> >> >> >     > I think this would address Radai's concerns:
> >> >> >     > 1. All client code would not need to be updated to know about
> >> >> >     >    the container.
> >> >> >     > 2. Middleware friendly clients would have a standard header
> >> >> >     >    data model to work with.
> >> >> >     > 3. KIP is required both b/c of broker changes and because of
> >> >> >     >    client API changes.
> >> >> >     >
> >> >> >     > Cheers,
> >> >> >     >
> >> >> >     > Roger
> >> >> >     >
> >> >> >     >
> >> >> >     > On Wed, Nov 2, 2016 at 4:38 PM, radai
> >> >> >     > <radai.rosenbl...@gmail.com> wrote:
> >> >> >     >
> >> >> >     > > my biggest issues with a "standard" wrapper format:
> >> >> >     > >
> >> >> >     > > 1. _ALL_ client _CODE_ (as opposed to kafka lib version) must
> >> >> >     > >    be updated to know about the container, because any old
> >> >> >     > >    naive code trying to directly deserialize its own payload
> >> >> >     > >    would keel over and die (it needs to know to deserialize a
> >> >> >     > >    container, and then dig in there for its payload).
> >> >> >     > > 2. in order to write middleware-friendly clients that utilize
> >> >> >     > >    such a container one would basically have to write their
> >> >> >     > >    own producer/consumer API on top of the open source kafka
> >> >> >     > >    one.
> >> >> >     > > 3. if you were going to go with a wrapper format you really
> >> >> >     > >    dont need to bother with a kip (just open source your own
> >> >> >     > >    client stack from #2 above so others could stop
> >> >> >     > >    re-inventing it)
> >> >> >     > >
> >> >> >     > > On Wed, Nov 2, 2016 at 4:25 PM, James Cheng
> >> >> >     > > <wushuja...@gmail.com> wrote:
> >> >> >     > >
> >> >> >     > > > How exactly would this work? Or maybe that's out of scope
> >> >> >     > > > for this email.
> >> >> >     > >
> >> >> >     >
> >> >> >
> >> >> >
> >> >> > The information contained in this email is strictly confidential and
> >> >> > for the use of the addressee only, unless otherwise indicated. If you
> >> >> > are not the intended recipient, please do not read, copy, use or
> >> >> > disclose to others this message or any attachment. Please also notify
> >> >> > the sender by replying to this email or by telephone (+44(020 7896
> >> >> > 0011) and then delete the email and any copies of it. Opinions,
> >> >> > conclusion (etc) that do not relate to the official business of this
> >> >> > company shall be understood as neither given nor endorsed by it. IG is
> >> >> > a trading name of IG Markets Limited (a company registered in England
> >> >> > and Wales, company number 04008957) and IG Index Limited (a company
> >> >> > registered in England and Wales, company number 01190902). Registered
> >> >> > address at Cannon Bridge House, 25 Dowgate Hill, London EC4R 2YA. Both
> >> >> > IG Markets Limited (register number 195355) and IG Index Limited
> >> >> > (register number 114059) are authorised and regulated by the Financial
> >> >> > Conduct Authority.
> >> >> >
> >> >>
> >>
> >>
> >>
> >> --
> >> Gwen Shapira
> >> Product Manager | Confluent
> >> 650.450.2760 | @gwenshap
> >> Follow us: Twitter | blog
> >>
> >
>
> --
> Gwen Shapira
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>

--
Nacho (Ignacio) Solis
Kafka
nso...@linkedin.com
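
For reference, the int-keyed Tag layout from Magnus's proposal quoted above (int16 Id, int16 Len, then Len bytes of opaque data) could be packed and unpacked roughly as follows. This is a sketch under stated assumptions, not the KIP's wire format: big-endian byte order is assumed, the Headers array is modelled as plain concatenated Tags with no count prefix, and `encode_tag`, `decode_tags` and the id-to-name mapping are hypothetical names. The reserved 0..999 range and the "silently ignore unmapped tags" rule are taken from the proposal.

```python
import struct

TAG_HEADER = struct.Struct(">hh")   # int16 Id, int16 Len (byte order assumed)
BROKER_RESERVED = range(0, 1000)    # ids 0..999 are reserved for the broker

def encode_tag(tag_id: int, data: bytes) -> bytes:
    """Pack one Tag; plugins may not use the broker-reserved id range."""
    if tag_id in BROKER_RESERVED:
        raise ValueError(f"id {tag_id} is reserved for the broker")
    return TAG_HEADER.pack(tag_id, len(data)) + data

def decode_tags(buf: bytes, id_to_name: dict) -> dict:
    """Walk concatenated Tags, keeping only ids the consumer has a mapping for."""
    headers = {}
    offset = 0
    while offset < len(buf):
        tag_id, length = TAG_HEADER.unpack_from(buf, offset)
        offset += TAG_HEADER.size
        if length < 0 or offset + length > len(buf):
            raise ValueError("truncated or malformed tag")
        data = buf[offset:offset + length]
        offset += length
        name = id_to_name.get(tag_id)
        if name is not None:        # unmapped tags are silently ignored
            headers[name] = data
    return headers

# Ids borrowed from the pseudo-config in the proposal.
buf = encode_tag(1001, b"orders") + encode_tag(1003, b"7")
print(decode_tags(buf, {1001: "tablename"}))   # {'tablename': b'orders'}
```

Each Tag costs 4 bytes of overhead here, the same as the 3 letter string key with a one-byte length mentioned at the top of this mail.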