Forgot to mention: Thank you for quantifying the trade-off - it is helpful and important regardless of what we end up deciding.
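For reference, the per-message budget in the benchmark discussion quoted below can be reproduced with a few lines of arithmetic. This is only a sketch: the function and variable names are made up for illustration, and the 0.5 microsecond string-key cost is inferred from the "0.5 microseconds left" figure rather than stated directly.

```python
def per_message_budget_us(messages_per_second):
    """Microseconds available per message at a given single-thread rate."""
    return 1_000_000 / messages_per_second


def remaining_us(messages_per_second, header_cost_us):
    """Budget left for everything else once header parsing is paid for."""
    return per_message_budget_us(messages_per_second) - header_cost_us


rate = 1_000_000          # 1M messages/s/thread, as in the quoted benchmark
int_cost_us = 2 ** -2     # about 0.25 us to parse one int-keyed header
string_cost_us = 0.5      # assumption: inferred from the 0.5 us remainder

for label, cost in (("int", int_cost_us), ("string", string_cost_us)):
    print(f"{label} keys: {cost:.2f} us parsing, "
          f"{remaining_us(rate, cost):.2f} us left per message")
```

At lower rates the same header cost takes a proportionally smaller share of the budget, which is the point both sides of the thread are weighing.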

On Tue, Nov 8, 2016 at 3:12 PM, Sean McCauliff <smccaul...@linkedin.com.invalid> wrote:

> On Tue, Nov 8, 2016 at 2:15 PM, Gwen Shapira <g...@confluent.io> wrote:
>
>> Since Kafka specifically targets high-throughput, low-latency
>> use-cases, I don't think we should trade them off that easily.
>
> I find these kinds of design goals not really helpful unless they are
> quantified in some way, because it's always possible to argue against
> something as either not being performant or being just an implementation
> detail.
>
> These are single-threaded benchmarks, so all the measurements are per
> thread.
>
> For 1M messages/s/thread, if header keys are int and you had even a single
> header key, value pair, then parsing still takes about 2^-2 (0.25)
> microseconds, which means you only have another 0.75 microseconds to do
> everything else you want to do with a message (1M messages/s means
> 1 microsecond per message). With string header keys there is still
> 0.5 microseconds to process a message.
>
>> I love strings as much as the next guy (we had them in Flume), but I
>> was convinced by Magnus/Michael/Radai that strings don't actually have
>> strong benefits as opposed to ints (you'll need a string registry
>> anyway - otherwise, how will you know what the "profile_id"
>> header refers to?) and I want to keep closer to our original design
>> goals for Kafka.
>
> "confluent.profile_id"
>
>> If someone likes strings in the headers and doesn't do millions of
>> messages a sec, they probably have lots of other systems they can use
>> instead.
>
> None of them will scale like Kafka. Horizontal scaling is still good.
>
>> On Tue, Nov 8, 2016 at 1:22 PM, Sean McCauliff
>> <smccaul...@linkedin.com.invalid> wrote:
>> > +1 for String keys.
>> >
>> > I've been doing some benchmarking and it seems like the speedup for using
>> > integer keys is about 2-5x depending on the length of the strings and what
>> > collections are being used.
>> > The overall amount of time spent parsing a set of header key, value
>> > pairs probably does not matter unless you are getting close to 1M
>> > messages/s per consumer, in which case you probably shouldn't use
>> > headers. There is also the option to use very short strings; some that
>> > are even shorter than integers.
>> >
>> > Partitioning the string key space will be easier than partitioning an
>> > integer key space. We won't need a global registry. Kafka internally can
>> > reserve some prefix like "_" as its namespace. Everyone else can use their
>> > company or project name as namespace prefix and life should be good.
>> >
>> > Here's the link to some of the benchmarking info:
>> > https://docs.google.com/document/d/1tfT-6SZdnKOLyWGDH82kS30PnUkmgb7nPLdw6p65pAI/edit?usp=sharing
>> >
>> > --
>> > Sean McCauliff
>> > Staff Software Engineer
>> > Kafka
>> >
>> > smccaul...@linkedin.com
>> > linkedin.com/in/sean-mccauliff-b563192
>> >
>> > On Mon, Nov 7, 2016 at 11:51 PM, Michael Pearce <michael.pea...@ig.com>
>> > wrote:
>> >
>> >> +1 on this slimmer version of our proposal
>> >>
>> >> I definitely think we can reduce the Id space from the proposed int32
>> >> (4 bytes) down to int16 (2 bytes). It saves space, and as these are
>> >> headers we wouldn't expect the number of headers in use concurrently to
>> >> be that high.
>> >>
>> >> I wonder whether we should still make the value byte array length int32,
>> >> though, as this is the standard max array length in Java. That said, it
>> >> is a header, and I guess limiting the size is sensible and would work for
>> >> all the use cases we have in mind, so I am happy with limiting this.
>> >>
>> >> Do people generally concur on Magnus's slimmer version? Does anyone see
>> >> any issues if we moved from int32 to int16?
>> >>
>> >> Re configurable ids per plugin over a global registry: that would also
>> >> work for us. As such, if this has better consensus than the proposed
>> >> global registry, I'd be happy to change that.
>> >>
>> >> I was already sold on ints over strings for keys ;)
>> >>
>> >> Cheers
>> >> Mike
>> >>
>> >> ________________________________________
>> >> From: Magnus Edenhill <mag...@edenhill.se>
>> >> Sent: Monday, November 7, 2016 10:10:21 PM
>> >> To: dev@kafka.apache.org
>> >> Subject: Re: [DISCUSS] KIP-82 - Add Record Headers
>> >>
>> >> Hi,
>> >>
>> >> I'm +1 for adding generic message headers, but I do share the concerns
>> >> previously aired on this thread and during the KIP meeting.
>> >>
>> >> So let me propose a slimmer alternative that does not require any sort of
>> >> global header registry, does not affect broker performance or operations,
>> >> and adds as little overhead as possible.
>> >>
>> >>
>> >> Message
>> >> ------------
>> >> The protocol Message type is extended with a Headers array consisting of
>> >> Tags, where a Tag is defined as:
>> >>    int16 Id
>> >>    int16 Len          // binary_data length
>> >>    binary_data[Len]   // opaque binary data
>> >>
>> >>
>> >> Ids
>> >> ---
>> >> The Id space is not centrally managed, so whenever an application needs to
>> >> add headers, or use an eco-system plugin that does, its Id allocation will
>> >> need to be manually configured.
>> >> This moves the allocation concern from the global space down to
>> >> organization level and avoids the risk of id conflicts.
>> >> Example pseudo-config for some app:
>> >>     sometrackerplugin.tag.sourcev3.id=1000
>> >>     dbthing.tag.tablename.id=1001
>> >>     myschemareg.tag.schemaname.id=1002
>> >>     myschemareg.tag.schemaversion.id=1003
>> >>
>> >>
>> >> Each header-writing or header-reading plugin must provide means (typically
>> >> through configuration) to specify the tag for each header it uses. Defaults
>> >> should be avoided.
>> >> A consumer silently ignores tags it does not have a mapping for (since the
>> >> binary_data can't be parsed without knowing what it is).
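
To illustrate the Tag layout and the "silently ignore unmapped ids" rule described above, here is a minimal Python sketch. It is not part of the proposal: the helper names are hypothetical, big-endian encoding is assumed because the Kafka protocol uses network byte order, and the framing of the surrounding Headers array (for example an element count) is left out.

```python
import struct

# Assumed layout per Tag: int16 Id, int16 Len, then Len bytes of opaque data.
TAG_HEADER = struct.Struct(">hh")


def encode_tags(tags):
    """Serialize an iterable of (tag_id, payload_bytes) pairs back to back."""
    out = bytearray()
    for tag_id, payload in tags:
        # struct raises an error if the id or length does not fit in int16.
        out += TAG_HEADER.pack(tag_id, len(payload))
        out += payload
    return bytes(out)


def decode_tags(buf, known_ids):
    """Return {tag_id: payload} for mapped ids, skipping every other tag."""
    found = {}
    offset = 0
    while offset < len(buf):
        tag_id, length = TAG_HEADER.unpack_from(buf, offset)
        offset += TAG_HEADER.size
        if tag_id in known_ids:
            found[tag_id] = buf[offset:offset + length]
        offset += length  # unknown tags are stepped over using Len alone
    return found


# Ids borrowed from the pseudo-config above; this consumer only maps 1001.
wire = encode_tags([(1000, b"src"), (1001, b"orders")])
print(decode_tags(wire, known_ids={1001}))
```

Because each Tag carries its own length, a reader can skip entries it has no mapping for without understanding their contents, which is what lets the broker treat the whole array as opaque data.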
>> >>
>> >> Id range 0..999 is reserved for future use by the broker and must not be
>> >> used by plugins.
>> >>
>> >>
>> >> Broker
>> >> ---------
>> >> The broker does not process the tags (other than the standard protocol
>> >> syntax verification), it simply stores and forwards them as opaque data.
>> >>
>> >> Standard message translation (removal of Headers) kicks in for older
>> >> clients.
>> >>
>> >>
>> >> Why not string ids?
>> >> -------------------------
>> >> String ids might seem like a good idea, but they:
>> >>  * do not really solve uniqueness
>> >>  * consume a lot of space (2 byte string length + string, per header) to
>> >>    be meaningful
>> >>  * don't really say anything about how to parse the tag's data, so they
>> >>    are in effect useless on their own.
>> >>
>> >>
>> >> Regards,
>> >> Magnus
>> >>
>> >>
>> >> 2016-11-07 18:32 GMT+01:00 Michael Pearce <michael.pea...@ig.com>:
>> >>
>> >> > Hi Roger,
>> >> >
>> >> > Thanks for the support.
>> >> >
>> >> > I think the key thing is to have a common key space to make an
>> >> > ecosystem; there does have to be some level of contract for people to
>> >> > play nicely.
>> >> >
>> >> > Having map<String, byte[]> or, as currently proposed in the KIP, a
>> >> > numerical key space of map<int, byte[]> is a level of contract that
>> >> > most people would expect.
>> >> >
>> >> > I think the example someone else linked to in a previous comment, the
>> >> > AWS blog and its implemented API, where originally they didn't have a
>> >> > header space but now they do, and where keys are uniform but the value
>> >> > can be a string, int, anything, is a good example.
>> >> >
>> >> > Having a custom MetadataSerializer is something we had played with but
>> >> > discounted: if you want everyone to work the same way in the
>> >> > ecosystem, having this also be customizable makes it a bit harder.
>> >> > Think about making the whole message record custom serializable: this
>> >> > would be fairly tricky (though not impossible) to make work nicely.
>> >> > Having the value customizable is, we thought, a reasonable tradeoff
>> >> > here between flexibility and a contract of interaction between
>> >> > different parties.
>> >> >
>> >> > Is there a particular case or benefit of having serialization
>> >> > customizable that you have in mind?
>> >> >
>> >> > That said, it is obviously something that could be implemented if there
>> >> > is a need. If we did go down this avenue, I think a default serializer
>> >> > implementation should exist so that, per the 80:20 rule, people can
>> >> > just have the broker and clients get default behavior.
>> >> >
>> >> > Cheers
>> >> > Mike
>> >> >
>> >> > On 11/6/16, 5:25 PM, "radai" <radai.rosenbl...@gmail.com> wrote:
>> >> >
>> >> >     making header _key_ serialization configurable potentially undermines
>> >> >     the broad usefulness of the feature (any point along the path must be
>> >> >     able to read the header keys. the values may be whatever and require
>> >> >     more intimate knowledge of the code that produced specific headers,
>> >> >     but keys should be universally readable).
>> >> >
>> >> >     it would also make it hard to write really portable plugins - say i
>> >> >     wrote a large message splitter/combiner - if i rely on key
>> >> >     "largeMessage" and values of the form "1/20" someone who uses
>> >> >     (contrived example) Map<Byte[], Double> wouldn't be able to re-use my
>> >> >     code.
>> >> >
>> >> >     not the end of the world within an organization, but problematic if
>> >> >     you want to enable an ecosystem
>> >> >
>> >> >     On Thu, Nov 3, 2016 at 2:04 PM, Roger Hoover <roger.hoo...@gmail.com>
>> >> >     wrote:
>> >> >
>> >> >     > As others have laid out, I see strong reasons for a common message
>> >> >     > metadata structure for the Kafka ecosystem. In particular, I've
>> >> >     > seen that even within a single organization, infrastructure teams
>> >> >     > often own the message metadata while application teams own the
>> >> >     > application-level data format. Allowing metadata and content to
>> >> >     > have different structure and evolve separately is very helpful for
>> >> >     > this. Also, I think there's a lot of value to having a common
>> >> >     > metadata structure shared across the Kafka ecosystem so that tools
>> >> >     > which leverage metadata can more easily be shared across
>> >> >     > organizations and integrated together.
>> >> >     >
>> >> >     > The question is, where does the metadata structure belong? Here's
>> >> >     > my take:
>> >> >     >
>> >> >     > We change the Kafka wire and on-disk format from a (key, value)
>> >> >     > model to a (key, metadata, value) model where all three are byte
>> >> >     > arrays from the broker's point of view. The primary reason for this
>> >> >     > is that it provides a backward compatible migration path forward.
>> >> >     > Producers can start populating metadata fields before all consumers
>> >> >     > understand the metadata structure. For people who already have
>> >> >     > custom envelope structures, they can populate their existing
>> >> >     > structure and the new structure for a while as they make the
>> >> >     > transition.
>> >> >     >
>> >> >     > We could stop there and let the clients plug in a KeySerializer,
>> >> >     > MetadataSerializer, and ValueSerializer but I think it would also
>> >> >     > be useful to have a default MetadataSerializer that implements a
>> >> >     > key-value model similar to AMQP or HTTP headers. Or we could go
>> >> >     > even further and prescribe a Map<String, byte[]> or
>> >> >     > Map<String, String> data model for headers in the clients (while
>> >> >     > still allowing custom serialization of the header data model).
>> >> >     >
>> >> >     > I think this would address Radai's concerns:
>> >> >     > 1. All client code would not need to be updated to know about the
>> >> >     > container.
>> >> >     > 2. Middleware friendly clients would have a standard header data
>> >> >     > model to work with.
>> >> >     > 3. KIP is required both b/c of broker changes and because of client
>> >> >     > API changes.
>> >> >     >
>> >> >     > Cheers,
>> >> >     >
>> >> >     > Roger
>> >> >     >
>> >> >     >
>> >> >     > On Wed, Nov 2, 2016 at 4:38 PM, radai <radai.rosenbl...@gmail.com>
>> >> >     > wrote:
>> >> >     >
>> >> >     > > my biggest issues with a "standard" wrapper format:
>> >> >     > >
>> >> >     > > 1. _ALL_ client _CODE_ (as opposed to kafka lib version) must be
>> >> >     > > updated to know about the container, because any old naive code
>> >> >     > > trying to directly deserialize its own payload would keel over
>> >> >     > > and die (it needs to know to deserialize a container, and then
>> >> >     > > dig in there for its payload).
>> >> >     > > 2. in order to write middleware-friendly clients that utilize such
>> >> >     > > a container one would basically have to write their own
>> >> >     > > producer/consumer API on top of the open source kafka one.
>> >> >     > > 3. if you were going to go with a wrapper format you really don't
>> >> >     > > need to bother with a kip (just open source your own client stack
>> >> >     > > from #2 above so others could stop re-inventing it)
>> >> >     > >
>> >> >     > > On Wed, Nov 2, 2016 at 4:25 PM, James Cheng <wushuja...@gmail.com>
>> >> >     > > wrote:
>> >> >     > >
>> >> >     > > > How exactly would this work? Or maybe that's out of scope for
>> >> >     > > > this email.
>> >> >
>> >> > The information contained in this email is strictly confidential and for
>> >> > the use of the addressee only, unless otherwise indicated. If you are not
>> >> > the intended recipient, please do not read, copy, use or disclose to
>> >> > others this message or any attachment. Please also notify the sender by
>> >> > replying to this email or by telephone (+44(020 7896 0011) and then
>> >> > delete the email and any copies of it. Opinions, conclusion (etc) that do
>> >> > not relate to the official business of this company shall be understood
>> >> > as neither given nor endorsed by it. IG is a trading name of IG Markets
>> >> > Limited (a company registered in England and Wales, company number
>> >> > 04008957) and IG Index Limited (a company registered in England and
>> >> > Wales, company number 01190902). Registered address at Cannon Bridge
>> >> > House, 25 Dowgate Hill, London EC4R 2YA. Both IG Markets Limited
>> >> > (register number 195355) and IG Index Limited (register number 114059)
>> >> > are authorised and regulated by the Financial Conduct Authority.

--
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog