Since Kafka specifically targets high-throughput, low-latency use-cases, I don't think we should trade them off that easily.
I love strings as much as the next guy (we had them in Flume), but I was
convinced by Magnus/Michael/Radai that strings don't actually have strong
benefits over ints (you'll need a string registry anyway - otherwise, how
will you know what the "profile_id" header refers to?) and I want to keep
closer to our original design goals for Kafka. If someone likes strings in
the headers and doesn't do millions of messages a sec, they probably have
lots of other systems they can use instead.

On Tue, Nov 8, 2016 at 1:22 PM, Sean McCauliff <smccaul...@linkedin.com.invalid> wrote:

> +1 for String keys.
>
> I've been doing some benchmarking and it seems like the speedup for using
> integer keys is about 2-5x depending on the length of the strings and what
> collections are being used. The overall amount of time spent parsing a set
> of header key, value pairs probably does not matter unless you are getting
> close to 1M messages per consumer. In which case probably don't use
> headers. There is also the option to use very short strings; some that are
> even shorter than integers.
>
> Partitioning the string key space will be easier than partitioning an
> integer key space. We won't need a global registry. Kafka internally can
> reserve some prefix like "_" as its namespace. Everyone else can use their
> company or project name as namespace prefix and life should be good.
>
> Here's the link to some of the benchmarking info:
> https://docs.google.com/document/d/1tfT-6SZdnKOLyWGDH82kS30PnUkmgb7nPLdw6p65pAI/edit?usp=sharing
>
> --
> Sean McCauliff
> Staff Software Engineer
> Kafka
>
> smccaul...@linkedin.com
> linkedin.com/in/sean-mccauliff-b563192
>
> On Mon, Nov 7, 2016 at 11:51 PM, Michael Pearce <michael.pea...@ig.com>
> wrote:
>
>> +1 on this slimmer version of our proposal
>>
>> I definitely think we can reduce the Id space from the proposed int32
>> (4 bytes) down to int16 (2 bytes). It saves on space, and we wouldn't
>> expect the number of headers being used concurrently to be that high.
>>
>> I do wonder if we should still make the value byte array length int32,
>> though, as this is the standard max array length in Java. That said, it
>> is a header, and I guess limiting the size is sensible and would work for
>> all the use cases we have in mind, so I'm happy with limiting this.
>>
>> Do people generally concur on Magnus's slimmer version? Anyone see any
>> issues if we moved from int32 to int16?
>>
>> Re configurable ids per plugin over a global registry: that would also
>> work for us. As such, if this has better consensus than the proposed
>> global registry, I'd be happy to change that.
>>
>> I was already sold on ints over strings for keys ;)
>>
>> Cheers
>> Mike
>>
>> ________________________________________
>> From: Magnus Edenhill <mag...@edenhill.se>
>> Sent: Monday, November 7, 2016 10:10:21 PM
>> To: dev@kafka.apache.org
>> Subject: Re: [DISCUSS] KIP-82 - Add Record Headers
>>
>> Hi,
>>
>> I'm +1 for adding generic message headers, but I do share the concerns
>> previously aired on this thread and during the KIP meeting.
>>
>> So let me propose a slimmer alternative that does not require any sort of
>> global header registry, does not affect broker performance or operations,
>> and adds as little overhead as possible.
>>
>> Message
>> ------------
>> The protocol Message type is extended with a Headers array consisting of
>> Tags, where a Tag is defined as:
>>   int16 Id
>>   int16 Len          // binary_data length
>>   binary_data[Len]   // opaque binary data
>>
>>
>> Ids
>> ---
>> The Id space is not centrally managed, so whenever an application needs to
>> add headers, or use an eco-system plugin that does, its Id allocation will
>> need to be manually configured.
>> This moves the allocation concern from the global space down to
>> organization level and avoids the risk for id conflicts.
>> Example pseudo-config for some app:
>>   sometrackerplugin.tag.sourcev3.id=1000
>>   dbthing.tag.tablename.id=1001
>>   myschemareg.tag.schemaname.id=1002
>>   myschemareg.tag.schemaversion.id=1003
>>
>> Each header-writing or header-reading plugin must provide means (typically
>> through configuration) to specify the tag for each header it uses. Defaults
>> should be avoided.
>> A consumer silently ignores tags it does not have a mapping for (since the
>> binary_data can't be parsed without knowing what it is).
>>
>> Id range 0..999 is reserved for future use by the broker and must not be
>> used by plugins.
>>
>>
>> Broker
>> ---------
>> The broker does not process the tags (other than the standard protocol
>> syntax verification), it simply stores and forwards them as opaque data.
>>
>> Standard message translation (removal of Headers) kicks in for older
>> clients.
>>
>>
>> Why not string ids?
>> -------------------------
>> String ids might seem like a good idea, but they:
>>  * do not really solve uniqueness
>>  * consume a lot of space (2 byte string length + string, per header) to
>>    be meaningful
>>  * don't really say anything about how to parse the tag's data, so they
>>    are in effect useless on their own.
>>
>>
>> Regards,
>> Magnus
>>
>>
>> 2016-11-07 18:32 GMT+01:00 Michael Pearce <michael.pea...@ig.com>:
>>
>> > Hi Roger,
>> >
>> > Thanks for the support.
>> >
>> > I think the key thing is to have a common key space to make an
>> > ecosystem; there does have to be some level of contract for people to
>> > play nicely.
>> >
>> > Having map<String, byte[]> or, as currently proposed in the KIP, a
>> > numerical key space of map<int, byte[]> is a level of contract that
>> > most people would expect.
>> >
>> > I think the example someone else linked to in a previous comment, the
>> > AWS blog and the API they implemented, where originally they didn’t
>> > have a header space but now they do, and where keys are uniform but the
>> > value can be string, int, anything, is a good example.
>> >
>> > Having a custom MetadataSerializer is something we had played with, but
>> > we discounted the idea: if you want everyone to work the same way in the
>> > ecosystem, having this also customizable makes it a bit harder.
>> > Think about making the whole message record custom serializable; this
>> > would be fairly tricky (though not impossible) to make work nicely.
>> > Having the value customizable is, we thought, a reasonable tradeoff here
>> > of flexibility over contract of interaction between different parties.
>> >
>> > Is there a particular case or benefit of having serialization
>> > customizable that you have in mind?
>> >
>> > Saying this, it is obviously something that could be implemented if
>> > there is a need. If we did go down this avenue, I think a default
>> > serializer implementation should exist so that, per the 80:20 rule,
>> > people can just have the broker and clients get default behavior.
>> >
>> > Cheers
>> > Mike
>> >
>> > On 11/6/16, 5:25 PM, "radai" <radai.rosenbl...@gmail.com> wrote:
>> >
>> > making header _key_ serialization configurable potentially undermines
>> > the broad usefulness of the feature (any point along the path must be
>> > able to read the header keys. the values may be whatever and require
>> > more intimate knowledge of the code that produced specific headers, but
>> > keys should be universally readable).
>> >
>> > it would also make it hard to write really portable plugins - say i
>> > wrote a large message splitter/combiner - if i rely on key
>> > "largeMessage" and values of the form "1/20" someone who uses
>> > (contrived example) Map<Byte[], Double> wouldnt be able to re-use my
>> > code.
>> >
>> > not the end of the world within an organization, but problematic if you
>> > want to enable an ecosystem
>> >
>> > On Thu, Nov 3, 2016 at 2:04 PM, Roger Hoover <roger.hoo...@gmail.com>
>> > wrote:
>> >
>> > > As others have laid out, I see strong reasons for a common message
>> > > metadata structure for the Kafka ecosystem. In particular, I've seen
>> > > that even within a single organization, infrastructure teams often
>> > > own the message metadata while application teams own the
>> > > application-level data format. Allowing metadata and content to have
>> > > different structure and evolve separately is very helpful for this.
>> > > Also, I think there's a lot of value to having a common metadata
>> > > structure shared across the Kafka ecosystem so that tools which
>> > > leverage metadata can more easily be shared across organizations and
>> > > integrated together.
>> > >
>> > > The question is, where does the metadata structure belong? Here's my
>> > > take:
>> > >
>> > > We change the Kafka wire and on-disk format from a (key, value) model
>> > > to a (key, metadata, value) model where all three are byte arrays
>> > > from the broker's point of view. The primary reason for this is that
>> > > it provides a backward compatible migration path forward. Producers
>> > > can start populating metadata fields before all consumers understand
>> > > the metadata structure.
>> > > For people who already have custom envelope structures, they can
>> > > populate their existing structure and the new structure for a while
>> > > as they make the transition.
>> > >
>> > > We could stop there and let the clients plug in a KeySerializer,
>> > > MetadataSerializer, and ValueSerializer but I think it would also be
>> > > useful to have a default MetadataSerializer that implements a
>> > > key-value model similar to AMQP or HTTP headers. Or we could go even
>> > > further and prescribe a Map<String, byte[]> or Map<String, String>
>> > > data model for headers in the clients (while still allowing custom
>> > > serialization of the header data model).
>> > >
>> > > I think this would address Radai's concerns:
>> > > 1. All client code would not need to be updated to know about the
>> > >    container.
>> > > 2. Middleware friendly clients would have a standard header data
>> > >    model to work with.
>> > > 3. KIP is required both b/c of broker changes and because of client
>> > >    API changes.
>> > >
>> > > Cheers,
>> > >
>> > > Roger
>> > >
>> > >
>> > > On Wed, Nov 2, 2016 at 4:38 PM, radai <radai.rosenbl...@gmail.com>
>> > > wrote:
>> > >
>> > > > my biggest issues with a "standard" wrapper format:
>> > > >
>> > > > 1. _ALL_ client _CODE_ (as opposed to kafka lib version) must be
>> > > >    updated to know about the container, because any old naive code
>> > > >    trying to directly deserialize its own payload would keel over
>> > > >    and die (it needs to know to deserialize a container, and then
>> > > >    dig in there for its payload).
>> > > > 2. in order to write middleware-friendly clients that utilize such
>> > > >    a container one would basically have to write their own
>> > > >    producer/consumer API on top of the open source kafka one.
>> > > > 3. if you were going to go with a wrapper format you really dont
>> > > >    need to bother with a kip (just open source your own client
>> > > >    stack from #2 above so others could stop re-inventing it)
>> > > >
>> > > > On Wed, Nov 2, 2016 at 4:25 PM, James Cheng <wushuja...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > How exactly would this work? Or maybe that's out of scope for
>> > > > > this email.
>> >
>> > The information contained in this email is strictly confidential and for
>> > the use of the addressee only, unless otherwise indicated. If you are not
>> > the intended recipient, please do not read, copy, use or disclose to
>> > others this message or any attachment. Please also notify the sender by
>> > replying to this email or by telephone (+44(020 7896 0011) and then
>> > delete the email and any copies of it. Opinions, conclusion (etc) that do
>> > not relate to the official business of this company shall be understood
>> > as neither given nor endorsed by it. IG is a trading name of IG Markets
>> > Limited (a company registered in England and Wales, company number
>> > 04008957) and IG Index Limited (a company registered in England and
>> > Wales, company number 01190902). Registered address at Cannon Bridge
>> > House, 25 Dowgate Hill, London EC4R 2YA. Both IG Markets Limited
>> > (register number 195355) and IG Index Limited (register number 114059)
>> > are authorised and regulated by the Financial Conduct Authority.

--
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
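
For anyone following along, here is a rough sketch of the Tag layout from
Magnus's proposal quoted above (int16 Id, int16 Len, then Len bytes of opaque
data), together with the "consumer silently ignores tags it has no mapping
for" rule. This is only an illustration in Python, not anything from the KIP:
the function names encode_tags and decode_tags, the big-endian byte order, and
rejecting reserved ids on the writer side are all assumptions, and the framing
of the enclosing Headers array (for example a count prefix) is left out
because the proposal does not spell it out.

```python
import struct

# Ids 0..999 are reserved for the broker in the proposal.
RESERVED_MAX_ID = 999
INT16_MAX = 0x7FFF


def encode_tags(tags):
    """Serialize (tag_id, payload) pairs as: int16 Id, int16 Len, payload.

    Big-endian byte order is an assumption made for this sketch.
    """
    out = bytearray()
    for tag_id, payload in tags:
        if not RESERVED_MAX_ID < tag_id <= INT16_MAX:
            raise ValueError(f"tag id {tag_id} is reserved or out of int16 range")
        if len(payload) > INT16_MAX:
            raise ValueError("payload does not fit an int16 length")
        out += struct.pack(">hh", tag_id, len(payload))
        out += payload
    return bytes(out)


def decode_tags(buf, known_ids):
    """Parse tags from buf and return {name: payload} for configured ids.

    known_ids maps tag id -> name, mirroring the per-plugin configuration
    in the proposal. Tags without a mapping are skipped silently.
    """
    headers = {}
    pos = 0
    while pos < len(buf):
        tag_id, length = struct.unpack_from(">hh", buf, pos)
        pos += 4
        if length < 0 or pos + length > len(buf):
            raise ValueError("truncated or malformed tag")
        payload = buf[pos:pos + length]
        pos += length
        name = known_ids.get(tag_id)
        if name is not None:
            headers[name] = payload
    return headers


# Hypothetical id mapping, loosely following the pseudo-config in the proposal.
ids = {1001: "dbthing.tablename", 1002: "myschemareg.schemaname"}
wire = encode_tags([(1001, b"orders"), (1003, b"\x00\x07")])
headers = decode_tags(wire, ids)  # the tag with id 1003 has no mapping and is dropped
```

In this sketch each tag carries 4 bytes of framing (2 for Id, 2 for Len),
whereas a length-prefixed string key would cost 2 bytes plus the key text
before the value even starts.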