Re: [DISCUSS] KIP-82 - Add Record Headers

Gwen Shapira Tue, 08 Nov 2016 14:16:29 -0800

Since Kafka specifically targets high-throughput, low-latency
use-cases, I don't think we should trade them off that easily.


I love strings as much as the next guy (we had them in Flume), but I
was convinced by Magnus/Michael/Radai that strings don't actually have
strong benefits as opposed to ints (you'll need a string registry
anyway - otherwise, how will you know what the "profile_id"
header refers to?) and I want to keep closer to our original design
goals for Kafka.

If someone likes strings in the headers and doesn't do millions of
messages a sec, they probably have lots of other systems they can use
instead.


On Tue, Nov 8, 2016 at 1:22 PM, Sean McCauliff
<smccaul...@linkedin.com.invalid> wrote:
> +1 for String keys.
>
> I've been doing some benchmarking and it seems like the speedup from using
> integer keys is about 2-5x, depending on the length of the strings and what
> collections are being used.  The overall amount of time spent parsing a set
> of header key/value pairs probably does not matter unless you are getting
> close to 1M messages per consumer, in which case you probably shouldn't use
> headers.  There is also the option to use very short strings, some of which
> are even shorter than integers.
>
> Partitioning the string key space will be easier than partitioning an
> integer key space. We won't need a global registry.  Kafka internally can
> reserve some prefix like "_" as its namespace.  Everyone else can use their
> company or project name as namespace prefix and life should be good.
>
> Here's the link to some of the benchmarking info:
> https://docs.google.com/document/d/1tfT-6SZdnKOLyWGDH82kS30PnUkmgb7nPLdw6p65pAI/edit?usp=sharing
>
>
>
> --
> Sean McCauliff
> Staff Software Engineer
> Kafka
>
> smccaul...@linkedin.com
> linkedin.com/in/sean-mccauliff-b563192
>
> On Mon, Nov 7, 2016 at 11:51 PM, Michael Pearce <michael.pea...@ig.com>
> wrote:
>
>> +1 on this slimmer version of our proposal
>>
>> I definitely think we can reduce the Id space from the proposed int32
>> (4 bytes) down to int16 (2 bytes). It saves space, and we wouldn't expect
>> the number of headers in use concurrently to be that high.
>>
>> I do wonder whether the value byte array length should still be int32,
>> though, as this is the standard max array length in Java. That said, it is
>> a header, and I guess limiting the size is sensible and would work for all
>> the use cases we have in mind, so I'm happy with limiting this.
>>
>> Do people generally concur on Magnus's slimmer version? Anyone see any
>> issues if we moved from int32 to int16?
>>
>> Re configurable ids per plugin rather than a global registry: that would
>> also work for us.  If this has better consensus than the proposed global
>> registry, I'd be happy to change that.
>>
>> I was already sold on ints over strings for keys ;)
>>
>> Cheers
>> Mike
>>
>> ________________________________________
>> From: Magnus Edenhill <mag...@edenhill.se>
>> Sent: Monday, November 7, 2016 10:10:21 PM
>> To: dev@kafka.apache.org
>> Subject: Re: [DISCUSS] KIP-82 - Add Record Headers
>>
>> Hi,
>>
>> I'm +1 for adding generic message headers, but I do share the concerns
>> previously aired on this thread and during the KIP meeting.
>>
>> So let me propose a slimmer alternative that does not require any sort of
>> global header registry, does not affect broker performance or operations,
>> and adds as little overhead as possible.
>>
>>
>> Message
>> ------------
>> The protocol Message type is extended with a Headers array consisting of
>> Tags, where a Tag is defined as:
>>    int16 Id
>>    int16 Len           // binary_data length
>>    binary_data[Len]    // opaque binary data
>>
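
The Tag layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: big-endian integers (as elsewhere in the Kafka protocol), no array count prefix, and hypothetical function names `encode_headers` / `decode_headers` that are not part of any Kafka client.

```python
import struct

TAG_HEADER = ">hh"  # int16 Id, int16 Len, big-endian (assumed byte order)
TAG_HEADER_SIZE = struct.calcsize(TAG_HEADER)  # 4 bytes per tag


def encode_headers(tags):
    """Serialize (tag_id, data) pairs as consecutive Tag structures."""
    out = bytearray()
    for tag_id, data in tags:
        # struct.pack raises struct.error if Id or Len does not fit in int16
        out += struct.pack(TAG_HEADER, tag_id, len(data))
        out += data  # opaque binary data, stored as-is
    return bytes(out)


def decode_headers(buf):
    """Parse consecutive Tag structures back into (tag_id, data) pairs."""
    tags = []
    pos = 0
    while pos < len(buf):
        tag_id, length = struct.unpack_from(TAG_HEADER, buf, pos)
        pos += TAG_HEADER_SIZE
        tags.append((tag_id, buf[pos:pos + length]))
        pos += length
    return tags
```

Each tag costs 4 bytes of framing plus its payload, which is the "as little overhead as possible" point made above.
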
>>
>> Ids
>> ---
>> The Id space is not centrally managed, so whenever an application needs to
>> add headers, or use an eco-system plugin that does, its Id allocation will
>> need to be manually configured.
>> This moves the allocation concern from the global space down to
>> organization level and avoids the risk of id conflicts.
>> Example pseudo-config for some app:
>>     sometrackerplugin.tag.sourcev3.id=1000
>>     dbthing.tag.tablename.id=1001
>>     myschemareg.tag.schemaname.id=1002
>>     myschemareg.tag.schemaversion.id=1003
>>
>>
>> Each header-writing or header-reading plugin must provide a means (typically
>> through configuration) to specify the tag for each header it uses. Defaults
>> should be avoided.
>> A consumer silently ignores tags it does not have a mapping for (since the
>> binary_data can't be parsed without knowing what it is).
>>
>> Id range 0..999 is reserved for future use by the broker and must not be
>> used by plugins.
>>
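
A rough sketch of how a plugin might apply these rules, assuming the pseudo-config keys shown above arrive as a plain dict of strings. The names `load_tag_ids` and `known_headers` are hypothetical, and treating 1000..32767 as the usable plugin range is an assumption derived from the int16 Id and the reserved 0..999 range.

```python
RESERVED_MAX_ID = 999   # ids 0..999 are reserved for the broker in this proposal
INT16_MAX = 32767       # largest value an int16 Id can hold


def load_tag_ids(config):
    """Build an id -> config-name map, rejecting reserved or duplicate ids."""
    id_to_name = {}
    for name, value in config.items():
        tag_id = int(value)
        if not RESERVED_MAX_ID < tag_id <= INT16_MAX:
            raise ValueError(f"{name}: id {tag_id} is outside 1000..{INT16_MAX}")
        if tag_id in id_to_name:
            raise ValueError(f"{name}: id {tag_id} already used by {id_to_name[tag_id]}")
        id_to_name[tag_id] = name
    return id_to_name


def known_headers(tags, id_to_name):
    """Keep only tags that have a mapping; unknown ids are silently ignored."""
    return {id_to_name[i]: data for i, data in tags if i in id_to_name}


config = {
    "sometrackerplugin.tag.sourcev3.id": "1000",
    "dbthing.tag.tablename.id": "1001",
}
ids = load_tag_ids(config)
headers = known_headers([(1000, b"v3"), (4242, b"???")], ids)
```

Here the tag with id 4242 has no mapping, so it is dropped without an error, matching the "silently ignores" behaviour described above.
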
>>
>>
>> Broker
>> ---------
>> The broker does not process the tags (other than the standard protocol
>> syntax verification), it simply stores and forwards them as opaque data.
>>
>> Standard message translation (removal of Headers) kicks in for older
>> clients.
>>
>>
>> Why not string ids?
>> -------------------------
>> String ids might seem like a good idea, but they:
>>  * do not really solve uniqueness
>>  * consume a lot of space (2 byte string length + string, per header) to
>>    be meaningful
>>  * don't really say anything about how to parse the tag's data, so they
>>    are in effect useless on their own.
>>
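
To put rough numbers on the space argument, here is a small sketch comparing per-header key cost. It assumes string ids are UTF-8 encoded behind a 2-byte length prefix, as the bullet above describes; the function names are illustrative only.

```python
INT16_SIZE = 2  # bytes


def int_key_size():
    """Per-header key cost with an int16 Id."""
    return INT16_SIZE


def string_key_size(key):
    """Per-header key cost with a string id: length prefix + UTF-8 bytes (assumed encoding)."""
    return INT16_SIZE + len(key.encode("utf-8"))


for key in ("profile_id", "myschemareg.schemaversion"):
    print(key, string_key_size(key), "bytes vs", int_key_size(), "bytes")
```

A namespaced key such as "myschemareg.schemaversion" pays its full length on every message, while an int16 Id stays at 2 bytes regardless of how descriptive the configured name is.
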
>>
>> Regards,
>> Magnus
>>
>>
>>
>>
>> 2016-11-07 18:32 GMT+01:00 Michael Pearce <michael.pea...@ig.com>:
>>
>> > Hi Roger,
>> >
>> > Thanks for the support.
>> >
>> > I think the key thing is to have a common key space to make an ecosystem;
>> > there does have to be some level of contract for people to play nicely.
>> >
>> > Having map<String, byte[]> or, as currently proposed in the KIP, a
>> > numerical key space of map<int, byte[]> is a level of contract that
>> > most people would expect.
>> >
>> > I think the example someone else made in a previous comment, linking to
>> > the AWS blog and the implemented API, where originally they didn’t have a
>> > header space but now they do, and where keys are uniform but the value can
>> > be string, int, anything, is a good example.
>> >
>> > Having a custom MetadataSerializer is something we had played with, but
>> > we discounted the idea: if you want everyone to work the same way in the
>> > ecosystem, having this also customizable makes it a bit harder. Think
>> > about making the whole message record custom serializable; this would be
>> > fairly tricky (though not impossible) to make work nicely. Having the
>> > value customizable is, we thought, a reasonable tradeoff here between
>> > flexibility and a contract of interaction between different parties.
>> >
>> > Is there a particular case or benefit of having serialization
>> > customizable that you have in mind?
>> >
>> > That said, it is obviously something that could be implemented if there
>> > is a need. If we did go down this avenue, I think a default serializer
>> > implementation should exist so that, per the 80:20 rule, people can just
>> > have the broker and clients get default behavior.
>> >
>> > Cheers
>> > Mike
>> >
>> > On 11/6/16, 5:25 PM, "radai" <radai.rosenbl...@gmail.com> wrote:
>> >
>> >     making header _key_ serialization configurable potentially undermines
>> >     the broad usefulness of the feature (any point along the path must be
>> >     able to read the header keys. the values may be whatever and require
>> >     more intimate knowledge of the code that produced specific headers,
>> >     but keys should be universally readable).
>> >
>> >     it would also make it hard to write really portable plugins - say i
>> >     wrote a large message splitter/combiner - if i rely on key
>> >     "largeMessage" and values of the form "1/20" someone who uses
>> >     (contrived example) Map<Byte[], Double> wouldn't be able to re-use my
>> >     code.
>> >
>> >     not the end of the world within an organization, but problematic if
>> >     you want to enable an ecosystem
>> >
>> >     On Thu, Nov 3, 2016 at 2:04 PM, Roger Hoover <roger.hoo...@gmail.com>
>> >     wrote:
>> >
>> >     > As others have laid out, I see strong reasons for a common message
>> >     > metadata structure for the Kafka ecosystem.  In particular, I've
>> >     > seen that even within a single organization, infrastructure teams
>> >     > often own the message metadata while application teams own the
>> >     > application-level data format.  Allowing metadata and content to
>> >     > have different structure and evolve separately is very helpful for
>> >     > this.  Also, I think there's a lot of value to having a common
>> >     > metadata structure shared across the Kafka ecosystem so that tools
>> >     > which leverage metadata can more easily be shared across
>> >     > organizations and integrated together.
>> >     >
>> >     > The question is, where does the metadata structure belong?  Here's
>> >     > my take:
>> >     >
>> >     > We change the Kafka wire and on-disk format from a (key, value)
>> >     > model to a (key, metadata, value) model where all three are byte
>> >     > arrays from the broker's point of view.  The primary reason for this
>> >     > is that it provides a backward compatible migration path forward.
>> >     > Producers can start populating metadata fields before all consumers
>> >     > understand the metadata structure.  For people who already have
>> >     > custom envelope structures, they can populate their existing
>> >     > structure and the new structure for a while as they make the
>> >     > transition.
>> >     >
>> >     > We could stop there and let the clients plug in a KeySerializer,
>> >     > MetadataSerializer, and ValueSerializer, but I think it would also
>> >     > be useful to have a default MetadataSerializer that implements a
>> >     > key-value model similar to AMQP or HTTP headers.  Or we could go
>> >     > even further and prescribe a Map<String, byte[]> or
>> >     > Map<String, String> data model for headers in the clients (while
>> >     > still allowing custom serialization of the header data model).
>> >     >
>> >     > I think this would address Radai's concerns:
>> >     > 1. All client code would not need to be updated to know about the
>> >     > container.
>> >     > 2. Middleware friendly clients would have a standard header data
>> >     > model to work with.
>> >     > 3. KIP is required both b/c of broker changes and because of client
>> >     > API changes.
>> >     >
>> >     > Cheers,
>> >     >
>> >     > Roger
>> >     >
>> >     >
>> >     > On Wed, Nov 2, 2016 at 4:38 PM, radai <radai.rosenbl...@gmail.com>
>> >     > wrote:
>> >     >
>> >     > > my biggest issues with a "standard" wrapper format:
>> >     > >
>> >     > > 1. _ALL_ client _CODE_ (as opposed to kafka lib version) must be
>> >     > > updated to know about the container, because any old naive code
>> >     > > trying to directly deserialize its own payload would keel over and
>> >     > > die (it needs to know to deserialize a container, and then dig in
>> >     > > there for its payload).
>> >     > > 2. in order to write middleware-friendly clients that utilize such
>> >     > > a container one would basically have to write their own
>> >     > > producer/consumer API on top of the open source kafka one.
>> >     > > 3. if you were going to go with a wrapper format you really don't
>> >     > > need to bother with a kip (just open source your own client stack
>> >     > > from #2 above so others could stop re-inventing it)
>> >     > >
>> >     > > On Wed, Nov 2, 2016 at 4:25 PM, James Cheng <wushuja...@gmail.com>
>> >     > > wrote:
>> >     > >
>> >     > > > How exactly would this work? Or maybe that's out of scope for
>> >     > > > this email.
>> >     > >
>> >     >
>> >
>> >
>> > The information contained in this email is strictly confidential and for
>> > the use of the addressee only, unless otherwise indicated. If you are not
>> > the intended recipient, please do not read, copy, use or disclose to
>> > others this message or any attachment. Please also notify the sender by
>> > replying to this email or by telephone (+44(020 7896 0011) and then delete
>> > the email and any copies of it. Opinions, conclusion (etc) that do not
>> > relate to the official business of this company shall be understood as
>> > neither given nor endorsed by it. IG is a trading name of IG Markets
>> > Limited (a company registered in England and Wales, company number
>> > 04008957) and IG Index Limited (a company registered in England and Wales,
>> > company number 01190902). Registered address at Cannon Bridge House, 25
>> > Dowgate Hill, London EC4R 2YA. Both IG Markets Limited (register number
>> > 195355) and IG Index Limited (register number 114059) are authorised and
>> > regulated by the Financial Conduct Authority.
>> >
>>



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog
