Re: [DISCUSS] KIP-82 - Add Record Headers

Gwen Shapira Tue, 08 Nov 2016 15:37:13 -0800

Forgot to mention: Thank you for quantifying the trade-off - it is
helpful and important regardless of what we end up deciding.


On Tue, Nov 8, 2016 at 3:12 PM, Sean McCauliff
<[email protected]> wrote:
> On Tue, Nov 8, 2016 at 2:15 PM, Gwen Shapira <[email protected]> wrote:
>
>> Since Kafka specifically targets high-throughput, low-latency
>> use-cases, I don't think we should trade them off that easily.
>>
>
> I find these kinds of design goals not to be really helpful unless they're
> quantified in some way, because it's always possible to argue against
> something as either being not performant or just an implementation detail.
>
> This is a single-threaded benchmark, so all the measurements are per
> thread.
>
> For 1M messages/s/thread, if header keys are ints and you had even a single
> header key/value pair, then parsing it still costs about 2^-2 (0.25)
> microseconds, which means you only have another 0.75 microseconds to do
> everything else you want to do with a message (1M messages/s means 1
> microsecond per message).  With string header keys there is still 0.5
> microseconds left to process a message.
>
>
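The budget arithmetic above can be checked with a small sketch. The function
name is hypothetical, and the 0.5 microsecond cost for string keys is inferred
from the 0.5 microseconds said to remain, not read from the benchmark document.

```python
def remaining_budget_us(messages_per_second, header_cost_us):
    """Time left per message, in microseconds, after header parsing."""
    budget_us = 1_000_000 / messages_per_second  # 1M messages/s -> 1 us each
    return budget_us - header_cost_us

# Int keys: about 2^-2 us per header pair, as stated in the message above.
int_key_left = remaining_budget_us(1_000_000, 2 ** -2)
# String keys: 0.5 us cost is an inferred figure (assumption).
str_key_left = remaining_budget_us(1_000_000, 0.5)

print(int_key_left, str_key_left)
```

At lower rates the header cost shrinks relative to the budget: at 100K
messages/s/thread the same 0.25 microseconds comes out of a 10 microsecond
budget.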
>
>> I love strings as much as the next guy (we had them in Flume), but I
>> was convinced by Magnus/Michael/Radai that strings don't actually have
>> strong benefits as opposed to ints (you'll need a string registry
>> anyway - otherwise, how will you know what the "profile_id"
>> header refers to?) and I want to keep closer to our original design
>> goals for Kafka.
>>
>
> "confluent.profile_id"
>
>
>>
>> If someone likes strings in the headers and doesn't do millions of
>> messages a sec, they probably have lots of other systems they can use
>> instead.
>>
>
> None of them will scale like Kafka.  Horizontal scaling is still good.
>
>
>>
>>
>> On Tue, Nov 8, 2016 at 1:22 PM, Sean McCauliff
>> <[email protected]> wrote:
>> > +1 for String keys.
>> >
>> > I've been doing some benchmarking and it seems like the speedup from using
>> > integer keys is about 2-5x depending on the length of the strings and what
>> > collections are being used.  The overall amount of time spent parsing a set
>> > of header key/value pairs probably does not matter unless you are getting
>> > close to 1M messages/s per consumer.  In which case probably don't use
>> > headers.  There is also the option to use very short strings; some are
>> > even shorter than integers.
>> >
>> > Partitioning the string key space will be easier than partitioning an
>> > integer key space. We won't need a global registry.  Kafka internally can
>> > reserve some prefix like "_" as its namespace.  Everyone else can use their
>> > company or project name as namespace prefix and life should be good.
>> >
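A minimal sketch of the prefix scheme described above, assuming a "."
separator between namespace and key (taken from the "confluent.profile_id"
example elsewhere in this thread). The function name and the error handling are
hypothetical; nothing like this exists in Kafka.

```python
KAFKA_RESERVED_PREFIX = "_"  # prefix Kafka would reserve, per the proposal

def check_header_key(key, namespace):
    """Return key if it sits in the caller's namespace, else raise ValueError."""
    if key.startswith(KAFKA_RESERVED_PREFIX):
        raise ValueError("keys starting with '_' are reserved for Kafka")
    if not key.startswith(namespace + "."):
        raise ValueError("key %r is outside namespace %r" % (key, namespace))
    return key

print(check_header_key("confluent.profile_id", "confluent"))
```

No registry lookup is involved: uniqueness rests on each organization
sticking to its own prefix.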
>> > Here's the link to some of the benchmarking info:
>> > https://docs.google.com/document/d/1tfT-6SZdnKOLyWGDH82kS30PnUkmgb7nPLdw6p65pAI/edit?usp=sharing
>> >
>> >
>> >
>> > --
>> > Sean McCauliff
>> > Staff Software Engineer
>> > Kafka
>> >
>> > [email protected]
>> > linkedin.com/in/sean-mccauliff-b563192
>> >
>> > On Mon, Nov 7, 2016 at 11:51 PM, Michael Pearce <[email protected]>
>> > wrote:
>> >
>> >> +1 on this slimmer version of our proposal
>> >>
>> >> I definitely think we can reduce the Id space from the proposed int32
>> >> (4 bytes) down to int16 (2 bytes). It saves on space, and as these are
>> >> headers we wouldn't expect the number of headers being used concurrently
>> >> to be that high.
>> >>
>> >> I did wonder if we should still make the value byte array length int32,
>> >> as this is the standard max array length in Java. That said, it is a
>> >> header, and I guess limiting the size is sensible and would work for all
>> >> the use cases we have in mind, so I'm happy with limiting this.
>> >>
>> >> Do people generally concur on Magnus's slimmer version? Anyone see any
>> >> issues if we moved from int32 to int16?
>> >>
>> >> Re configurable ids per plugin over a global registry: that also would
>> >> work for us.  As such, if this has better consensus than the proposed
>> >> global registry, I'd be happy to change that.
>> >>
>> >> I was already sold on ints over strings for keys ;)
>> >>
>> >> Cheers
>> >> Mike
>> >>
>> >> ________________________________________
>> >> From: Magnus Edenhill <[email protected]>
>> >> Sent: Monday, November 7, 2016 10:10:21 PM
>> >> To: [email protected]
>> >> Subject: Re: [DISCUSS] KIP-82 - Add Record Headers
>> >>
>> >> Hi,
>> >>
>> >> I'm +1 for adding generic message headers, but I do share the concerns
>> >> previously aired on this thread and during the KIP meeting.
>> >>
>> >> So let me propose a slimmer alternative that does not require any sort of
>> >> global header registry, does not affect broker performance or operations,
>> >> and adds as little overhead as possible.
>> >>
>> >>
>> >> Message
>> >> ------------
>> >> The protocol Message type is extended with a Headers array consisting of
>> >> Tags, where a Tag is defined as:
>> >>    int16 Id
>> >>    int16 Len              // binary_data length
>> >>    binary_data[Len]  // opaque binary data
>> >>
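A minimal sketch of the Tag layout above, assuming big-endian (network byte
order) fields as used elsewhere in the Kafka protocol. How the Headers array
itself is framed (count or length prefix) is not specified in this message, so
the sketch simply concatenates tags; the function names are hypothetical.

```python
import struct

# int16 Id followed by int16 Len, big-endian (byte order is an assumption).
TAG_HEADER = struct.Struct(">hh")

def encode_tags(tags):
    """Encode (id, binary_data) pairs back to back."""
    out = bytearray()
    for tag_id, data in tags:
        # struct raises struct.error if id or length does not fit in int16.
        out += TAG_HEADER.pack(tag_id, len(data))
        out += data
    return bytes(out)

def decode_tags(buf):
    """Decode concatenated tags into a list of (id, binary_data) pairs."""
    tags = []
    offset = 0
    while offset < len(buf):
        tag_id, length = TAG_HEADER.unpack_from(buf, offset)
        offset += TAG_HEADER.size
        if length < 0 or offset + length > len(buf):
            raise ValueError("truncated or malformed tag")
        tags.append((tag_id, bytes(buf[offset:offset + length])))
        offset += length
    return tags

wire = encode_tags([(1000, b"src"), (1001, b"")])
print(len(wire), decode_tags(wire))
```

Each tag costs 4 bytes of overhead on top of its data, which is the figure the
int16 versus int32 discussion in this thread is about.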
>> >>
>> >> Ids
>> >> ---
>> >> The Id space is not centrally managed, so whenever an application needs to
>> >> add headers, or use an eco-system plugin that does, its Id allocation will
>> >> need to be manually configured.
>> >> This moves the allocation concern from the global space down to
>> >> organization level and avoids the risk for id conflicts.
>> >> Example pseudo-config for some app:
>> >>     sometrackerplugin.tag.sourcev3.id=1000
>> >>     dbthing.tag.tablename.id=1001
>> >>     myschemareg.tag.schemaname.id=1002
>> >>     myschemareg.tag.schemaversion.id=1003
>> >>
>> >>
>> >> Each header-writing or header-reading plugin must provide means (typically
>> >> through configuration) to specify the tag for each header it uses. Defaults
>> >> should be avoided.
>> >> A consumer silently ignores tags it does not have a mapping for (since the
>> >> binary_data can't be parsed without knowing what it is).
>> >>
>> >> Id range 0..999 is reserved for future use by the broker and must not be
>> >> used by plugins.
>> >>
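The Id rules above (configured per application, 0..999 reserved, unknown tags
silently ignored) might look like this on the consumer side. This is a sketch
only: the config keys are the pseudo-config from this message, and the function
names and conflict check are hypothetical.

```python
RESERVED_MAX_ID = 999   # 0..999 reserved for the broker, per the proposal
INT16_MAX = 32767       # Id is an int16 on the wire

def load_tag_ids(config):
    """Build an {id: name} mapping from '<name>.id=<number>' style settings."""
    ids = {}
    for key, value in config.items():
        if not key.endswith(".id"):
            continue
        tag_id = int(value)
        if tag_id <= RESERVED_MAX_ID or tag_id > INT16_MAX:
            raise ValueError("id %d for %s is reserved or out of range" % (tag_id, key))
        if tag_id in ids:
            raise ValueError("id %d is configured twice" % tag_id)
        ids[tag_id] = key[:-len(".id")]
    return ids

def known_headers(tags, ids):
    """Keep only tags with a configured mapping; the rest are dropped silently."""
    return {ids[tag_id]: data for tag_id, data in tags if tag_id in ids}

config = {
    "sometrackerplugin.tag.sourcev3.id": "1000",
    "dbthing.tag.tablename.id": "1001",
}
ids = load_tag_ids(config)
print(known_headers([(1000, b"a"), (4242, b"x")], ids))
```

Tag 4242 has no mapping here, so it never reaches application code.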
>> >>
>> >>
>> >> Broker
>> >> ---------
>> >> The broker does not process the tags (other than the standard protocol
>> >> syntax verification), it simply stores and forwards them as opaque data.
>> >>
>> >> Standard message translation (removal of Headers) kicks in for older
>> >> clients.
>> >>
>> >>
>> >> Why not string ids?
>> >> -------------------------
>> >> String ids might seem like a good idea, but they:
>> >>  * do not really solve uniqueness
>> >>  * consume a lot of space (2 byte string length + string, per header) to
>> >> be meaningful
>> >>  * don't really say anything about how to parse the tag's data, so they
>> >> are in effect useless on their own.
>> >>
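The space argument above can be made concrete with a short sketch, assuming
UTF-8 keys and the "2 byte string length + string" encoding mentioned in the
list. The example key is borrowed from elsewhere in this thread; the names are
hypothetical.

```python
INT16_KEY_BYTES = 2  # slimmer proposal
INT32_KEY_BYTES = 4  # original KIP proposal

def string_key_bytes(key):
    """Bytes used by a string key: 2-byte length prefix plus UTF-8 bytes."""
    return 2 + len(key.encode("utf-8"))

print(string_key_bytes("confluent.profile_id"), string_key_bytes("p"))
```

A namespaced key such as "confluent.profile_id" takes 22 bytes against 2 for
an int16 id, while a one-character key takes 3 bytes, which is less than an
int32 id but still more than an int16 one.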
>> >>
>> >> Regards,
>> >> Magnus
>> >>
>> >>
>> >>
>> >>
>> >> 2016-11-07 18:32 GMT+01:00 Michael Pearce <[email protected]>:
>> >>
>> >> > Hi Roger,
>> >> >
>> >> > Thanks for the support.
>> >> >
>> >> > I think the key thing is to have a common key space. To make an
>> >> > ecosystem, there does have to be some level of contract for people to
>> >> > play nicely.
>> >> >
>> >> > Having map<String, byte[]> or, as currently proposed in the KIP, a
>> >> > numerical key space of map<int, byte[]> is a level of contract that
>> >> > most people would expect.
>> >> >
>> >> > I think the example in a previous comment someone else made, linking to
>> >> > the AWS blog and the implemented API, where originally they didn't have
>> >> > a header space but now they do, and where keys are uniform but the value
>> >> > can be a string, an int, or anything, is a good example.
>> >> >
>> >> > Having a custom MetadataSerializer is something we had played with, but
>> >> > we discounted the idea: if you want everyone to work the same way in the
>> >> > ecosystem, having this also customizable makes it a bit harder.
>> >> > Think about making the whole message record custom serializable; this
>> >> > would be fairly tricky (though not impossible) to make work nicely.
>> >> > Having the value customizable is, we thought, a reasonable tradeoff here
>> >> > of flexibility over contract of interaction between different parties.
>> >> >
>> >> > Is there a particular case or benefit of having serialization
>> >> > customizable that you have in mind?
>> >> >
>> >> > That said, it is obviously something that could be implemented if there
>> >> > is a need. If we did go down this avenue, I think a default serializer
>> >> > implementation should exist so that, per the 80:20 rule, people can just
>> >> > have the broker and clients get default behavior.
>> >> >
>> >> > Cheers
>> >> > Mike
>> >> >
>> >> > On 11/6/16, 5:25 PM, "radai" <[email protected]> wrote:
>> >> >
>> >> >     making header _key_ serialization configurable potentially undermines
>> >> >     the broad usefulness of the feature (any point along the path must be
>> >> >     able to read the header keys. the values may be whatever and require
>> >> >     more intimate knowledge of the code that produced specific headers,
>> >> >     but keys should be universally readable).
>> >> >
>> >> >     it would also make it hard to write really portable plugins - say i
>> >> >     wrote a large message splitter/combiner - if i rely on key
>> >> >     "largeMessage" and values of the form "1/20" someone who uses
>> >> >     (contrived example) Map<Byte[], Double> wouldnt be able to re-use my
>> >> >     code.
>> >> >
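A minimal sketch of the splitter/combiner example above, assuming string
header keys and ASCII values of the form "1/20". The record shape (a headers
dict paired with a payload chunk) and the function names are hypothetical, not
a Kafka client API.

```python
HEADER_KEY = "largeMessage"

def split_message(payload, chunk_size):
    """Split payload into chunks, each tagged with an 'index/total' header."""
    chunks = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)] or [b""]
    total = len(chunks)
    return [({HEADER_KEY: ("%d/%d" % (n, total)).encode("ascii")}, chunk)
            for n, chunk in enumerate(chunks, start=1)]

def combine_messages(records):
    """Reassemble chunks in index order, whatever order they arrived in."""
    parts = {}
    total = None
    for headers, chunk in records:
        index, count = map(int, headers[HEADER_KEY].decode("ascii").split("/"))
        if total is not None and count != total:
            raise ValueError("chunks disagree on the total count")
        total = count
        parts[index] = chunk
    if total is None or set(parts) != set(range(1, total + 1)):
        raise ValueError("missing chunks")
    return b"".join(parts[i] for i in range(1, total + 1))

records = split_message(b"abcdefghij", 4)
print(combine_messages(list(reversed(records))))
```

The combiner only works because it can rely on the key type and value format,
which is the portability point being made above.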
>> >> >     not the end of the world within an organization, but problematic if
>> >> >     you want to enable an ecosystem
>> >> >
>> >> >     On Thu, Nov 3, 2016 at 2:04 PM, Roger Hoover <[email protected]> wrote:
>> >> >
>> >> >     > As others have laid out, I see strong reasons for a common message
>> >> >     > metadata structure for the Kafka ecosystem.  In particular, I've
>> >> >     > seen that even within a single organization, infrastructure teams
>> >> >     > often own the message metadata while application teams own the
>> >> >     > application-level data format.  Allowing metadata and content to
>> >> >     > have different structure and evolve separately is very helpful for
>> >> >     > this.  Also, I think there's a lot of value to having a common
>> >> >     > metadata structure shared across the Kafka ecosystem so that tools
>> >> >     > which leverage metadata can more easily be shared across
>> >> >     > organizations and integrated together.
>> >> >     >
>> >> >     > The question is, where does the metadata structure belong?  Here's
>> >> >     > my take:
>> >> >     >
>> >> >     > We change the Kafka wire and on-disk format from a (key, value)
>> >> >     > model to a (key, metadata, value) model where all three are byte
>> >> >     > arrays from the broker's point of view.  The primary reason for this
>> >> >     > is that it provides a backward compatible migration path forward.
>> >> >     > Producers can start populating metadata fields before all consumers
>> >> >     > understand the metadata structure.  For people who already have
>> >> >     > custom envelope structures, they can populate their existing
>> >> >     > structure and the new structure for a while as they make the
>> >> >     > transition.
>> >> >     >
>> >> >     > We could stop there and let the clients plug in a KeySerializer,
>> >> >     > MetadataSerializer, and ValueSerializer but I think it would also be
>> >> >     > useful to have a default MetadataSerializer that implements a
>> >> >     > key-value model similar to AMQP or HTTP headers.  Or we could go
>> >> >     > even further and prescribe a Map<String, byte[]> or
>> >> >     > Map<String, String> data model for headers in the clients (while
>> >> >     > still allowing custom serialization of the header data model).
>> >> >     >
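One way a default serializer for the Map<String, byte[]> model might look,
sketched in Python. The wire layout here (int32 entry count, then per entry an
int16 key length, UTF-8 key, int32 value length, value bytes, all big-endian)
is purely an assumption for illustration; the thread does not define one, and
the function names are hypothetical.

```python
import struct

def serialize_metadata(headers):
    """Serialize a {str: bytes} mapping into one opaque byte array."""
    out = bytearray(struct.pack(">i", len(headers)))
    for key, value in headers.items():
        key_bytes = key.encode("utf-8")
        out += struct.pack(">h", len(key_bytes)) + key_bytes
        out += struct.pack(">i", len(value)) + value
    return bytes(out)

def deserialize_metadata(buf):
    """Inverse of serialize_metadata."""
    (count,) = struct.unpack_from(">i", buf, 0)
    offset = 4
    headers = {}
    for _ in range(count):
        (key_len,) = struct.unpack_from(">h", buf, offset)
        offset += 2
        key = buf[offset:offset + key_len].decode("utf-8")
        offset += key_len
        (value_len,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        headers[key] = bytes(buf[offset:offset + value_len])
        offset += value_len
    return headers

metadata = serialize_metadata({"trace.id": b"\x01\x02", "empty": b""})
print(deserialize_metadata(metadata))
```

From the broker's point of view the result is just the opaque metadata byte
array of the (key, metadata, value) model; only clients would need to agree on
the layout.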
>> >> >     > I think this would address Radai's concerns:
>> >> >     > 1. All client code would not need to be updated to know about the
>> >> >     > container.
>> >> >     > 2. Middleware friendly clients would have a standard header data
>> >> >     > model to work with.
>> >> >     > 3. KIP is required both b/c of broker changes and because of client
>> >> >     > API changes.
>> >> >     >
>> >> >     > Cheers,
>> >> >     >
>> >> >     > Roger
>> >> >     >
>> >> >     >
>> >> >     > On Wed, Nov 2, 2016 at 4:38 PM, radai <[email protected]> wrote:
>> >> >     >
>> >> >     > > my biggest issues with a "standard" wrapper format:
>> >> >     > >
>> >> >     > > 1. _ALL_ client _CODE_ (as opposed to kafka lib version) must be
>> >> >     > > updated to know about the container, because any old naive code
>> >> >     > > trying to directly deserialize its own payload would keel over and
>> >> >     > > die (it needs to know to deserialize a container, and then dig in
>> >> >     > > there for its payload).
>> >> >     > > 2. in order to write middleware-friendly clients that utilize such
>> >> >     > > a container one would basically have to write their own
>> >> >     > > producer/consumer API on top of the open source kafka one.
>> >> >     > > 3. if you were going to go with a wrapper format you really dont
>> >> >     > > need to bother with a kip (just open source your own client stack
>> >> >     > > from #2 above so others could stop re-inventing it)
>> >> >     > >
>> >> >     > > On Wed, Nov 2, 2016 at 4:25 PM, James Cheng <[email protected]> wrote:
>> >> >     > >
>> >> >     > > > How exactly would this work? Or maybe that's out of scope for
>> >> >     > > > this email.
>> >> >     > >
>> >> >     >
>> >> >
>> >> >
>> >> > The information contained in this email is strictly confidential and
>> for
>> >> > the use of the addressee only, unless otherwise indicated. If you are
>> not
>> >> > the intended recipient, please do not read, copy, use or disclose to
>> >> others
>> >> > this message or any attachment. Please also notify the sender by
>> replying
>> >> > to this email or by telephone (+44(020 7896 0011) and then delete the
>> >> email
>> >> > and any copies of it. Opinions, conclusion (etc) that do not relate to
>> >> the
>> >> > official business of this company shall be understood as neither given
>> >> nor
>> >> > endorsed by it. IG is a trading name of IG Markets Limited (a company
>> >> > registered in England and Wales, company number 04008957) and IG Index
>> >> > Limited (a company registered in England and Wales, company number
>> >> > 01190902). Registered address at Cannon Bridge House, 25 Dowgate Hill,
>> >> > London EC4R 2YA. Both IG Markets Limited (register number 195355) and
>> IG
>> >> > Index Limited (register number 114059) are authorised and regulated by
>> >> the
>> >> > Financial Conduct Authority.
>> >> >
>> >>
>>
>>
>>
>> --
>> Gwen Shapira
>> Product Manager | Confluent
>> 650.450.2760 | @gwenshap
>> Follow us: Twitter | blog
>>



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog
