I want to avoid allocations since I am using Java in a C-like mode. Even
though creating objects is a mere thread-local pointer bump in Java,
freeing them is not so cheap and causes uncontrollable jitter. The second
motivation is to avoid copying data. Since I have objects that really look
like C structs that can be sent over the wire, it's most efficient for me
to write them out directly into the very buffer that will be sent over the
wire.
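
To make the flyweight idea concrete, here is roughly what one of my
"structs" looks like. The field names, offsets, and record layout are made
up for illustration; the point is that reading and writing a record is
just a few absolute puts/gets against a ByteBuffer, with no per-message
object allocation:

```java
import java.nio.ByteBuffer;

// A flyweight "struct" over a ByteBuffer: fields live at fixed offsets,
// so writing a record is a handful of puts straight into the send buffer.
// One reusable flyweight instance can walk an entire buffer of records.
final class MetricRecordFlyweight {
    private static final int ID_OFFSET = 0;        // long
    private static final int TIMESTAMP_OFFSET = 8; // long
    private static final int VALUE_OFFSET = 16;    // double
    static final int SIZE = 24;                    // total record size

    private ByteBuffer buffer;
    private int offset;

    // Point this flyweight at a record within an existing buffer.
    MetricRecordFlyweight wrap(ByteBuffer buffer, int offset) {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    void id(long id)         { buffer.putLong(offset + ID_OFFSET, id); }
    long id()                { return buffer.getLong(offset + ID_OFFSET); }

    void timestamp(long ts)  { buffer.putLong(offset + TIMESTAMP_OFFSET, ts); }
    long timestamp()         { return buffer.getLong(offset + TIMESTAMP_OFFSET); }

    void value(double v)     { buffer.putDouble(offset + VALUE_OFFSET, v); }
    double value()           { return buffer.getDouble(offset + VALUE_OFFSET); }
}
```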

As for the bad API, I completely agree - it is a very C-style API and
definitely not usable in a productive way by most developers. My point was
that this work is done by the protocol-handling layer in any case; maybe
it can be extended to give a user access to its internals in a safe way,
both during writing and reading. The present API could then be written as
a layer over this "ugly" non-allocating API.

Re (1) and (2): instead of handing out keys and values as byte arrays,
which implies copies, I'd ideally like to scribble them straight into the
buffer that you are accumulating data in before sending it. I am guessing
you already maintain a single buffer per partition, or perhaps a single
buffer per broker. All of this probably implies a single-threaded producer
where I can be in charge of the event loop.

Right now my data lives in ByteBuffer/Unsafe-backed data structures. It
could be put on the wire without any serialization step if I were using
Java NIO directly, and it could similarly be consumed on the other side
without any deserialization step. But with the current Kafka API I have to:
  i) Copy data from my ByteBuffers into new byte arrays.
 ii) Wrap the byte arrays from (i) in a new object. I can't even re-use
this object since I don't know when Kafka's send/serialization thread is
really done with it.
 iii) Write an encoder that just takes the byte array from this wrapper
object and hands it to Kafka.
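
To make steps (i)-(iii) concrete, here is roughly what that copy chain
looks like. The Encoder interface below is an inline stand-in mirroring
the shape of Kafka's old kafka.serializer.Encoder, just so the sketch is
self-contained; the topic name and buffer contents are made up:

```java
import java.nio.ByteBuffer;

// Stand-in for kafka.serializer.Encoder so this compiles on its own.
interface Encoder<T> {
    byte[] toBytes(T t);
}

// Step (iii): the encoder adds nothing -- it just hands the array back,
// yet the API still forces steps (i) and (ii) to happen first.
final class PassThroughEncoder implements Encoder<byte[]> {
    public byte[] toBytes(byte[] bytes) { return bytes; }
}

final class ProducerCopyPath {
    // Step (i): allocate a fresh byte[] and copy a slice of my ByteBuffer
    // into it. duplicate() avoids disturbing the source buffer's position.
    static byte[] copyOut(ByteBuffer src, int offset, int length) {
        byte[] copy = new byte[length];
        ByteBuffer view = src.duplicate();
        view.position(offset);
        view.get(copy);
        return copy;
    }
}
```

Step (ii) is then wrapping the returned array in a message object (a
KeyedMessage in the old producer) purely so the encoder above can unwrap
it again.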

Similarly on the consumer:
  i) Kafka will copy slices (representing user values) of the ByteBuffer
that was transferred from a broker into new byte arrays.
 ii) Allocate an object (using the decoder) that wraps these byte arrays
and hand it to me.
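
Modeled in the same spirit, the consumer-side steps amount to something
like this. The Message class is just an illustrative stand-in for Kafka's
wrapper object, not a real Kafka class:

```java
import java.nio.ByteBuffer;

// Stand-in for the wrapper object Kafka's decoder produces per message.
final class Message {
    final byte[] payload;
    Message(byte[] payload) { this.payload = payload; }
}

final class ConsumerCopyPath {
    // Step (i): allocate a byte[] and copy the value slice out of the
    // buffer that was transferred from the broker.
    // Step (ii): allocate a wrapper object around that array.
    // Both allocations happen for every single message consumed.
    static Message materialize(ByteBuffer transferBuffer, int offset, int length) {
        byte[] copy = new byte[length];
        ByteBuffer view = transferBuffer.duplicate();
        view.position(offset);
        view.get(copy);
        return new Message(copy);
    }
}
```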

My imaginary (admittedly non-Java-esque, manual-allocation-style) API
would give me a pointer into the ByteBuffer that Kafka has been
accumulating protocol messages in, for either writing (on the producer) or
reading (on the consumer). I know it's a long shot, but I still wanted to
get the team's thoughts on it. I'd be happy to contribute if we can come
to an agreement on the API design. My hypothesis is that if the internal
protocol parsing and buffer creation logic is written like this, it
wouldn't be too tough to expose its innards and have the current
encoding/decoding APIs just sit on top of this low-level API.
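
For concreteness, here is a rough sketch of the kind of producer-side API
I am imagining. Every name here is hypothetical - nothing like this exists
in Kafka - and the in-memory implementation exists only to show the
calling pattern of reserve/write-in-place/commit:

```java
import java.nio.ByteBuffer;

// Hypothetical, illustrative interface: the producer hands out a window
// into its own send buffer; the caller writes the value in place and then
// commits the number of bytes actually used. No per-message allocation.
interface ZeroCopyProducer {
    ByteBuffer reserve(String topic, int partition, int maxLength);
    void commit(int bytesWritten);
}

// A toy single-buffer implementation, just to show the calling pattern.
final class InMemoryProducer implements ZeroCopyProducer {
    private final ByteBuffer sendBuffer = ByteBuffer.allocate(1 << 16);
    private int reservedStart = -1;

    public ByteBuffer reserve(String topic, int partition, int maxLength) {
        // A real implementation would pick the buffer for this topic/
        // partition and write the record header here first.
        reservedStart = sendBuffer.position();
        return sendBuffer; // caller scribbles directly into the send buffer
    }

    public void commit(int bytesWritten) {
        sendBuffer.position(reservedStart + bytesWritten);
        reservedStart = -1;
    }

    int bytesBuffered() { return sendBuffer.position(); }
}
```

The consumer-side mirror would be a callback handed the broker's response
buffer plus an offset/length per value, instead of a freshly allocated
message object.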

Thanks for listening to my rant.


On Thu, Oct 23, 2014 at 5:19 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

> It sounds like you are primarily interested in optimizing the producer?
>
> There is no way to produce data without any allocation being done and I
> think getting to that would be pretty hard and lead to bad apis, but
> avoiding memory allocation entirely shouldn't be necessary. Small transient
> objects in java are pretty cheap to allocate and deallocate. The new Kafka
> producer API that is on trunk and will be in 0.8.2 is much more disciplined
> in its usage of memory, though there is still some allocation. The goal is
> to avoid copying the *data* multiple times, even if we do end up creating
> some small helper objects along the way (the idea is that the data may be
> rather large).
>
> If you wanted to further optimize the new producer there are two things
> that could be done that would help:
> 1. Avoid the copy when creating the ProducerRecord instance. This could be
> done by accepting a length/offset along with the key and value and making
> use of this when writing to the records instance. As it is your key and
> value need to be complete byte arrays.
> 2. Avoid the copy during request serialization. This is a little trickier.
> During request serialization we need to take the records for each partition
> and create a request that contains all of them. It is possible to do this
> with no further recopying of data but somewhat tricky.
>
> My recommendation would be to try the new producer api and see how that
> goes. If you need to optimize further we would definitely take patches for
> (1) and (2).
>
> -Jay
>
> On Thu, Oct 23, 2014 at 4:03 PM, Rajiv Kurian <ra...@signalfuse.com>
> wrote:
>
> > I have a flyweight style protocol that I use for my messages. Thus they
> > require no serialization/deserialization to be processed. The messages
> are
> > just offset, length pairs within a ByteBuffer.
> >
> > Is there a producer and consumer API that forgoes allocation? I just want
> > to give the kakfa producer offsets from a ByteBuffer. Similarly it would
> be
> > ideal if I could get a ByteBuffer and offsets into it from the consumer.
> > Even if I could get byte arrays (implying a copy but no decoding phase)
> on
> > the consumer that would be great. Right now it seems to me that the only
> > way to get messages from Kafka is through a message object, which implies
> > Kafka allocates these messages all the time. I am willing to use the
> > upcoming 0.9 API too.
> >
> > Thanks.
> >
>
