Resuscitating this thread. I've done some more experiments and profiling. My messages are very tiny (currently 25 bytes) per message and creating multiple objects per message leads to a lot of churn. The memory churn through creation of convenience objects is more than the memory being used by my objects right now. I could probably batch my messages further, to make this effect less pronounced. I did some rather unscientific experiments with a flyweight approach on top of the ByteBuffer for a simple messaging API (peer to peer NIO based so not a real comparison) and the numbers were very satisfactory and there is no garbage created in steady state at all. Though I don't expect such good numbers from actually going through the broker + all the other extra stuff that a real producer would do, I think there is great potential here.
The general mechanism for me is this: i) A buffer (I used Unsafe but I imagine ByteBuffer having similar performance) is created per partition. ii) A CAS loop (in Java 7 and less) or even better unsafe.getAndAddInt() in Java 8 can be used to claim a chunk of bytes on the per topic buffer. This code can be invoked from multiple threads in a wait free manner (wait-free in Java 8, since getAndAddInt() is wait-free). Once a region in the buffer is claimed, it can be operated on using the flyweight method that we talked about. If the buffer doesn't have enough space then we can drop the message or move onto a new buffer. Further this creates absolutely zero objects in steady state (only a few objects created in the beginning). Even if the flyweight method is not desired, the API can just take byte arrays or objects that need to be serialized and copy them onto the per topic buffers in a similar way. This API has been validated in Aeron too, so I am pretty confident that it will work well. For the zero copy technique here is a link to Aeron API with zero copy - https://github.com/real-logic/Aeron/issues/18. The regular one copies byte arrays but without any object creation. iii) The producer send thread can then just go in FIFO order through the buffer sending messages that have been committed using NIO to rotate between brokers. We might need a background thread to zero out used buffers too. I've left out some details, but again none of this very revolutionary - it's mostly the same techniques used in Aeron. I really think that we can keep the API ga rbage free and wait-free (even in the multi producer case) without compromising how pretty it looks - the total zero copy API will low level, but it should only be used by advanced users. Moreover the usual producer.send(msg, topic, partition) can use the efficient ByteBuffer offset API internally without it itself creating any garbage. With the technique I talked about there is no need for an intermediate queue of any kind since the underlying ByteBuffer per partition acts as the queue. I can do more experiments with some real producer code instead of my toy code to further validate the idea, but I am pretty sure that both throughput and jitter characteristics will improve thanks to lower contention (wait-free in java 8 with a single getAndAddInt() operation for sync ) and better cache locality (C like buffers and a few constant number of objects per partition). If you guys are interested, I'd love to talk more. Again just to reiterate, I don't think the API will suffer at all - most of this can be done under the covers. Additionally it will open up things so that a low level zero copy API is possible. Thanks, Rajiv