Re: New Producer Public API

Andrey Yegorov Fri, 24 Jan 2014 13:05:53 -0800

I love the callback in send() but I do not see how it helps in case of an
error.


Imagine the usecase: I want to write messages to the log so I can replay
them to kafka later in case if async send failed.
>From a brief look at the API I see that I'll get back RecordSend object
(which is not true already - it was not send in case of error.) From that
object I can get some info about the error and offset. How d I get the
original message back so I can write it to the log? Can you please provide
an example?



----------
Andrey Yegorov


On Fri, Jan 24, 2014 at 11:54 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

> As mentioned in a previous email we are working on a re-implementation of
> the producer. I would like to use this email thread to discuss the details
> of the public API and the configuration. I would love for us to be
> incredibly picky about this public api now so it is as good as possible and
> we don't need to break it in the future.
>
> The best way to get a feel for the API is actually to take a look at the
> javadoc, my hope is to get the api docs good enough so that it is
> self-explanatory:
>
> http://empathybox.com/kafka-javadoc/index.html?kafka/clients/producer/KafkaProducer.html
>
> Please take a look at this API and give me any thoughts you may have!
>
> It may also be reasonable to take a look at the configs:
>
> http://empathybox.com/kafka-javadoc/kafka/clients/producer/ProducerConfig.html
>
> The actual code is posted here:
> https://issues.apache.org/jira/browse/KAFKA-1227
>
> A few questions or comments to kick things off:
> 1. We need to make a decision on whether serialization of the user's key
> and value should be done by the user (with our api just taking byte[]) or
> if we should take an object and allow the user to configure a Serializer
> class which we instantiate via reflection. We take the later approach in
> the current producer, and I have carried this through to this prototype.
> The tradeoff I see is this: taking byte[] is actually simpler, the user can
> directly do whatever serialization they like. The complication is actually
> partitioning. Currently partitioning is done by a similar plug-in api
> (Partitioner) which the user can implement and configure to override how
> partitions are assigned. If we take byte[] as input then we have no access
> to the original object and partitioning MUST be done on the byte[]. This is
> fine for hash partitioning. However for various types of semantic
> partitioning (range partitioning, or whatever) you would want access to the
> original object. In the current approach a producer who wishes to send
> byte[] they have serialized in their own code can configure the
> BytesSerialization we supply which is just a "no op" serialization.
> 2. We should obsess over naming and make sure each of the class names are
> good.
> 3. Jun has already pointed out that we need to include the topic and
> partition in the response, which is absolutely right. I haven't done that
> yet but that definitely needs to be there.
> 4. Currently RecordSend.await will throw an exception if the request
> failed. The intention here is that producer.send(message).await() exactly
> simulates a synchronous call. Guozhang has noted that this is a little
> annoying since the user must then catch exceptions. However if we remove
> this then if the user doesn't check for errors they won't know one has
> occurred, which I predict will be a common mistake.
> 5. Perhaps there is more we could do to make the async callbacks and future
> we give back intuitive and easy to program against?
>
> Some background info on implementation:
>
> At a high level the primary difference in this producer is that it removes
> the distinction between the "sync" and "async" producer. Effectively all
> requests are sent asynchronously but always return a future response object
> that gives the offset as well as any error that may have occurred when the
> request is complete. The batching that is done in the async producer only
> today is done whenever possible now. This means that the sync producer,
> under load, can get performance as good as the async producer (preliminary
> results show the producer getting 1m messages/sec). This works similar to
> group commit in databases but with respect to the actual network
> transmission--any messages that arrive while a send is in progress are
> batched together. It is also possible to encourage batching even under low
> load to save server resources by introducing a delay on the send to allow
> more messages to accumulate; this is done using the linger.ms config (this
> is similar to Nagle's algorithm in TCP).
>
> This producer does all network communication asynchronously and in parallel
> to all servers so the performance penalty for acks=-1 and waiting on
> replication should be much reduced. I haven't done much benchmarking on
> this yet, though.
>
> The high level design is described a little here, though this is now a
> little out of date:
> https://cwiki.apache.org/confluence/display/KAFKA/Client+Rewrite
>
> -Jay
>

Re: New Producer Public API

Reply via email to