@Roshan - if the data was already written to Kafka, your approach will
generate LOTS of duplicates. I'm not convinced it's ideal.

What's wrong with callbacks?
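Roughly what I have in mind, as an untested sketch against the new java
producer (the batching and retry policy here are made up; a real client
would probably collect the failed records rather than a single flag):

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    // Deliver one batch and block until every callback has fired.
    // Returns true if the whole batch was acked, false otherwise.
    static boolean sendBatch(Producer<String, String> producer,
                             List<ProducerRecord<String, String>> batch)
            throws InterruptedException {
        final CountDownLatch remaining = new CountDownLatch(batch.size());
        final AtomicBoolean failed = new AtomicBoolean(false);
        for (ProducerRecord<String, String> record : batch) {
            producer.send(record, new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception e) {
                    if (e != null)
                        failed.set(true); // or remember the record for a retry
                    remaining.countDown();
                }
            });
        }
        remaining.await();    // every callback has run at this point
        return !failed.get(); // caller re-sends the batch on false
    }

You still get a single blocking point per batch, without adding a new API.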
On Mon, Apr 27, 2015 at 2:53 PM, Roshan Naik <ros...@hortonworks.com> wrote:

> @Gwen
>
> - A failure in delivery of one or more events in the batch (the typical
> Flume case) is considered a failure of the entire batch, and the client
> redelivers the entire batch.
> - If clients want more fine-grained control, an alternative option is to
> indicate which events failed in the return value of producer.send(list<>).
>
>
> @Joel
> Fine-grained tracking of the status of individual events is quite painful
> in contrast to simply blocking on every batch. The old-style batched-sync
> mode has great advantages in terms of simplicity and performance.
> Imagine a simple use case of a client reading a directory of log files,
> splitting them into log messages/events, and pushing them through Kafka.
> It becomes more complex when all this tracking data needs to be persisted
> to accommodate client restarts/crashes. Tracking a simple current line
> number and file name is easy to program with and to persist, as opposed
> to the start/end position of each log message in the file.
>
>
> -roshan
>
>
>
> On 4/27/15 2:07 PM, "Joel Koshy" <jjkosh...@gmail.com> wrote:
>
> >As long as you retain the returned futures somewhere, you can always
> >iterate over the futures after the flush completes and check for
> >success/failure. Would that work for you?
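> >
> >E.g., roughly (an untested sketch; error handling kept minimal):
> >
> >    import java.util.ArrayList;
> >    import java.util.List;
> >    import java.util.concurrent.ExecutionException;
> >    import java.util.concurrent.Future;
> >    import org.apache.kafka.clients.producer.Producer;
> >    import org.apache.kafka.clients.producer.ProducerRecord;
> >    import org.apache.kafka.clients.producer.RecordMetadata;
> >
> >    // Send the whole batch, flush, then inspect each future.
> >    static boolean sendAndFlush(Producer<String, String> producer,
> >                                List<ProducerRecord<String, String>> batch)
> >            throws InterruptedException {
> >        List<Future<RecordMetadata>> futures =
> >            new ArrayList<Future<RecordMetadata>>(batch.size());
> >        for (ProducerRecord<String, String> record : batch)
> >            futures.add(producer.send(record));
> >        producer.flush(); // returns once every buffered send completes
> >        boolean allAcked = true;
> >        for (Future<RecordMetadata> f : futures) {
> >            try {
> >                f.get(); // already complete after flush(); doesn't block
> >            } catch (ExecutionException e) {
> >                allAcked = false; // this record failed; re-send the batch
> >            }
> >        }
> >        return allAcked;
> >    }
> >
> >That still gives you one blocking point per batch.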
> >
> >On Mon, Apr 27, 2015 at 08:53:36PM +0000, Roshan Naik wrote:
> >> The important guarantee needed by a client producer thread is an
> >> indication of success/failure of the batch of events it pushed.
> >> Essentially it needs to retry producer.send() on that same batch in
> >> case of failure. My understanding is that flush will simply flush
> >> data from all threads (correct me if I am wrong).
> >>
> >> -roshan
> >>
> >>
> >> On 4/27/15 1:36 PM, "Joel Koshy" <jjkosh...@gmail.com> wrote:
> >>
> >> >This sounds like flush:
> >> >https://cwiki.apache.org/confluence/display/KAFKA/KIP-8+-+Add+a+flush+method+to+the+producer+API
> >> >
> >> >which was recently implemented in trunk.
> >> >
> >> >Joel
> >> >
> >> >On Mon, Apr 27, 2015 at 08:19:40PM +0000, Roshan Naik wrote:
> >> >> I've been evaluating the perf of the old and new Producer APIs for
> >> >> reliable, high-volume streaming data movement. I do see one area of
> >> >> improvement that the new API could use for synchronous clients.
> >> >>
> >> >> AFAICT, the new API does not support batched synchronous transfers.
> >> >> To do a synchronous send, one needs to do a future.get() after
> >> >> every Producer.send(). I changed the new
> >> >> o.a.k.clients.tools.ProducerPerformance tool to assess the perf of
> >> >> this mode of operation. It may not be surprising that it is much
> >> >> slower than the async mode... hard to push it beyond 4MB/s.
> >> >>
> >> >> The 0.8.1 Scala-based producer API supported a batched sync mode
> >> >> via Producer.send( List<KeyedMessage> ). My measurements show that
> >> >> it was able to approach (and sometimes exceed) the old async
> >> >> speeds... 266MB/s.
> >> >>
> >> >> Supporting this batched sync mode is very critical for streaming
> >> >> clients (such as Flume, for example) that need delivery guarantees.
> >> >> Although it can be done with the async mode, it requires additional
> >> >> bookkeeping as to which events are delivered and which are not. The
> >> >> programming model becomes much simpler with the batched sync mode.
> >> >> The client having to deal with one single future.get() helps
> >> >> performance greatly too, as I noted.
> >> >>
> >> >> I wanted to propose adding this as an enhancement to the new
> >> >> Producer API.
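> >> >>
> >> >> For concreteness, the old-API pattern I'm comparing against looks
> >> >> roughly like this (a sketch; the "logs" topic name and the shape
> >> >> of the surrounding plumbing are made up):
> >> >>
> >> >>     import java.util.ArrayList;
> >> >>     import java.util.List;
> >> >>     import kafka.common.FailedToSendMessageException;
> >> >>     import kafka.javaapi.producer.Producer;
> >> >>     import kafka.producer.KeyedMessage;
> >> >>
> >> >>     // One blocking call per batch (producer.type=sync).
> >> >>     static boolean sendOneBatch(Producer<String, String> producer,
> >> >>                                 List<String> lines) {
> >> >>         List<KeyedMessage<String, String>> batch =
> >> >>             new ArrayList<KeyedMessage<String, String>>();
> >> >>         for (String line : lines)
> >> >>             batch.add(new KeyedMessage<String, String>("logs", line));
> >> >>         try {
> >> >>             producer.send(batch); // blocks for the whole batch
> >> >>             return true;  // checkpoint (file name, line number)
> >> >>         } catch (FailedToSendMessageException e) {
> >> >>             return false; // re-send the same batch
> >> >>         }
> >> >>     }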