Ack, nice, should have thought of doing that...

-Jay


On Mon, Feb 10, 2014 at 10:12 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Added this to our FAQ -
>
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIgetexactlyonemessagingfromKafka?
>
>
>
> On Mon, Feb 10, 2014 at 9:46 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > The out-of-the-box support for this in Kafka isn't great right now.
> >
> > Exactly once semantics has two parts: avoiding duplication during data
> > production and avoiding duplicates during data consumption.
> >
> > There are two approaches to getting exactly once semantics during data
> > production.
> >
> > 1. Use a single writer per partition, and every time you get a network
> > error, check the last message in that partition to see if your last
> > write succeeded.
> > 2. Include a primary key (a UUID or something) in the message and
> > deduplicate on the consumer.
> >
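The two producer-side approaches above can be sketched roughly as follows. This is a minimal illustration, not Kafka client code: PartitionLog is a hypothetical in-memory stand-in for a single partition, and its fault attribute, send_once, make_message, and deduplicate are all names assumed for the example.

```python
import uuid


class PartitionLog:
    """Hypothetical in-memory stand-in for one partition (not a Kafka API)."""

    def __init__(self):
        self.messages = []
        self.fault = None  # "lost_write" or "lost_ack" simulates one network error

    def append(self, message):
        if self.fault == "lost_write":
            self.fault = None
            raise ConnectionError("request never reached the log")
        self.messages.append(message)
        if self.fault == "lost_ack":
            self.fault = None
            # The write landed, but the producer never hears about it.
            raise ConnectionError("acknowledgement lost")

    def last(self):
        return self.messages[-1] if self.messages else None


def send_once(log, message):
    """Approach 1: on a network error, check the log tail before retrying.

    This is only safe when this producer is the sole writer to the partition,
    so the tail can only be our own last write or an older message.
    """
    try:
        log.append(message)
    except ConnectionError:
        if log.last() != message:
            log.append(message)  # the write was lost, so retry it


def make_message(payload):
    """Approach 2: attach a primary key (here a UUID) to every message."""
    return {"id": str(uuid.uuid4()), "payload": payload}


def deduplicate(messages):
    """Consumer side of approach 2: drop messages whose key was already seen."""
    seen = set()
    for message in messages:
        if message["id"] in seen:
            continue
        seen.add(message["id"])
        yield message
```

In a real deployment the set of seen keys would have to be bounded or persisted; the sketch keeps it in memory for brevity.
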
> > If you do one of these things, the log that Kafka hosts will be
> > duplicate-free. However, reading without duplicates depends on some
> > cooperation from the consumer too. If the consumer is periodically
> > checkpointing its position, then if it fails and restarts it will
> > restart from the checkpointed position. Thus, if the data output and the
> > checkpoint are not written atomically, it will be possible to get
> > duplicates here as well. The solution to this problem is particular to
> > your storage system. For example, if you are using a database you could
> > commit these together in a transaction. The HDFS loader Camus that
> > LinkedIn wrote does something like this for Hadoop loads. The other
> > alternative, which doesn't require a transaction, is to store the offset
> > with the data loaded and deduplicate using the topic/partition/offset
> > combination.
> >
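As a rough sketch of the two consumer-side options above, assuming SQLite as the destination store: the table names, column names, and helper functions here are made up for the example and are not part of Kafka or Camus. The checkpoint is written in the same transaction as the data, and the primary key on topic/partition/offset makes a replayed record a no-op.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) destination tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " topic TEXT, partition_id INTEGER, msg_offset INTEGER, payload TEXT,"
        " PRIMARY KEY (topic, partition_id, msg_offset))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " topic TEXT, partition_id INTEGER, next_offset INTEGER,"
        " PRIMARY KEY (topic, partition_id))"
    )
    conn.commit()
    return conn


def store_batch(conn, topic, partition, batch):
    """Write (offset, payload) pairs and the new checkpoint atomically."""
    with conn:  # commits on success, rolls back if an exception escapes
        for offset, payload in batch:
            # The primary key rejects a record that was already loaded.
            conn.execute(
                "INSERT OR IGNORE INTO records VALUES (?, ?, ?, ?)",
                (topic, partition, offset, payload),
            )
        if batch:
            # Assumes batches arrive in offset order, so the last offset is the highest.
            conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (topic, partition, batch[-1][0] + 1),
            )


def resume_offset(conn, topic, partition):
    """Offset to restart from after a failure (0 if nothing was stored yet)."""
    row = conn.execute(
        "SELECT next_offset FROM checkpoints"
        " WHERE topic = ? AND partition_id = ?",
        (topic, partition),
    ).fetchone()
    return row[0] if row else 0
```

Either mechanism would be enough on its own; they are combined here only to keep the example short.
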
> > I think there are two improvements that would make this a lot easier:
> > 1. I think producer idempotence is something that could be done
> > automatically and much more cheaply by optionally integrating support for
> > this on the server.
> > 2. The existing high-level consumer doesn't expose much of the
> > finer-grained control of offsets (e.g. to reset your position). We will
> > be working on that soon.
> >
> > -Jay
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Feb 10, 2014 at 9:11 AM, Garry Turkington <
> > g.turking...@improvedigital.com> wrote:
> >
> > > Hi,
> > >
> > > I've been doing some prototyping on Kafka for a few months now and like
> > > what I see. It's a good fit for some of my use cases in the areas of
> > > data distribution but also for processing - liking a lot of what I see
> > > in Samza. I'm now working through some of the operational issues and
> > > have a question for the community.
> > >
> > > I have several data sources that I want to push into Kafka, but some of
> > > the most important are arriving as a stream of files being dropped
> > > either into an SFTP location or S3. Conceptually the data is really a
> > > stream, but it's being chunked and made more batch-like by the
> > > deployment model of the operational servers. So pulling the data into
> > > Kafka and seeing it more as a stream again is a big plus.
> > >
> > > But I really don't want duplicate messages. I know Kafka provides
> > > at-least-once semantics and that's fine; I'm happy to have the de-dupe
> > > logic external to Kafka. And if I look at my producer, I can build up a
> > > protocol around adding record metadata and using Zookeeper to give me
> > > pretty high confidence that my clients will know if they are reading
> > > from a file that was fully published into Kafka or not.
> > >
> > > I had assumed that this wouldn't be a unique use case, but on doing a
> > > bunch of searches I really don't find much in terms of either tools
> > > that help or even just best-practice patterns for handling this type of
> > > need to support exactly-once message processing.
> > >
> > > So now I'm thinking that either I just need better web search skills or
> > > that actually this isn't something many others are doing and if so then
> > > there's likely a reason for that.
> > >
> > > Any thoughts?
> > >
> > > Thanks
> > > Garry
> > >
> > >
> >
>
