Hi Philip,

If I can assume that all messages within a single partition are stored in
the same order they were sent, the state management needed to eliminate
duplicates is far simpler.

I am using Kafka as the infrastructure for a streaming map/reduce-style
solution, where throughput is critical.
Events are sent into topic A, which is partitioned by event id.
Consumers of topic A generate data that is sent to a different topic B,
which is partitioned by a persistence key.  Consumers of topic B save the
data to a partitioned store.  Each stage can be single-threaded per
partition, which results in zero contention on the partitioned data store
and massively improves throughput.
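
In case it helps to make that concrete, here is a minimal sketch (plain
Java, with invented names; not the actual application code) of the kind of
per-partition single-threaded dispatch that removes contention on the
store:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative only: one single-threaded executor per partition, so
    // all work for a given partition is serialised and the partitioned
    // store never sees concurrent writers for the same partition.
    public class PartitionDispatcher {
        private final ExecutorService[] executors;

        public PartitionDispatcher(int numPartitions) {
            executors = new ExecutorService[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                executors[i] = Executors.newSingleThreadExecutor();
            }
        }

        public void dispatch(int partition, Runnable work) {
            executors[partition].execute(work);
        }
    }
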
Message offsets are used end-to-end to eliminate duplicates, so the
application effectively achieves guaranteed once-only processing of
messages.  Currently, any out-of-order messages result in data being
dropped because duplicate tracking is based *only* on message offsets.  If
ordering within a partition is not guaranteed, I would need to maintain a
list of every message offset that has been processed, rather than just
tracking the latest offset per partition (and would need to persist that
list of offsets to allow resuming after failure).
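
To show the trade-off, a rough sketch (again plain Java with invented
names, not the real code) of the duplicate check when ordering holds, i.e.
a single high-water-mark offset per partition:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: with in-order delivery, deduplication needs just
    // the highest offset processed so far for each partition.
    public class OffsetDeduplicator {
        private final Map<Integer, Long> lastProcessed =
                new HashMap<Integer, Long>();

        // Returns true if this (partition, offset) has not been processed.
        public boolean isNew(int partition, long offset) {
            Long last = lastProcessed.get(partition);
            if (last != null && offset <= last) {
                return false;  // a redelivered duplicate
            }
            // In practice this value is persisted alongside the data so it
            // survives a restart.
            lastProcessed.put(partition, offset);
            return true;
        }
    }

Without guaranteed ordering, lastProcessed would have to become a
persisted set of every offset ever seen, which is exactly what I am trying
to avoid.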

The assumption of guaranteed order is essential for the throughput the
application achieves.

Thanks,
Ross



On 23 August 2013 14:36, Philip O'Toole <phi...@loggly.com> wrote:

> I am curious. What is it about your design that requires you track order
> so tightly? Maybe there is another way to meet your needs instead of
> relying on Kafka to do it.
>
> Philip
>
> On Aug 22, 2013, at 9:32 PM, Ross Black <ross.w.bl...@gmail.com> wrote:
>
> > Hi,
> >
> > I am using Kafka 0.7.1, and using the low-level SyncProducer to send
> > messages to a *single* partition from a *single* thread.
> > The client sends messages that contain sequential numbers so it is
> obvious
> > at the consumer when message order is shuffled.
> > I have noticed that messages can be saved out-of-order by Kafka when
> there
> > are connection problems, and am looking for possible solutions (I think I
> > already know the cause).
> >
> > The client sends messages in a retry loop so that it will wait for a
> short
> > period and then retry to send on any IO errors.  In SyncProducer, any
> > IOException triggers a disconnect.  Next time send is called a new
> > connection is established.  I believe that it is this
> disconnect/reconnect
> > cycle that can cause messages to be saved to the kafka log in a different
> > order to that of the client.
> >
> > I had previously had the same sort of issue with reconnect.interval/time,
> > which was fixed by disabling those reconnect settings.
> >
> http://mail-archives.apache.org/mod_mbox/kafka-users/201305.mbox/%3CCAM%2BbZhjssxmUhn_L%3Do0bGsD7PAXFGQHRpOKABcLz29vF3cNOzA%40mail.gmail.com%3E
> >
> > Is there anything in 0.7 that would allow me to solve this problem?  The
> > only option I can see at the moment is to not perform retries.
> >
> > Does 0.8 handle this issue any differently?
> >
> > Thanks,
> > Ross
>
