Hi Philip,

Thanks for your input.  I did evaluate Storm about 9 months ago before going
down the path of developing this myself on top of Kafka.
The primary reason for not using Storm was the inability to control the
allocation of requests to processing elements.  The same requirement was
the reason for using the low-level Kafka consumer and producer rather than
the higher-level Kafka APIs (something I hope will be possible with the
redesigned APIs -
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design).
As Jay mentioned, using Storm would not fix the out-of-order delivery issue.
I will probably eventually couple Storm to our Kafka messaging, but will
need to fix https://github.com/nathanmarz/storm/issues/115 before I can
use it.
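
For reference, this is roughly what the low-level consumer usage looks
like: each processing thread is pinned to one broker/partition.  A minimal
sketch from memory of the 0.7 javaapi (broker, topic, and partition values
are illustrative):

    import java.nio.ByteBuffer;

    import kafka.api.FetchRequest;
    import kafka.javaapi.consumer.SimpleConsumer;
    import kafka.javaapi.message.ByteBufferMessageSet;
    import kafka.message.MessageAndOffset;

    public class PinnedPartitionConsumer {
        public static void main(String[] args) {
            // Pin this thread to partition 3 of topicA on one broker.
            SimpleConsumer consumer =
                new SimpleConsumer("broker1", 9092, 30000, 64 * 1024);
            long offset = 0L;  // in practice, reloaded from our own store
            while (true) {
                FetchRequest request =
                    new FetchRequest("topicA", 3, offset, 1024 * 1024);
                ByteBufferMessageSet messages = consumer.fetch(request);
                for (MessageAndOffset mo : messages) {
                    ByteBuffer payload = mo.message().payload();
                    // ... process payload, single-threaded per partition ...
                    offset = mo.offset();  // 0.7: offset for the next fetch
                }
            }
        }
    }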

I am also about to look at Samza to see if it can help me avoid having to
write more code :-)

Thanks,
Ross



On 24 August 2013 00:34, Philip O'Toole <phi...@loggly.com> wrote:

> Ross -- thanks.
>
> How much code are you writing to do all this, post-Kafka? Have you
> considered Storm? I believe the Trident topologies can give you
> guaranteed-once semantics, so you may be interested in checking that
> out, if you have the time (I have not yet played with Trident stuff
> myself, but Storm in general, yes). Coupling Storm to Kafka is a very
> popular thing to do. Even without Trident, just using Storm in a
> simpler mode may save you from writing a ton of code.
>
> Philip
>
> On Thu, Aug 22, 2013 at 11:59 PM, Ross Black <ross.w.bl...@gmail.com>
> wrote:
> > Hi Philip,
> >
> > If I can assume that all messages within a single partition are ordered
> > the same as delivery order, the state management to eliminate duplicates
> > is far simpler.
> >
> > I am using Kafka as the infrastructure for a streaming map/reduce style
> > solution, where throughput is critical.
> > Events are sent into topic A, which is partitioned based on event id.
> > Consumers of topic A generate data that is sent to a different topic B,
> > which is partitioned by a persistence key.  Consumers of topic B save the
> > data to a partitioned store.  Each stage can be single-threaded per
> > partition, which results in zero contention on the partitioned data store
> > and massively improves throughput.
> > Message offsets are used end-to-end to eliminate duplicates, so the
> > application effectively achieves guaranteed once-only processing of
> > messages.  Currently, any out-of-order messages result in data being
> > dropped because duplicate tracking is based *only* on message offsets.
> > If ordering within a partition is not guaranteed, I would need to
> > maintain a list of message offsets that have been processed, rather than
> > just tracking the latest message offset for a partition (and would need
> > to persist this list of offsets to allow resuming after failure).
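> >
> > For illustration, the per-partition duplicate check is essentially the
> > following (a simplified sketch; OffsetDeduper is a made-up name, and the
> > comments about persistence stand in for our real store code):
> >
> >     // One instance per partition; the stage is already single-threaded
> >     // per partition, so no synchronization is needed.
> >     class OffsetDeduper {
> >         private long latestOffset;  // reloaded from the store on restart
> >
> >         boolean shouldProcess(long offset) {
> >             // Relies on in-partition ordering: an offset at or below the
> >             // high-water mark must be a redelivery, so drop it.
> >             if (offset <= latestOffset) {
> >                 return false;
> >             }
> >             latestOffset = offset;  // persisted atomically with the data
> >             return true;
> >         }
> >     }
> >
> > Without the ordering guarantee, that single long would have to become a
> > persisted set of every processed offset.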
> >
> > The assumption of guaranteed order is essential for the throughput the
> > application achieves.
> >
> > Thanks,
> > Ross
> >
> >
> >
> > On 23 August 2013 14:36, Philip O'Toole <phi...@loggly.com> wrote:
> >
> >> I am curious. What is it about your design that requires you to track
> >> order so tightly? Maybe there is another way to meet your needs instead
> >> of relying on Kafka to do it.
> >>
> >> Philip
> >>
> >> On Aug 22, 2013, at 9:32 PM, Ross Black <ross.w.bl...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I am using Kafka 0.7.1, and using the low-level SyncProducer to send
> >> > messages to a *single* partition from a *single* thread.
> >> > The client sends messages that contain sequential numbers, so it is
> >> > obvious at the consumer when message order is shuffled.
> >> > I have noticed that messages can be saved out-of-order by Kafka when
> >> > there are connection problems, and am looking for possible solutions
> >> > (I think I already know the cause).
> >> >
> >> > The client sends messages in a retry loop, so that on any IO error it
> >> > waits for a short period and then retries the send.  In SyncProducer,
> >> > any IOException triggers a disconnect.  The next time send is called, a
> >> > new connection is established.  I believe it is this
> >> > disconnect/reconnect cycle that can cause messages to be saved to the
> >> > Kafka log in a different order from the one in which the client sent
> >> > them.
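> >> >
> >> > The retry loop is essentially the following (paraphrased, not the
> >> > exact code; sendWithRetry and retryDelayMs are illustrative names):
> >> >
> >> >     // Resend the same batch until SyncProducer accepts it.  On any
> >> >     // IOException, SyncProducer disconnects, and the next send()
> >> >     // transparently reconnects.
> >> >     static void sendWithRetry(kafka.javaapi.producer.SyncProducer producer,
> >> >                               kafka.javaapi.message.ByteBufferMessageSet batch)
> >> >             throws InterruptedException {
> >> >         final long retryDelayMs = 100;
> >> >         while (true) {
> >> >             try {
> >> >                 producer.send("topic", 0, batch);  // single partition
> >> >                 return;
> >> >             } catch (Exception e) {  // socket IOExceptions land here
> >> >                 // My theory: bytes still buffered on the old connection
> >> >                 // can reach the log *after* the batch resent on the new
> >> >                 // connection, shuffling the order.
> >> >                 Thread.sleep(retryDelayMs);
> >> >             }
> >> >         }
> >> >     }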
> >> >
> >> > I had previously had the same sort of issue with
> >> > reconnect.interval/time, which was fixed by disabling those reconnect
> >> > settings:
> >> >
> >> > http://mail-archives.apache.org/mod_mbox/kafka-users/201305.mbox/%3CCAM%2BbZhjssxmUhn_L%3Do0bGsD7PAXFGQHRpOKABcLz29vF3cNOzA%40mail.gmail.com%3E
> >> >
> >> > Is there anything in 0.7 that would allow me to solve this problem?
> >> > The only option I can see at the moment is to not perform retries.
> >> >
> >> > Does 0.8 handle this issue any differently?
> >> >
> >> > Thanks,
> >> > Ross
> >>
>
