A few more details for those following this: On Sat, Jun 8, 2013 at 9:09 PM, Alexis Richardson <alexis.richard...@gmail.com> wrote: > Jonathan > > I am aware of the difference between sequential writes and other kinds > of writes ;p) > > AFAIK the Kafka docs describe a sort of platonic alternative system, > eg "normally people do this.. Kafka does that..". This is a good way > to explain design decisions. However, I think you may be assuming > that Rabbit is a lot like the generalised other system. But it is not > - eg Rabbit does not do lots of random IO. I'm led to understand that > Rabbit's msg store is closer to log structured storage (a la > Log-Structured Merge Trees) in some ways. ... >> >> That would be awesome if you can confirm what Rabbit is using as a >> persistent data structure.
See extensive comments in here: http://hg.rabbitmq.com/rabbitmq-server/file/bc2fda987fe8/src/rabbit_msg_store.erl >> More importantly, whether it is BTree or >> something else, is the disk i/o random or linear? .. >> This is only speaking of the use case of high throughput with persisting >> large amounts of data to disk, where the difference is four orders of >> magnitude, not just 10x. It all comes down to random vs sequential >> writes/reads to disk as I mentioned above. It's not a btree with random writes, hence my puzzlement earlier.
* there are mostly linear writes in a file
* multiple files are involved, moved around, garbage collected, compacted, etc, which is obviously not all linear.
This will behave better than a btree for the purpose it was built for. This is just for writes. Reads may be a different story - and I don't fully understand how reads work in Kafka. A memory-mapped circular buffer will definitely outperform this... mmap support for erlang would be nice ;p) >> >> On Sat, Jun 8, 2013 at 2:07 AM, Alexis Richardson < >> alexis.richard...@gmail.com> wrote: >> >>> Jonathan >>> >>> On Sat, Jun 8, 2013 at 2:09 AM, Jonathan Hodges <hodg...@gmail.com> wrote: >>> > Thanks so much for your replies. This has been a great help in >>> understanding >>> > Rabbit better, as I have very little experience with it. I have a few >>> > follow-up comments below. >>> >>> Happy to help! >>> >>> I'm afraid I don't follow your arguments below. Rabbit contains many >>> optimisations too. I'm told that it is possible to saturate the disk >>> i/o, and you saw the message rates I quoted in the previous email. >>> YES of course there are differences, mostly an accumulation of things. >>> For example Rabbit spends more time doing work before it writes to >>> disk. >>> >>> You said: >>> >>> "Since Rabbit must maintain the state of the >>> consumers I imagine it’s subjected to random data access patterns on disk >>> as opposed to sequential."
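For readers who don't want to wade through the Erlang, here is a toy sketch of the segmented, mostly-append layout described above: sequential writes to the tail of the current segment, rollover to new files, and an in-memory index for reads. The class and file names (SegmentedStore, `%08d.seg`) are invented for illustration; the real rabbit_msg_store is far more involved, and compaction/GC of old segments is omitted entirely.

```python
import os
import tempfile

class SegmentedStore:
    """Toy append-only store: writes go to the tail of the current
    segment; full segments are rolled over and become candidates for
    GC/compaction, so the write path stays mostly sequential."""

    def __init__(self, directory, segment_bytes=1024):
        self.dir, self.limit, self.seg_no = directory, segment_bytes, 0
        self.current = open(self._path(0), "ab")
        self.index = {}  # msg_id -> (segment, offset, length), kept in memory

    def _path(self, n):
        return os.path.join(self.dir, "%08d.seg" % n)

    def append(self, msg_id, payload):
        if self.current.tell() >= self.limit:      # roll to a fresh segment
            self.current.close()
            self.seg_no += 1
            self.current = open(self._path(self.seg_no), "ab")
        offset = self.current.tell()
        self.current.write(payload)                # sequential write
        self.current.flush()
        self.index[msg_id] = (self.seg_no, offset, len(payload))

    def read(self, msg_id):
        seg, off, length = self.index[msg_id]      # reads may seek: not linear
        with open(self._path(seg), "rb") as f:
            f.seek(off)
            return f.read(length)

store = SegmentedStore(tempfile.mkdtemp(), segment_bytes=10)
store.append("m1", b"hello")
store.append("m2", b"world!")   # still fits in segment 0
store.append("m3", b"again")    # segment 0 is now full, rolls to segment 1
print(store.read("m1"), store.read("m3"))
```

Note how this matches the "not all linear" caveat above: appends are sequential per segment, while reads and any later compaction are where the seeks show up.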
>>> >>> I don't follow the logic here, sorry. >>> >>> A couple of side comments: >>> >>> * In your Hadoop vs RT example, Rabbit would deliver the RT messages >>> immediately and write the rest to disk. It can do this at high rates >>> - I shall try to get you some useful data here. >>> >>> * Bear in mind that write speed should be orthogonal to read speed. >>> Ask yourself - how would Kafka provide a read cache, and when might >>> that be useful? >>> >>> * I'll find out what data structure Rabbit uses for long-term persistence. >>> >>> >>> "Quoting the Kafka design page ( >>> http://kafka.apache.org/07/design.html) performance of sequential writes >>> on >>> a 6 7200rpm SATA RAID-5 array is about 300MB/sec but the performance of >>> random writes is only about 50 KB/sec—a difference of nearly 10000X." >>> >>> Depending on your use case, I'd expect 2x-10x overall throughput >>> differences, and will try to find out more info. As I said, Rabbit >>> can saturate disk i/o. >>> >>> alexis >>> >>> >>> >>> >>> > >>> >> While you are correct that the payload is a much bigger concern, managing the >>> >> metadata and acks centrally on the broker across multiple clients at >>> scale >>> >> is also a concern. This would seem to be exacerbated if you have >>> > consumers >>> >> at different speeds, e.g. Storm and Hadoop consuming the same topic. >>> >> >>> >> In that scenario, say Storm consumes the topic messages in real-time and >>> >> Hadoop consumes once a day. Let’s assume the topic carries 100k+ >>> >> messages/sec so that in a given day you might have 100s of GBs >>> of >>> >> data flowing through the topic. >>> >> >>> >> To allow Hadoop to consume once a day, Rabbit obviously can’t keep 100s >>> > of GBs >>> >> in memory and will need to persist this data to its internal DB to be >>> >> retrieved later. >>> > >>> > I am not sure why you think this is a problem?
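Back-of-envelope arithmetic for the 100k+ messages/sec scenario above (the 50-byte average payload is an assumed figure, not something stated in the thread):

```python
msgs_per_sec = 100_000
avg_payload_bytes = 50            # assumption; small activity-stream events
seconds_per_day = 24 * 60 * 60

msgs_per_day = msgs_per_sec * seconds_per_day
bytes_per_day = msgs_per_day * avg_payload_bytes

print(msgs_per_day)               # 8.64 billion messages per day
print(bytes_per_day / 10**9)      # 432.0 GB/day, before any broker overhead
```

So even with tiny payloads, a day of backlog for one slow consumer is hundreds of gigabytes, which is why the disk-access pattern dominates this discussion.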
>>> > >>> > For a fixed number of producers and consumers, the pubsub and delivery >>> > semantics of Rabbit and Kafka are quite similar. Think of Rabbit as >>> > adding an in-memory cache that is used to (a) speed up read >>> > consumption, (b) obviate disk writes when possible due to all client >>> > consumers being available and consuming. >>> > >>> > >>> > Actually I think this is the main use case that sets Kafka apart from >>> > Rabbit and speaks to the poster’s ‘Arguments for Kafka over RabbitMQ’ >>> > question. As you mentioned, Rabbit is a general-purpose messaging system >>> > and along with that has a lot of features not found in Kafka. There are >>> > plenty of times when Rabbit makes more sense than Kafka, but not when you >>> > are maintaining large message stores and require high throughput to disk. >>> > >>> > Persisting 100s of GBs of messages to disk is a very different problem from >>> > managing messages in memory. Since Rabbit must maintain the state of the >>> > consumers, I imagine it’s subjected to random data access patterns on disk >>> > as opposed to sequential. Quoting the Kafka design page ( >>> > http://kafka.apache.org/07/design.html) performance of sequential >>> writes on >>> > a 6 7200rpm SATA RAID-5 array is about 300MB/sec but the performance of >>> > random writes is only about 50 KB/sec—a difference of nearly 10000X. >>> > >>> > They go on to say that the persistent data structure used for metadata in >>> > messaging systems is often a BTree. BTrees are the most versatile data structure >>> > available, and make it possible to support a wide variety of >>> transactional >>> > and non-transactional semantics in the messaging system. They do come >>> with >>> > a fairly high cost, though: BTree operations are O(log N). Normally O(log >>> > N) is considered essentially equivalent to constant time, but this is not >>> > true for disk operations.
Disk seeks come at 10 ms a pop, and each disk >>> can >>> > do only one seek at a time, so parallelism is limited. Hence even a >>> handful >>> > of disk seeks leads to very high overhead. Since storage systems mix very >>> > fast cached operations with actual physical disk operations, the observed >>> > performance of tree structures is often superlinear. Furthermore BTrees >>> > require a very sophisticated page or row locking implementation to avoid >>> > locking the entire tree on each operation. The implementation must pay a >>> > fairly high price for row-locking or else effectively serialize all >>> reads. >>> > Because of the heavy reliance on disk seeks it is not possible to >>> > effectively take advantage of the improvements in drive density, and one >>> is >>> > forced to use small (< 100GB) high-RPM SAS drives to maintain a sane >>> ratio >>> > of data to seek capacity. >>> > >>> > Intuitively, a persistent queue could be built on simple reads and appends >>> > to files, as is commonly the case with logging solutions. Though this >>> > structure would not support the rich semantics of a BTree implementation, >>> > it has the advantage that all operations are O(1) and reads do not >>> > block writes or each other. This has obvious performance advantages since >>> > the performance is completely decoupled from the data size--one server >>> can >>> > now take full advantage of a number of cheap, low-rotational-speed 1+TB >>> > SATA drives. Though they have poor seek performance, these drives often >>> > have comparable performance for large reads and writes at 1/3 the price >>> and >>> > 3x the capacity. >>> > >>> > Having access to virtually unlimited disk space without penalty means >>> that >>> > we can provide some features not usually found in a messaging system. For >>> > example, in Kafka, instead of deleting a message immediately after >>> > consumption, we can retain messages for a relatively long period (say a >>> week).
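The "retain for a week" point is cheap precisely because of the log structure: retention can be enforced by unlinking whole expired segment files rather than deleting individual consumed messages. A minimal sketch (the names and the map of file name to creation time are assumptions for illustration, not Kafka's actual bookkeeping):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600   # the "say a week" window from above

def expired_segments(segments, now=None):
    """Return the segment files older than the retention window.
    segments maps file name -> creation time (unix seconds)."""
    now = time.time() if now is None else now
    return [name for name, created in segments.items()
            if now - created > RETENTION_SECONDS]

segments = {"00000000.seg": 0,             # created at the epoch: long expired
            "00000001.seg": time.time()}   # created just now: retained
print(expired_segments(segments))          # -> ['00000000.seg']
```

Dropping a whole file is one metadata operation regardless of how many messages it held, which is why retention does not fight with throughput here.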
>>> > >>> > Our assumption is that the volume of messages is extremely high, indeed >>> it >>> > is some multiple of the total number of page views for the site (since a >>> > page view is one of the activities we process). Furthermore we assume >>> each >>> > message published is read at least once (and often multiple times), hence >>> > we optimize for consumption rather than production. >>> > >>> > There are two common causes of inefficiency: too many network requests, >>> and >>> > excessive byte copying. >>> > >>> > To encourage efficiency, the APIs are built around a "message set" >>> > abstraction that naturally groups messages. This allows network requests >>> to >>> > group messages together and amortize the overhead of the network >>> roundtrip >>> > rather than sending a single message at a time. >>> > >>> > The MessageSet implementation is itself a very thin API that wraps a byte >>> > array or file. Hence there is no separate serialization or >>> deserialization >>> > step required for message processing; message fields are lazily >>> > deserialized as needed (or not deserialized if not needed). >>> > >>> > The message log maintained by the broker is itself just a directory of >>> > message sets that have been written to disk. This abstraction allows a >>> > single byte format to be shared by both the broker and the consumer (and >>> to >>> > some degree the producer, though producer messages are checksummed and >>> > validated before being added to the log). >>> > >>> > Maintaining this common format allows optimization of the most important >>> > operation: network transfer of persistent log chunks. Modern Unix >>> operating >>> > systems offer a highly optimized code path for transferring data out of >>> > pagecache to a socket; in Linux this is done with the sendfile system >>> call. >>> > Java provides access to this system call with the FileChannel.transferTo >>> > API.
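The framing idea behind the "message set" abstraction above can be sketched in a few lines. Kafka's real on-disk/wire format also carries offsets and checksums; this shows only the length-prefixed batching that lets a whole batch be stored and shipped as one blob with no per-message (de)serialization step:

```python
import struct

def pack_message_set(payloads):
    # Each message framed as [4-byte big-endian length][payload]; the
    # whole batch is one contiguous blob that can be written, stored,
    # and shipped without touching the individual messages.
    return b"".join(struct.pack(">I", len(p)) + p for p in payloads)

def iter_message_set(buf):
    # Walk the frames lazily; a payload is only sliced out when the
    # consumer actually asks for it.
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        yield buf[pos + 4 : pos + 4 + length]
        pos += 4 + length

batch = pack_message_set([b"click", b"view", b"purchase"])
print(list(iter_message_set(batch)))   # -> [b'click', b'view', b'purchase']
```

Because the same bytes serve as the storage format and the wire format, the broker can hand a log chunk straight to the network, which is what the sendfile discussion below exploits.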
>>> > >>> > To understand the impact of sendfile, it is important to understand the >>> > common data path for transfer of data from file to socket: >>> >
>>> > 1. The operating system reads data from the disk into pagecache in kernel space
>>> > 2. The application reads the data from kernel space into a user-space buffer
>>> > 3. The application writes the data back into kernel space into a socket buffer
>>> > 4. The operating system copies the data from the socket buffer to the NIC buffer, where it is sent over the network
>>> > >>> > This is clearly inefficient: there are four copies and two system calls. Using >>> > sendfile, this re-copying is avoided by allowing the OS to send the data >>> > from pagecache to the network directly. So in this optimized path, only >>> the >>> > final copy to the NIC buffer is needed. >>> > >>> > We expect a common use case to be multiple consumers on a topic. Using >>> the >>> > zero-copy optimization above, data is copied into pagecache exactly once >>> > and reused on each consumption instead of being stored in memory and >>> copied >>> > out to kernel space every time it is read. This allows messages to be >>> > consumed at a rate that approaches the limit of the network connection. >>> > >>> > >>> > So in the end it would seem Kafka’s specialized write-first design >>> > really shines over Rabbit when your use case requires a very >>> > high-throughput, non-blocking firehose with large data persistence to disk. >>> Since >>> > this is only one use case, this is by no means saying Kafka is better than >>> > Rabbit or vice versa. I think it is awesome there are more options to >>> > choose from so you can pick the right tool for the job. Thanks, open >>> source! >>> > >>> > As always, YMMV.
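The zero-copy path described above is reachable from Python as well: socket.sendfile() uses the same sendfile(2) syscall where the platform supports it (FileChannel.transferTo is the Java route mentioned earlier). A minimal broker-to-consumer sketch, with a socketpair standing in for a real network connection:

```python
import os
import socket
import tempfile

# Write a small "log segment" to disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"msg-1\nmsg-2\nmsg-3\n")
    path = f.name

# A connected socket pair stands in for broker -> consumer.
server, client = socket.socketpair()

with open(path, "rb") as segment:
    # socket.sendfile() delegates to os.sendfile (the sendfile(2)
    # syscall) where available, so the bytes move from pagecache to
    # the socket without a round trip through a user-space buffer.
    sent = server.sendfile(segment)

server.close()
data = client.recv(1024)
client.close()
os.unlink(path)
print(sent, data)
```

On platforms without sendfile(2), socket.sendfile() silently falls back to a plain read/send loop, so the code is portable even though the zero-copy benefit is not.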
>>> > >>> > >>> > On Fri, Jun 7, 2013 at 4:40 PM, Alexis Richardson < >>> > alexis.richard...@gmail.com> wrote: >>> > >>> >> Jonathan, >>> >> >>> >> >>> >> On Fri, Jun 7, 2013 at 7:03 PM, Jonathan Hodges <hodg...@gmail.com> >>> wrote: >>> >> > Hi Alexis, >>> >> > >>> >> > I appreciate your reply and clarifications to my misconception about >>> >> > Rabbit, particularly on the copying of the message payloads per >>> consumer. >>> >> >>> >> Thank-you! >>> >> >>> >> >>> >> > It sounds like it only copies metadata like the consumer state, i.e. >>> >> > position in the topic messages. >>> >> >>> >> Basically yes. Of course when a message is delivered to N>1 >>> >> *machines*, then there will be N copies, one per machine. >>> >> >>> >> Also, for various reasons, very tiny (<60b) messages do get copied as >>> >> you'd assumed. >>> >> >>> >> >>> >> > I don’t have experience with Rabbit and >>> >> > was basing this assumption on Google searches like the >>> following - >>> >> > >>> >> >>> http://ilearnstack.com/2013/04/16/introduction-to-amqp-messaging-with-rabbitmq/ >>> >> . >>> >> > It seems to indicate that with topic exchanges the messages get >>> copied >>> >> to >>> >> > a queue per consumer, but I am glad you confirmed it is just the >>> >> metadata. >>> >> >>> >> Yup. >>> >> >>> >> That's a fairly decent article but even the good stuff uses words like >>> >> "copy" without a fixed denotation. Don't believe the internets! >>> >> >>> >> >>> >> > While you are correct that the payload is a much bigger concern, managing >>> the >>> >> > metadata and acks centrally on the broker across multiple clients at >>> >> scale >>> >> > is also a concern. This would seem to be exacerbated if you have >>> >> consumers >>> >> > at different speeds, e.g. Storm and Hadoop consuming the same topic. >>> >> > >>> >> > In that scenario, say Storm consumes the topic messages in real-time >>> and >>> >> > Hadoop consumes once a day.
Let’s assume the topic carries 100k+ >>> >> > messages/sec so that in a given day you might have 100s >>> of GBs of >>> >> > data flowing through the topic. >>> >> > >>> >> > To allow Hadoop to consume once a day, Rabbit obviously can’t keep >>> 100s >>> >> of GBs >>> >> > in memory and will need to persist this data to its internal DB to be >>> >> > retrieved later. >>> >> >>> >> I am not sure why you think this is a problem? >>> >> >>> >> For a fixed number of producers and consumers, the pubsub and delivery >>> >> semantics of Rabbit and Kafka are quite similar. Think of Rabbit as >>> >> adding an in-memory cache that is used to (a) speed up read >>> >> consumption, (b) obviate disk writes when possible due to all client >>> >> consumers being available and consuming. >>> >> >>> >> >>> >> > I believe large amounts of data needing to be persisted >>> >> > is the scenario described in the earlier-posted Kafka paper ( >>> >> > >>> >> >>> http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf >>> >> ) >>> >> > where Rabbit’s performance really starts to bog down as compared to >>> >> Kafka. >>> >> >>> >> Not sure what parts of the paper you mean? >>> >> >>> >> I read that paper when it came out. I found it strongest when >>> >> describing Kafka's design philosophy. I found the performance >>> >> statements made about Rabbit pretty hard to understand. This is not >>> >> meant to be a criticism of the authors! I have seen very few >>> >> performance papers about messaging that I would base decisions on. >>> >> >>> >> >>> >> > This Kafka paper looks to be a few years old >>> >> >>> >> Um.... Lots can change in technology very quickly :-) >>> >> >>> >> E.g.: At the time this paper was published, Instagram had 5m users. >>> >> Six months earlier, in Dec 2010, it had 1m. Since then it grew huge >>> >> and got acquired.
>>> >> >>> >> >>> >> >>> >> > so has something changed >>> >> > within the Rabbit architecture to alleviate this issue when large >>> amounts >>> >> > of data are persisted to the internal DB? >>> >> >>> >> Rabbit introduced a new internal flow control system which impacted >>> >> performance under steady load. This may be relevant? I couldn't say >>> >> from reading the paper. >>> >> >>> >> I don't have a good reference for this to hand, but here is a post >>> >> about external flow control that you may find amusing: >>> >> >>> >> >>> http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/ >>> >> >>> >> >>> >> > Do the producer and consumer >>> >> > numbers look correct? If no, maybe you can share some Rabbit >>> benchmarks >>> >> > under this scenario, because I believe it is the main area where Kafka >>> >> > appears to be the superior solution. >>> >> >>> >> This is from about one year ago: >>> >> >>> >> >>> http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/ >>> >> >>> >> Obviously none of this uses batching, which is an easy trick for >>> >> increasing throughput. >>> >> >>> >> YMMV. >>> >> >>> >> Is this helping? >>> >> >>> >> alexis >>> >> >>> >> >>> >> >>> >> > Thanks for educating me on these matters. >>> >> > >>> >> > -Jonathan >>> >> > >>> >> > >>> >> > >>> >> > On Fri, Jun 7, 2013 at 6:54 AM, Alexis Richardson < >>> ale...@rabbitmq.com >>> >> >wrote: >>> >> > >>> >> >> Hi >>> >> >> >>> >> >> Alexis from Rabbit here. I hope I am not intruding! >>> >> >> >>> >> >> It would be super helpful if people with questions, observations or >>> >> >> moans posted them to the rabbitmq list too :-) >>> >> >> >>> >> >> A few comments: >>> >> >> >>> >> >> * Along with ZeroMQ, I consider Kafka to be one of the interesting >>> and >>> >> >> useful messaging projects out there. In a world of cruft, Kafka is >>> >> >> cool! 
>>> >> >>> >> * This is because both projects come at messaging from a specific >>> >> point of view that is *different* from Rabbit. OTOH, many other >>> >> projects exist that replicate Rabbit features for fun, or NIH, or due >>> >> to misunderstanding the semantics (yes, our docs could be better) >>> >> >>> >> * It is striking how few people describe those differences. In a >>> >> nutshell they are as follows: >>> >> >>> >> *** Kafka writes all incoming data to disk immediately, and then >>> >> figures out who sees what. So it is much more like a database than >>> >> Rabbit, in that new consumers can appear well after the disk write >>> and >>> >> still subscribe to past messages. Rabbit, instead, tries to >>> >> deliver to consumers immediately and buffers otherwise. Persistence is optional >>> >> but robust and a feature of the buffer ("queue"), not the upstream >>> >> machinery. Rabbit is able to cache-on-arrival via a plugin, but this >>> >> is a design overlay and not particularly optimal. >>> >> >>> >> *** Kafka is a client-server system with end-to-end semantics. It >>> >> defines order to include processing order, and keeps state on the >>> >> client to do this. Group management is via a 3rd-party service >>> >> (Zookeeper? I forget which). Rabbit is a server-only, protocol-based >>> >> system which maintains order on the server and through completely >>> >> language-neutral protocol semantics. This makes Rabbit perhaps more >>> >> natural as a 'messaging service', eg for integration and other >>> >> inter-app data transfer. >>> >> >>> >> *** Rabbit is a general-purpose messaging system with extras like >>> >> federation. It speaks many protocols, and has core features like HA, >>> >> transactions, management, etc. Everything can be switched on or off. >>> >> Getting all this to work while keeping the install light and fast is >>> >> quite fiddly.
Kafka by contrast comes from a specific set of use >>> >> >> cases, which are interesting certainly. I am not sure if Kafka wants >>> >> >> to be a general purpose messaging system, but it will become a bit >>> >> >> more like Rabbit if that is the goal. >>> >> >> >>> >> >> *** Both approaches have costs. In the case of Rabbit the cost is >>> >> >> that more metadata is stored on the broker. Kafka can get >>> performance >>> >> >> gains by storing less such data. But we are talking about some N >>> >> >> thousands of MPS versus some M thousands. At those speeds the >>> clients >>> >> >> are usually the bottleneck anyway. >>> >> >> >>> >> >> * Let me also clarify some things: >>> >> >> >>> >> >> *** Rabbit does NOT store multiple copies of the same message across >>> >> >> queues, unless they are very small (<60b, iirc). A message delivered >>> >> >> to >1 queue on 1 machine is stored once. Metadata about that message >>> >> >> may be stored more than once, but, at scale, the big cost is the >>> >> >> payload. >>> >> >> >>> >> >> *** Rabbit's vanilla install does store some index data in memory >>> when >>> >> >> messages flow to disk. You can change this by using a plugin, but >>> >> >> this is a secret-menu undocumented feature. Very very few people >>> need >>> >> >> any such thing. >>> >> >> >>> >> >> *** A Rabbit queue is lightweight. It's just an ordered consumption >>> >> >> buffer that can persist and ack. Don't assume things about Rabbit >>> >> >> queues based on what you know about IBM MQ, JMS, and so forth. >>> Queues >>> >> >> in Rabbit and Kafka are not the same. >>> >> >> >>> >> >> *** Rabbit does not use mnesia for message storage. It has its own >>> >> >> DB, optimised for messaging. You can use other DBs but this is >>> >> >> Complicated. >>> >> >> >>> >> >> *** Rabbit does all kinds of batching and bulk processing, and can >>> >> >> batch end to end. 
If you see claims about batching, buffering, etc., >>> >> >> find out ALL the details before drawing conclusions. >>> >> >> >>> >> >> I hope this is helpful. >>> >> >> >>> >> >> Keen to get feedback / questions / corrections. >>> >> >> >>> >> >> alexis >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> On Fri, Jun 7, 2013 at 2:09 AM, Marc Labbe <mrla...@gmail.com> >>> wrote: >>> >> >> > We also went through the same decision making and our arguments for >>> Kafka >>> >> >> > were along the same lines as those Jonathan mentioned. The fact that >>> we >>> >> >> have >>> >> >> > heterogeneous consumers is really a deciding factor. Our >>> requirements >>> >> >> were >>> >> >> > to avoid losing messages at all costs while having multiple >>> consumers >>> >> >> > reading the same data at a different pace. On one side, we have a >>> few >>> >> >> > consumers being fed with data coming in from most, if not all, >>> >> topics. On >>> >> >> > the other side, we have a good bunch of consumers reading only >>> from a >>> >> >> > single topic. The big guys can take their time to read while the >>> >> smaller >>> >> >> > ones are mostly for near real-time events, so they need to keep up with the >>> >> >> pace >>> >> >> > of incoming messages. >>> >> >> > >>> >> >> > RabbitMQ stores data on disk only if you tell it to, while Kafka >>> >> persists >>> >> >> by >>> >> >> > design. From the beginning, we decided we would try to use the >>> queues >>> >> the >>> >> >> > same way, pub/sub with a routing key (an exchange in RabbitMQ) or >>> >> topic, >>> >> >> > persisted to disk and replicated. >>> >> >> > >>> >> >> > One of our scenarios was to see how the system would cope with the >>> >> largest >>> >> >> > consumer down for a while, therefore forcing the brokers to keep >>> the >>> >> data >>> >> >> > for a long period. In the case of RabbitMQ, this consumer has its own >>> >> >> queue >>> >> >> > and data grows on disk, which is not really a problem if you plan >>> >> >> > accordingly. But, since it has to keep track of all messages read, >>> >> the >>> >> >> > Mnesia database used by RabbitMQ as the message index also grows >>> >> pretty >>> >> >> > big. At that point, the amount of RAM necessary becomes very large >>> to >>> >> >> keep >>> >> >> > the level of performance we need. In our tests, we found that this had an >>> >> >> > adverse effect on ALL the brokers, thus affecting all consumers. >>> You >>> >> can >>> >> >> > always say that you'll monitor the consumers to make sure it won't >>> >> >> happen. >>> >> >> > That's a good thing if you can. I wasn't ready to make that bet. >>> >> >> > >>> >> >> > Another point is the fact that, since we wanted to use pub/sub >>> with an >>> >> >> > exchange in RabbitMQ, we would have ended up with a lot of data >>> >> duplication >>> >> >> > because if a message is read by multiple consumers, it will get >>> >> >> duplicated >>> >> >> > in the queue of each of those consumers. Kafka wins on that side too, >>> >> since >>> >> >> > every consumer reads from the same source. >>> >> >> > >>> >> >> > The downsides of Kafka were the language issues (we are using >>> mostly >>> >> >> Python >>> >> >> > and C#). 0.8 is very new and few drivers are available at this >>> point. >>> >> >> Also, >>> >> >> > we will have to try getting as close as possible to a >>> once-and-only-once >>> >> >> > guarantee. Those are two things where RabbitMQ would have given us >>> >> less >>> >> >> > work out of the box than Kafka. RabbitMQ also provides a >>> >> bunch >>> >> >> of >>> >> >> > tools that make it rather attractive too.
>>> >> >> > >>> >> >> > In the end, looking at throughput is a pretty nifty thing but being >>> >> sure >>> >> >> > that I'll be able to manage the beast as it grows will allow me to >>> >> get to >>> >> >> > sleep way more easily. >>> >> >> > >>> >> >> > >>> >> >> > On Thu, Jun 6, 2013 at 3:28 PM, Jonathan Hodges <hodg...@gmail.com >>> > >>> >> >> wrote: >>> >> >> > >>> >> >> >> We just went through a similar exercise with RabbitMQ at our >>> company >>> >> >> with >>> >> >> >> streaming activity data from our various web properties. Our use >>> >> case >>> >> >> >> requires consumption of this stream by many heterogeneous >>> consumers >>> >> >> >> including batch (Hadoop) and real-time (Storm). We pointed out >>> that >>> >> >> Kafka >>> >> >> >> acts as a configurable rolling window of time on the activity >>> stream. >>> >> >> The >>> >> >> >> window default is 7 days which allows for supporting clients of >>> >> >> different >>> >> >> >> latencies like Hadoop and Storm to read from the same stream. >>> >> >> >> >>> >> >> >> We pointed out that the Kafka brokers don't need to maintain >>> consumer >>> >> >> state >>> >> >> >> in the stream and only have to maintain one copy of the stream to >>> >> >> support N >>> >> >> >> number of consumers. Rabbit brokers on the other hand have to >>> >> maintain >>> >> >> the >>> >> >> >> state of each consumer as well as create a copy of the stream for >>> >> each >>> >> >> >> consumer. In our scenario we have 10-20 consumers and with the >>> scale >>> >> >> and >>> >> >> >> throughput of the activity stream we were able to show Rabbit >>> quickly >>> >> >> >> becomes the bottleneck under load. 
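The broker-state difference described above can be sketched as one shared log, with each consumer holding nothing but its own integer offset, so a slow Hadoop-style reader and a fast Storm-style reader cost the broker the same single copy of the data. The class names are invented for illustration; real Kafka consumers also commit their offsets (to Zookeeper in that era) rather than keeping them purely in memory:

```python
class SharedLog:
    """One copy of the stream on the broker; no per-consumer queues."""
    def __init__(self):
        self.messages = []
    def append(self, msg):
        self.messages.append(msg)

class Consumer:
    """All per-consumer state is a single integer offset into the log."""
    def __init__(self, log):
        self.log, self.offset = log, 0
    def poll(self, max_msgs=10):
        batch = self.log.messages[self.offset : self.offset + max_msgs]
        self.offset += len(batch)
        return batch

log = SharedLog()
storm, hadoop = Consumer(log), Consumer(log)   # fast and slow readers
for i in range(5):
    log.append("event-%d" % i)

print(storm.poll(2))    # near-real-time reader takes small batches
print(hadoop.poll(10))  # batch reader drains the same data later
```

Adding a 21st consumer here adds one integer of broker-visible state, not another copy of the stream, which is the scaling argument made in the paragraph above.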
>>> >> >> >> >>> >> >> >> >>> >> >> >> >>> >> >> >> On Thu, Jun 6, 2013 at 12:40 PM, Dragos Manolescu < >>> >> >> >> dragos.manole...@servicenow.com> wrote: >>> >> >> >> >>> >> >> >> > Hi -- >>> >> >> >> > >>> >> >> >> > I am preparing to make a case for using Kafka instead of Rabbit >>> MQ >>> >> as >>> >> >> a >>> >> >> >> > broker-based messaging provider. The context is similar to that >>> of >>> >> the >>> >> >> >> > Kafka papers and user stories: the producers publish monitoring >>> >> data >>> >> >> and >>> >> >> >> > logs, and a suite of subscribers consume this data (some store >>> it, >>> >> >> others >>> >> >> >> > perform computations on the event stream). The requirements are >>> >> >> typical >>> >> >> >> of >>> >> >> >> > this context: low-latency, high-throughput, ability to deal with >>> >> >> bursts >>> >> >> >> and >>> >> >> >> > operate in/across multiple data centers, etc. >>> >> >> >> > >>> >> >> >> > I am familiar with the performance comparison between Kafka, >>> >> Rabbit MQ >>> >> >> >> and >>> >> >> >> > Active MQ from the NetDB 2011 paper< >>> >> >> >> > >>> >> >> >> >>> >> >> >>> >> >>> http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf >>> >> >> >> >. >>> >> >> >> > However in the two years that passed since then the number of >>> >> >> production >>> >> >> >> > Kafka installations increased, and people are using it in >>> different >>> >> >> ways >>> >> >> >> > than those imagined by Kafka's designers. In light of these >>> >> >> experiences >>> >> >> >> one >>> >> >> >> > can use more data points and color when contrasting to Rabbit MQ >>> >> >> (which >>> >> >> >> by >>> >> >> >> > the way also evolved since 2011). (And FWIW I know I am not the >>> >> first >>> >> >> one >>> >> >> >> > to walk this path; see for example last year's OSCON session on >>> the >>> >> >> State >>> >> >> >> > of MQ<http://lanyrd.com/2012/oscon/swrcz/>.) 
>>> >> >> >> > >>> >> >> >> > I would appreciate it if you could share measurements, results, >>> or >>> >> >> even >>> >> >> >> > anecdotal evidence along these lines. How have you avoided the >>> >> "let's >>> >> >> use >>> >> >> >> > Rabbit MQ because everybody else does it" route when solving >>> >> problems >>> >> >> for >>> >> >> >> > which Kafka is a better fit? >>> >> >> >> > >>> >> >> >> > Thanks, >>> >> >> >> > >>> >> >> >> > -Dragos >>> >> >> >> > >>> >> >> >> >>> >> >> >>> >> >>>