> Actually you don't need 100s of GBs to reap the benefits of Kafka over
> Rabbit.
> Because Kafka doesn't centrally maintain state it can always manage higher
> message throughput more efficiently than Rabbit, even when no messages are
> persisted to disk.
>

Just out of curiosity, how does Kafka know when to remove/delete messages
from disk? Is this just done whenever a message "falls off the end of the
(circular) buffer" or is there more to it than that? Also, when you say
that Kafka doesn't centrally maintain state (at all), does that mean
clients maintain their view of where (in the server-held buffer) they're
currently at - a kind of client-side cursor onto the data? How does this
translate into no random I/O - you can't have mapped the entire
multi-terabyte-sized store into memory using mmap, so does this simply
mean that when that particular client is consuming data, you're relying on
the OS to page in the relevant bits of the data store and relying on
sendfile (under the covers) to flush that to the socket? Have I understood
this correctly? Sorry, BTW, if these are RTFM questions - I saw some bits
in the docs, but I must admit I've not trawled the code for answers as yet.

Kafka keeps messages for a configurable rolling window of time per topic.
The default is 7 days, after which the broker removes the messages from
disk.

Correct, the consumers maintain their own state via what are known as
offsets.  It's also true that when producers/consumers contact the broker
there is a single random seek to the requested offset, but the majority of
access patterns are linear.
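To make that concrete, here is a toy sketch of the client-side cursor model
in Python - illustrative only, not the real Kafka client API or wire
protocol, and all the names are made up:

```python
# Sketch of the consumer-held offset ("cursor") model: the broker just
# serves a linear log, and each consumer remembers its own offset between
# reads. Not Kafka code - an in-memory stand-in for one partition log.
import io

def read_from(log: io.BytesIO, offset: int, max_msgs: int):
    """Seek once to the stored offset, then read linearly."""
    log.seek(offset)              # the single random seek per fetch
    msgs = []
    for _ in range(max_msgs):
        line = log.readline()     # purely sequential from here on
        if not line:
            break
        msgs.append(line.rstrip(b"\n"))
    return msgs, log.tell()       # consumer persists the new offset itself

# A stand-in for one topic partition on disk.
log = io.BytesIO(b"m1\nm2\nm3\nm4\n")

offset = 0                                  # consumer-held state
batch, offset = read_from(log, offset, 2)   # first two messages
batch, offset = read_from(log, offset, 2)   # resumes where it left off
```

Because each fetch is one seek followed by a linear scan, the OS page cache
and sendfile can do the heavy lifting, which is the "no random I/O" point
above.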


> As you can see in the last graph of 10 million messages, which is less
> than a GB on disk, the Rabbit throughput is capped around 10k/sec.  Beyond
> throughput, with the pending release of 0.8, Kafka will also have
> advantages around message guarantees and durability.
>

Fascinating. What are those guarantees going to be? One of the reasons
Rabbit runs a bit slower - one of several - when persisting data, is that
each write is fsync'ed to disk, whereas Kafka relies on OS-level flushing
IIRC, providing a configurable parameter to force a flush after some
defined number of messages, so as to avoid too much potential data loss in
case of server failure. So in that respect, Rabbit has a stronger guarantee
of durability in its current incarnation, with the obvious caveat that
doing so has an adverse effect on performance.

When you say "message guarantees", are we talking about ordering, or
delivery, or both? Very interested to hear about those.

Correct, with 0.8 Kafka will have options similar to Rabbit's fsync
configuration.  Messages have always had ordering guarantees, but with 0.8
there is the notion of topic replicas, similar to the replication factor in
Hadoop or Cassandra.
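For reference, IIRC the broker-side flush knobs look something like this in
the 0.8-era server.properties (check the config docs for exact names and
defaults):

```
# server.properties - flush by message count or elapsed time,
# rather than fsync'ing every individual write
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```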

http://www.slideshare.net/junrao/kafka-replication-apachecon2013

Through configuration you can trade off latency for durability, with 3
options:
  - Producer receives no acks (no network delay)
  - Producer waits for an ack from the broker leader (1 network roundtrip)
  - Producer waits for a quorum ack (2 network roundtrips)
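IIRC those three map onto a single producer setting in the 0.8 config,
along these lines (again, names from memory, so verify against the docs):

```
# producer.properties - durability vs. latency knob (0.8-era name)
# request.required.acks=0    fire-and-forget, no ack
# request.required.acks=1    wait for the partition leader only
request.required.acks=-1     # wait for the full in-sync quorum
```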

With the combination of quorum commits and consumers managing their own
state you can get much closer to exactly-once guarantees, i.e. the
consumers can commit their consumption state and the consumed messages in
the same transaction.
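Here is a toy sketch of that idea in Python, using SQLite as a stand-in for
the consumer's own store - this is not a Kafka API, and the table names and
logic are made up for illustration:

```python
# Store the processed message and the new offset in ONE transaction, so a
# redelivery after a crash either sees both writes or neither.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (msg TEXT)")
db.execute("CREATE TABLE offsets (partition INTEGER PRIMARY KEY, offset INTEGER)")
db.execute("INSERT INTO offsets VALUES (0, 0)")
db.commit()

def consume(partition: int, msg: str, new_offset: int) -> None:
    """Commit the consumed message and the consumer's offset atomically."""
    with db:  # one transaction: commits on success, rolls back on error
        (current,) = db.execute(
            "SELECT offset FROM offsets WHERE partition = ?", (partition,)
        ).fetchone()
        if new_offset <= current:
            return  # already processed: a replayed delivery is a no-op
        db.execute("INSERT INTO results VALUES (?)", (msg,))
        db.execute(
            "UPDATE offsets SET offset = ? WHERE partition = ?",
            (new_offset, partition),
        )

consume(0, "payment-123", 1)
consume(0, "payment-123", 1)  # redelivery after a crash: skipped
```

The broker still delivers at-least-once; the dedup against the stored
offset is what turns that into effectively-once processing on the consumer
side.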



On Mon, Jun 10, 2013 at 6:40 AM, Tim Watson <watson.timo...@gmail.com> wrote:

> Hi Jonathan,
>
> Cheers,
> Tim
>
> On 10 Jun 2013, at 13:12, Jonathan Hodges wrote:
>
> > Actually you don't need 100s GBs to reap the benefits of Kafka over
> Rabbit.
> > Because Kafka doesn't centrally maintain state it can always manage
> higher
> > message throughput more efficiently than Rabbit even when there is no
> > messages persisted to disk.
> >
>
> Just out of curiosity, how does Kafka know when to remove/delete messages
> from disk? Is this just done whenever a messages "falls off the end of the
> (circular) buffer" or is there more to it than that? Also, when you say
> that Kafka doesn't centrally maintain state (at all), does that mean
> clients maintain their view of where (in the server held buffer) they're
> currently at - kind of client-side cursor to the data? How does this
> translate into no random I/O - you can't have mapped the entire
> multi-terrabyte sized store into memory using mmap, so does this simply
> mean that when that particular client is consuming data, you're relying on
> the OS to page in the relevant bits of the data store and relying on
> sendfile (under the covers) to flush that to the socket? Have I understood
> this correctly? Sorry, BTW, if these are RTFM questions - I saw some bits
> in the docs, but I must admit I've not trawled the code for answers as yet.
>
> > As you can see in the last graph of 10 million messages which is less
> than
> > a GB on disk, the Rabbit throughput is capped around 10k/sec.  Beyond
> > throughput, with the pending release of 0.8, Kafka will also have
> > advantages around message guarantees and durability.
> >
>
> Fascinating. What are those guarantees going to be? One of the reasons
> Rabbit runs a bit slower - one of several - when persisting data, is that
> each write it fsync'ed to disk, whereas kafka relies on OS level flushing
> IIRC, providing a configurable parameter to force a flush after some
> defined number of messages, so as to avoid too much potential data loss in
> case of server failure. So in that respect, Rabbit has a highly guarantee
> of durability in its current incarnation, with the obvious caveats that
> doing so has an adverse affect on performance.
>
> When you say "message guarantees", are we talking about ordering, or
> delivery, or both? Very interested to hear about those.
>
> Cheers,
> Tim
