TL;DR: Any experience with overlaying a “single queue multi server” facade onto 
a Kafka topic?

(I’m new to the list and I tried searching for an answer prior to spamming 
everyone with this idea - sorry if I missed the answer in that search —ron)

I'm curious if anyone has run into (and solved) a similar problem to one that 
I'm currently seeing. 

We are using Kafka in a very straightforward fashion: a pool of producers pumps 
messages into a topic (with, say, 40 partitions), distributing messages uniformly 
across the partitions, and a pool of (say 20) uniformly configured consumers reads 
them. All of the messages (and the work associated with processing a message) are 
independent and fungible: there is no need for a particular message to be 
processed by a particular consumer.
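For concreteness, here is roughly what one of those consumers looks like today - 
a minimal sketch, where the topic name, group id, and broker address are all made 
up and the real configuration has more tuning than this:

// Sketch of the current "straightforward" consumer: subscribe to the topic,
// let the group protocol hand out partitions, process whatever arrives.
// Topic name, group id, and broker address are placeholders.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PlainWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "work-consumers");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("work-topic"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    process(r.value()); // every message is independent and fungible
                }
            }
        }
    }

    static void process(String message) { /* do the actual work */ }
}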

When everything is happy, these uniformly configured consumers process all the 
messages. When things are sad, and in particular when some subset of consumers 
are performing poorly, those consumers lag on their assigned partitions. And none 
of the other consumers, not even the ones that are still just chugging along, can 
help.

But none of this is Kafka’s fault - all of this is both well understood and 
intentional in the design of Kafka. The work is partitioned and it is the 
responsibility of the consumers to keep up. My concern, though, is that there 
isn’t very much I can do when it happens (it == a laggy consumer). In 
particular, I see a handful of actions that I could take:
a) I can add more consumers (to, say, 40: one per partition). But this is 
insufficient - it doesn’t resolve the problem when a consumer can’t keep up 
with even one partition.
b) I can [statically] configure how many partitions to assign to each consumer 
(a minimal sketch of what I mean follows this list). This is insufficient for 
the same reason as above.
c) I can ensure that my consumers have sufficient capability to process at 
least one partition - this is a nice thought, but my consumers’ performance is 
often influenced by external factors (e.g., database slowdowns) that impact 
their ability to consume and are difficult to predict ahead of time; further, 
these external factors tend to impact certain consumers more than others.
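(To be concrete about what I mean by b): the only real knob I know of is to 
bypass the group protocol and pin partitions with assign(). The names and 
partition numbers below are made up. If this consumer slows down, its pinned 
partitions lag and nobody else can take them over.)

// Sketch of option (b): statically pinning partitions to one consumer with
// assign() instead of subscribe(). Topic name and partitions are placeholders.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PinnedWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "work-consumers");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // These two partitions belong to this consumer no matter how it performs.
            consumer.assign(List.of(
                    new TopicPartition("work-topic", 0),
                    new TopicPartition("work-topic", 1)));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    // process r.value()
                }
            }
        }
    }
}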

One thought not on that list is to give Kafka a “single queue multi server” 
facade - each consumer requests work (from the topic, not a particular 
partition) at the rate at which it can process it, and more capable consumers 
will naturally process more messages. If the system lags, we can add more 
consumers, even weak ones, to help get the work done. This would give Kafka the 
properties of a single M/M/m queueing network. (For the record, I prefer to 
model the general case of a Kafka-based system as a collection of m independent 
M/M/1 queueing networks, perhaps with arrival rates proportional to the number 
of partitions assigned to each consumer.)
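To be clear about the shape of the thing I'm imagining, here is a deliberately 
naive, in-process sketch: one poller thread drains every partition it owns into 
a bounded in-memory queue, and a pool of worker threads pulls from that queue 
as fast as each one individually can. All names and sizes are made up, and it 
punts on the hard parts (offsets, rebalances, running across processes), which 
is what the next paragraph is about.

// Naive in-process "single queue, multiple servers" sketch: one poller feeds
// a bounded queue, workers take from it at their own pace, so faster workers
// naturally process more records. Placeholders throughout.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class QueueFacadeSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<ConsumerRecord<String, String>> queue = new ArrayBlockingQueue<>(10_000);

        int workers = 20; // imagine these were separate processes or machines
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                while (true) {
                    try {
                        ConsumerRecord<String, String> r = queue.take(); // pull at own pace
                        process(r.value());
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "work-consumers");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false"); // commits must wait for processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("work-topic"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(200))) {
                    // put() blocks when workers fall behind (backpressure); if it
                    // blocks too long, max.poll.interval.ms becomes a problem.
                    queue.put(r);
                }
                // Offset commits deliberately omitted; see the bookkeeping
                // sketch further down.
            }
        }
    }

    static void process(String message) { /* do the work */ }
}

Inside one process this is trivial; the interesting question is whether the same 
shape survives being spread across processes at a meaningful rate.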

This, much like c) above, is a nice thought but is pretty challenging. The 
facade needs to consume from all the partitions and dole out the work as it’s 
requested. It needs to manage offsets in an unnatural-to-Kafka way. It needs to 
handle the partitions somewhat uniformly as well - it can’t just drain one 
partition and then move on to the next (or can it?). It needs to be lightweight 
(in terms of overhead). And it needs to be robust.
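For the offset piece specifically, the least-bad idea I’ve come up with is 
per-partition bookkeeping: remember which offsets have completed out of order 
and only ever commit the low-water mark. A toy sketch of that bookkeeping is 
below - it assumes offsets are contiguous (compaction or transaction markers 
would break that) and ignores rebalances and restarts entirely:

// Toy bookkeeping for committing offsets when work completes out of order:
// per partition, only advance the committable offset across a contiguous
// prefix of completed offsets.
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class ContiguousOffsetTracker {
    private final Map<Integer, Long> nextToCommit = new HashMap<>();          // per partition
    private final Map<Integer, TreeSet<Long>> completedAhead = new HashMap<>(); // gaps above it

    /** Call once per partition with the offset of the first record dispatched. */
    public synchronized void seed(int partition, long firstOffset) {
        nextToCommit.putIfAbsent(partition, firstOffset);
        completedAhead.putIfAbsent(partition, new TreeSet<>());
    }

    /**
     * Mark one record as fully processed. Returns the offset now safe to commit
     * for that partition (exclusive), or -1 if the commit point didn't advance.
     */
    public synchronized long markDone(int partition, long offset) {
        long next = nextToCommit.get(partition);
        TreeSet<Long> ahead = completedAhead.get(partition);
        ahead.add(offset);
        long before = next;
        while (ahead.remove(next)) { // walk forward over the contiguous prefix
            next++;
        }
        nextToCommit.put(partition, next);
        return next != before ? next : -1L;
    }
}

The obvious cost is that one stuck record pins the commit point for its whole 
partition, which puts a ceiling on how far ahead the faster workers can safely get.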

All that aside, has anyone managed to make such a facade work and have it 
process at a significantly high rate? (In my hypothetical above, I need 
100K/second cumulative across those 40 partitions - about 2.5K/sec/partition, 
or 5K/sec/consumer.) If you have thoughts about this, feel free to share. If 
you think this is silly, that I’m not Kafka-ing the right way, or that there’s 
a brutally obvious solution that I’ve just missed, share that too.

Thanks!

Ron C.
