So I disagree with the idea of using custom partitioning, depending on your
requirements. Having a consumer consume from a single partition is not
(currently) that easy. If you don't care which consumer gets which
partition (group), then it's not that bad. You have 20 partitions, you have
20 consumers, and you use custom partitioning as noted. The consumers use
the high-level consumer with a single group, each one gets one partition,
and it's pretty straightforward. If a consumer crashes, you will end up
with two partitions on one of the remaining consumers. If that's OK, this
is a decent solution.

If, however, you require that each consumer always have the same group of
data, and you need to know what that group is beforehand, it's more
difficult. You need to use the simple consumer to do it, which means you
need to implement a lot of logic for error and status code handling
yourself, and do it right. In this case, I think your idea of using 400
separate topics is sound. This way you can still use the high-level
consumer, which takes care of the error handling for you, and your data is
separated out by topic.
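As a rough sketch, each API server consumer could then be as simple as
this, using the high-level consumer against its own topic (the topic name,
group.id, and ZooKeeper address are all made up):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

// Sketch only: one high-level consumer reading one per-group topic.
// Rebalancing, offset commits, and broker failover are handled for you.
public class GroupTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");   // hypothetical
        props.put("group.id", "api-server-group-a");  // hypothetical
        props.put("auto.offset.reset", "smallest");

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        String topic = "group-a";                     // hypothetical per-group topic
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap(topic, 1));

        for (MessageAndMetadata<byte[], byte[]> msg : streams.get(topic).get(0)) {
            // Push msg.message() out over the websocket for this group.
        }
    }
}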

Provided it is not an issue to implement it in your producer, I would go
with the separate topics. Alternatively, if you're not sure you always want
separate topics, you could go with something similar to your second idea,
but have a consumer read the single topic and split the data out into 400
separate topics in Kafka (no need for Cassandra or Redis or anything else).
Then your real consumers can all consume their separate topics. Reading and
writing the data one extra time is much better than rereading all of it 400
times and throwing most of it away.
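Here's a sketch of that fan-out stage, reading the firehose once and
re-publishing into per-group topics. The topic naming scheme and the
extractGroupId helper are assumptions about how your data identifies its
group:

import java.util.Collections;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch only: consume the single topic once and write each message back
// out to a per-group topic such as "group-<id>". One extra read and write
// in total, instead of 400 consumers each rereading the whole stream.
public class FanOut {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("zookeeper.connect", "zk1:2181");     // hypothetical
        cProps.put("group.id", "fan-out");               // hypothetical

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "broker1:9092"); // hypothetical
        pProps.put("key.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");
        pProps.put("value.serializer",
            "org.apache.kafka.common.serialization.ByteArraySerializer");

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(cProps));
        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps);

        String firehose = "all-events";                  // hypothetical source topic
        for (MessageAndMetadata<byte[], byte[]> msg :
                connector.createMessageStreams(Collections.singletonMap(firehose, 1))
                         .get(firehose).get(0)) {
            // extractGroupId is a stand-in for however the group is encoded
            // in your messages (key, payload field, ...).
            producer.send(new ProducerRecord<>("group-" + extractGroupId(msg),
                                               msg.key(), msg.message()));
        }
    }

    static String extractGroupId(MessageAndMetadata<byte[], byte[]> msg) {
        return new String(msg.key()); // assumes the key is the group ID
    }
}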

-Todd


On Wed, Sep 30, 2015 at 9:06 AM, Ben Stopford <b...@confluent.io> wrote:

> Hi Shaun
>
> You might consider using a custom partition assignment strategy to push
> your different "groups" to different partitions. This would allow you to
> walk the middle ground between "all consumers consume everything" and "one
> topic per consumer" as you vary the number of partitions in the topic,
> albeit at the cost of a little extra complexity.
>
> Also, not sure if you’ve seen it, but there is quite a good section in the
> FAQ here <
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowmanytopicscanIhave?>
> on topic and partition sizing.
>
> B
>
> > On 29 Sep 2015, at 18:48, Shaun Senecal <shaun.sene...@lithium.com>
> wrote:
> >
> > Hi
> >
> >
> > I have read Jay Kreps' post regarding the number of topics that can be
> handled by a broker (
> https://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka),
> and it has left me with more questions that I don't see answered anywhere
> else.
> >
> >
> > We have a data stream which will be consumed by many consumers (~400).
> We also have many "groups" within our data.  A group in the data
> corresponds 1:1 with what the consumers would consume, so consumer A only
> ever sees group A messages, consumer B only consumes group B messages, etc.
> >
> >
> > The downstream consumers will be consuming via a websocket API, so the
> API server will be the thing consuming from Kafka.
> >
> >
> > If I use a single topic with, say, 20 partitions, the consumers in the
> API server would need to re-read the same messages over and over for each
> consumer, which seems like a waste of network and a potential bottleneck.
> >
> >
> > Alternatively, I could use a single topic with 20 partitions and have a
> single consumer in the API put the messages into Cassandra/Redis (as
> suggested by Jay), and serve out the downstream consumer streams that way.
> However, that requires using a secondary sorted store, which seems like a
> waste (and added complexity) given that Kafka already has the data exactly
> as I need it, especially if Cassandra/Redis are required to maintain a
> long TTL on the stream.
> >
> >
> > Finally, I could use 1 topic per group, each with a single partition.
> This would result in 400 topics on the broker, but would allow the API
> server to simply serve the stream for each consumer directly from Kafka and
> won't require additional machinery to serve out the requests.
> >
> >
> > The 400-topic solution makes the most sense to me (doesn't require extra
> services, doesn't waste resources), but seems to conflict with best
> practices, so I wanted to ask the community for input.  Has anyone done
> this before?  What makes the most sense here?
> >
> >
> >
> >
> > Thanks
> >
> >
> > Shaun
>
>
