I agree. The only reason I can think of for the custom partitioning route would
be if your group concept were to grow to a point where a topic-per-category
strategy became prohibitive. This seems unlikely based on what you’ve said. I
should also add that Todd is spot on regarding the SimpleConsumer not being
something you’d want to pursue at this time. There is, however, a new consumer
on trunk which makes these things a little easier.
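
For illustration, here’s a rough sketch of what pinning a consumer to a single
partition looks like with the new consumer (this is written against trunk, so
class and method names may still change before release; process() is just a
placeholder for your handler):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "api-server");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    // Assign one partition explicitly instead of subscribing to the whole
    // topic -- this is the part the old high level consumer made hard.
    consumer.assign(Collections.singletonList(new TopicPartition("events", 3)));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records)
            process(record);  // placeholder for your handler
    }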


> On 30 Sep 2015, at 19:05, Pradeep Gollakota <pradeep...@gmail.com> wrote:
> 
> To add a little more context to Shaun's question, we have around 400
> customers. Each customer has a stream of events. Some customers generate a
> lot of data while others don't. We need to ensure that each customer's data
> is sorted globally by timestamp.
> 
> We have two use cases around consumption:
> 
> 1. A user may consume an individual customer's data
> 2. A user may consume data for all customers
> 
> Given these two use cases, I think the better strategy is to have a
> separate topic per customer as Todd suggested.
> 
> On Wed, Sep 30, 2015 at 9:26 AM, Todd Palino <tpal...@gmail.com> wrote:
> 
>> So I disagree with the idea to use custom partitioning, depending on your
>> requirements. Having a consumer consume from a single partition is not
>> (currently) that easy. If you don't care which consumer gets which
>> partition (group), then it's not that bad. You have 20 partitions, you have
>> 20 consumers, and you use custom partitioning as noted. The consumers use
>> the high level consumer with a single group, each one gets one partition,
>> and it's pretty straightforward. If a consumer crashes, you
>> will end up with two partitions on one of the remaining consumers. If this
>> is OK, this is a decent solution.
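>>
>> For illustration, a custom partitioner along those lines might look like
>> the sketch below, written against the new Java producer's Partitioner
>> interface (the hash-on-key scheme is just an assumption about how you'd
>> map a group to a partition):
>>
>>     import java.util.Map;
>>     import org.apache.kafka.clients.producer.Partitioner;
>>     import org.apache.kafka.common.Cluster;
>>
>>     public class GroupPartitioner implements Partitioner {
>>         @Override
>>         public void configure(Map<String, ?> configs) {}
>>
>>         @Override
>>         public int partition(String topic, Object key, byte[] keyBytes,
>>                              Object value, byte[] valueBytes, Cluster cluster) {
>>             int numPartitions = cluster.partitionsForTopic(topic).size();
>>             // Key messages by group id so the same group always lands on
>>             // the same partition; mask the sign bit so the result is
>>             // never negative.
>>             return (key.hashCode() & 0x7fffffff) % numPartitions;
>>         }
>>
>>         @Override
>>         public void close() {}
>>     }
>>
>> You'd wire that in via the producer's partitioner.class config.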
>> 
>> If, however, you require that each consumer always have the same group of
>> data, and you need to know what that group is beforehand, it's more
>> difficult. You need to use the simple consumer to do it, which means you
>> need to implement a lot of logic for error and status code handling
>> yourself, and do it right. In this case, I think your idea of using 400
>> separate topics is sound. This way you can still use the high level
>> consumer, which takes care of the error handling for you, and your data is
>> separated out by topic.
>> 
>> Provided it is not an issue to implement it in your producer, I would go
>> with the separate topics. Alternately, if you're not sure you always want
>> separate topics, you could go with something similar to your second idea,
>> but have a consumer read the single topic and split the data out into 400
>> separate topics in Kafka (no need for Cassandra or Redis or anything else).
>> Then your real consumers can all consume their separate topics. Reading and
>> writing the data one extra time is much better than rereading all of it 400
>> times and throwing most of it away.
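>>
>> A sketch of that fan-out step, assuming the new Java producer and that the
>> customer id is the message key (both assumptions -- adapt to your setup):
>>
>>     import java.util.Properties;
>>     import kafka.message.MessageAndMetadata;
>>     import org.apache.kafka.clients.producer.KafkaProducer;
>>     import org.apache.kafka.clients.producer.ProducerRecord;
>>
>>     Properties props = new Properties();
>>     props.put("bootstrap.servers", "localhost:9092");
>>     props.put("key.serializer",
>>         "org.apache.kafka.common.serialization.StringSerializer");
>>     props.put("value.serializer",
>>         "org.apache.kafka.common.serialization.ByteArraySerializer");
>>     KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
>>
>>     // stream is a KafkaStream from the high level consumer reading the
>>     // single firehose topic; consumer setup elided for brevity.
>>     for (MessageAndMetadata<String, byte[]> msg : stream) {
>>         String customer = msg.key();  // assumes customer id is the key
>>         producer.send(new ProducerRecord<>("events-" + customer,
>>                                            msg.key(), msg.message()));
>>     }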
>> 
>> -Todd
>> 
>> 
>> On Wed, Sep 30, 2015 at 9:06 AM, Ben Stopford <b...@confluent.io> wrote:
>> 
>>> Hi Shaun
>>> 
>>> You might consider using a custom partition assignment strategy to push
>>> your different “groups” to different partitions. This would allow you to
>>> walk the middle ground between “all consumers consume everything” and
>>> “one topic per consumer” as you vary the number of partitions in the
>>> topic, albeit at the cost of a little extra complexity.
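>>>
>>> For example, the new producer also lets you name the partition directly
>>> on each send, e.g. (partitionForGroup() is a hypothetical mapping of your
>>> own, and the surrounding names are illustrative):
>>>
>>>     // producer is a KafkaProducer<String, String>; groupId/payload are
>>>     // the message's group key and body.
>>>     int partition = partitionForGroup(groupId);  // e.g. hash or lookup table
>>>     producer.send(new ProducerRecord<>("events", partition, groupId, payload));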
>>> 
>>> Also, not sure if you’ve seen it but there is quite a good section in the
>>> FAQ here <https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowmanytopicscanIhave?>
>>> on topic and partition sizing.
>>> 
>>> B
>>> 
>>>> On 29 Sep 2015, at 18:48, Shaun Senecal <shaun.sene...@lithium.com>
>>> wrote:
>>>> 
>>>> Hi
>>>> 
>>>> 
>>>> I have read Jay Kreps' post regarding the number of topics that can be
>>>> handled by a broker (
>>>> https://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka),
>>>> and it has left me with more questions that I don't see answered anywhere
>>>> else.
>>>> 
>>>> 
>>>> We have a data stream which will be consumed by many consumers (~400).
>>>> We also have many "groups" within our data.  A group in the data
>>>> corresponds 1:1 with what the consumers would consume, so consumer A only
>>>> ever sees group A messages, consumer B only consumes group B messages,
>>>> etc.
>>>> 
>>>> 
>>>> The downstream consumers will be consuming via a websocket API, so the
>>>> API server will be the thing consuming from Kafka.
>>>> 
>>>> 
>>>> If I use a single topic with, say, 20 partitions, the API server would
>>>> need to re-read the same messages over and over for each downstream
>>>> consumer, which seems like a waste of network and a potential bottleneck.
>>>> 
>>>> 
>>>> Alternatively, I could use a single topic with 20 partitions and have a
>>>> single consumer in the API put the messages into Cassandra/Redis (as
>>>> suggested by Jay), and serve out the downstream consumer streams that
>>>> way. However, that requires using a secondary sorted store, which seems
>>>> like a waste (and added complexity) given that Kafka already has the data
>>>> exactly as I need it, especially if Cassandra/Redis are required to
>>>> maintain a long TTL on the stream.
>>>> 
>>>> 
>>>> Finally, I could use one topic per group, each with a single partition.
>>>> This would result in 400 topics on the broker, but would allow the API
>>>> server to simply serve the stream for each consumer directly from Kafka
>>>> and wouldn't require additional machinery to serve out the requests.
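>>>>
>>>> For example (sketching with the current high level consumer; the
>>>> per-customer topic name and config are illustrative), the API server
>>>> could create one consumer per websocket session pointed at just that
>>>> customer's topic:
>>>>
>>>>     import java.util.Collections;
>>>>     import java.util.List;
>>>>     import java.util.Map;
>>>>     import kafka.consumer.Consumer;
>>>>     import kafka.consumer.ConsumerConfig;
>>>>     import kafka.consumer.KafkaStream;
>>>>     import kafka.javaapi.consumer.ConsumerConnector;
>>>>
>>>>     // props holds zookeeper.connect, group.id, etc. (elided)
>>>>     ConsumerConnector connector =
>>>>         Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
>>>>     // One stream from the single-partition topic for this customer.
>>>>     Map<String, List<KafkaStream<byte[], byte[]>>> streams =
>>>>         connector.createMessageStreams(
>>>>             Collections.singletonMap("events-customerA", 1));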
>>>> 
>>>> 
>>>> The 400-topic solution makes the most sense to me (doesn't require extra
>>>> services, doesn't waste resources), but seems to conflict with best
>>>> practices, so I wanted to ask the community for input.  Has anyone done
>>>> this before?  What makes the most sense here?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>>> Shaun
>>> 
>>> 
>> 
