I agree. The only reason I can think of for the custom partitioning route would be if your group concept grew to the point where a topic-per-category strategy became prohibitive. That seems unlikely based on what you've said. I should also add that Todd is spot on about the SimpleConsumer not being something you'd want to pursue at this time. There is, however, a new consumer on trunk that makes these things a little easier.
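To give a flavour of the new consumer, here's a minimal, untested sketch of subscribing to a single per-customer topic. The broker address, group id and the "customer-a" topic name are all placeholders, and since this is against trunk the details may shift before release:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CustomerStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("group.id", "customer-a-api");          // placeholder group id
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            // Hypothetical per-customer topic name
            consumer.subscribe(Collections.singletonList("customer-a"));

            while (true) {
                // poll() drives fetching, group membership and heartbeats
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("offset=%d value=%s%n",
                                      record.offset(), record.value());
            }
        }
    }

Todd's fan-out suggestion further down is also fairly small with the new clients; I've put a rough sketch of that at the bottom of this mail, below the quoted thread.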
> On 30 Sep 2015, at 19:05, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>
> To add a little more context to Shaun's question, we have around 400
> customers. Each customer has a stream of events. Some customers generate a
> lot of data while others don't. We need to ensure that each customer's data
> is sorted globally by timestamp.
>
> We have two use cases around consumption:
>
> 1. A user may consume an individual customer's data
> 2. A user may consume data for all customers
>
> Given these two use cases, I think the better strategy is to have a
> separate topic per customer, as Todd suggested.
>
> On Wed, Sep 30, 2015 at 9:26 AM, Todd Palino <tpal...@gmail.com> wrote:
>
>> So I disagree with the idea to use custom partitioning, depending on your
>> requirements. Having a consumer consume from a single partition is not
>> (currently) that easy. If you don't care which consumer gets which
>> partition (group), then it's not that bad. You have 20 partitions and 20
>> consumers, and you use custom partitioning as noted. The consumers use
>> the high-level consumer with a single group, each one will get one
>> partition, and it's pretty straightforward. If a consumer crashes, you
>> will end up with two partitions on one of the remaining consumers. If
>> this is OK, this is a decent solution.
>>
>> If, however, you require that each consumer always have the same group of
>> data, and you need to know what that group is beforehand, it's more
>> difficult. You need to use the simple consumer to do it, which means you
>> need to implement a lot of logic for error and status code handling
>> yourself, and do it right. In this case, I think your idea of using 400
>> separate topics is sound. This way you can still use the high-level
>> consumer, which takes care of the error handling for you, and your data
>> is separated out by topic.
>>
>> Provided it is not an issue to implement it in your producer, I would go
>> with the separate topics. Alternately, if you're not sure you always want
>> separate topics, you could go with something similar to your second idea,
>> but have a consumer read the single topic and split the data out into 400
>> separate topics in Kafka (no need for Cassandra or Redis or anything
>> else). Then your real consumers can all consume their separate topics.
>> Reading and writing the data one extra time is much better than rereading
>> all of it 400 times and throwing most of it away.
>>
>> -Todd
>>
>> On Wed, Sep 30, 2015 at 9:06 AM, Ben Stopford <b...@confluent.io> wrote:
>>
>>> Hi Shaun
>>>
>>> You might consider using a custom partition assignment strategy to push
>>> your different "groups" to different partitions. This would allow you to
>>> walk the middle ground between "all consumers consume everything" and
>>> "one topic per consumer" as you vary the number of partitions in the
>>> topic, albeit at the cost of a little extra complexity.
>>>
>>> Also, not sure if you've seen it, but there is quite a good section in
>>> the FAQ here
>>> <https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowmanytopicscanIhave?>
>>> on topic and partition sizing.
>>>
>>> B
>>>
>>>> On 29 Sep 2015, at 18:48, Shaun Senecal <shaun.sene...@lithium.com>
>>>> wrote:
>>>>
>>>> Hi
>>>>
>>>> I have read Jay Kreps' post regarding the number of topics that can be
>>>> handled by a broker
>>>> (https://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka),
>>>> and it has left me with more questions that I don't see answered
>>>> anywhere else.
>>>>
>>>> We have a data stream which will be consumed by many consumers (~400).
>>>> We also have many "groups" within our data. A group in the data
>>>> corresponds 1:1 with what the consumers would consume, so consumer A
>>>> only ever sees group A messages, consumer B only consumes group B
>>>> messages, etc.
>>>>
>>>> The downstream consumers will be consuming via a WebSocket API, so the
>>>> API server will be the thing consuming from Kafka.
>>>>
>>>> If I use a single topic with, say, 20 partitions, the consumers in the
>>>> API server would need to re-read the same messages over and over for
>>>> each consumer, which seems like a waste of network and a potential
>>>> bottleneck.
>>>>
>>>> Alternatively, I could use a single topic with 20 partitions and have a
>>>> single consumer in the API put the messages into Cassandra/Redis (as
>>>> suggested by Jay), and serve out the downstream consumer streams that
>>>> way. However, that requires a secondary sorted store, which seems like
>>>> a waste (and added complexity) given that Kafka already has the data
>>>> exactly as I need it, especially if Cassandra/Redis are required to
>>>> maintain a long TTL on the stream.
>>>>
>>>> Finally, I could use one topic per group, each with a single partition.
>>>> This would result in 400 topics on the broker, but would allow the API
>>>> server to simply serve the stream for each consumer directly from Kafka
>>>> and won't require additional machinery to serve out the requests.
>>>>
>>>> The 400-topic solution makes the most sense to me (doesn't require
>>>> extra services, doesn't waste resources), but seems to conflict with
>>>> best practices, so I wanted to ask the community for input. Has anyone
>>>> done this before? What makes the most sense here?
>>>>
>>>> Thanks
>>>>
>>>> Shaun
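PS: as promised, a rough, untested sketch of Todd's fan-out suggestion with the new clients: one process reads the firehose topic and re-publishes each record to a per-customer topic. The topic names ("events" and "events.<customerId>") are made up, and it assumes the customer id travels in the message key:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FanOut {
        public static void main(String[] args) {
            Properties cprops = new Properties();
            cprops.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            cprops.put("group.id", "fan-out");
            cprops.put("key.deserializer",
                       "org.apache.kafka.common.serialization.StringDeserializer");
            cprops.put("value.deserializer",
                       "org.apache.kafka.common.serialization.StringDeserializer");

            Properties pprops = new Properties();
            pprops.put("bootstrap.servers", "localhost:9092");
            pprops.put("key.serializer",
                       "org.apache.kafka.common.serialization.StringSerializer");
            pprops.put("value.serializer",
                       "org.apache.kafka.common.serialization.StringSerializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cprops);
            KafkaProducer<String, String> producer = new KafkaProducer<>(pprops);

            // Hypothetical single firehose topic carrying all customers' events
            consumer.subscribe(Collections.singletonList("events"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the customer id is the message key; each record
                    // is re-published to its own per-customer topic.
                    producer.send(new ProducerRecord<>("events." + record.key(),
                                                       record.key(), record.value()));
                }
            }
        }
    }

In a real version you'd check the results of the sends and manage offset commits carefully, so a crash can't silently drop or duplicate messages.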