Hey Susan,

As far as I know, there are only minimal performance differences between the partition strategy and the topic strategy. In terms of how they are allocated in memory they should be very similar, but I'll get some Kafka experts to comment on that.
From Samza's perspective, if you choose to go with multiple partitions, you can write a Samza job that repartitions the stream exactly as you described: it peeks at the clientID in each stream event and sends the event to the corresponding partition [1]. You can then have a second job in which each Task processes one partition (which will correspond to the events from one clientID). In that implementation there is a one-to-one mapping between the Task and the partition.

[1] http://samza.apache.org/learn/documentation/0.9/api/javadocs/org/apache/samza/system/OutgoingMessageEnvelope.html

Thanks,
Naveen

On Fri, Apr 24, 2015 at 3:29 PM, Susan Luong <susan...@gmail.com> wrote:

> Hi there, I'm new to Samza/Kafka and we're evaluating Samza to see whether
> it would be a good fit for our application. I just had a few questions
> about how partitioning works.
>
> I understand there is a limitation on the number of topics we can create
> [1], and I was wondering, if we need more than, say 10K topics, would it be
> a better idea to use partitioning instead? or would the same limits apply?
> i.e. would having 1 topic with 10k partitions produce the same performance
> issues as having 10k topics with 1 partition each?
>
> If we can overcome the topics limitation by creating more partitions, we'd
> like to be able to divide up our stream messages by client ID. is it
> possible to group partitions so that we have a set of partitions that
> contain data from a certain client and another set of partitions for
> another client, within the same topic?
>
> For example, we might have a stream partition 'A' (for clientID A) and a
> corresponding task 'a' that processes messages from partition 'A', and a
> partition B (for client B) and a corresponding task, 'b' that processes
> messages from stream partition 'B'. Our problem though, is that, we'd like
> for task 'a' to only process messages from stream A and never from stream
> B, since task 'a' may contain local state that applies specifically to
> stream A. Would this be possible?
>
> Maybe I'm not understanding how Samza works, but I'm hoping someone can
> help me clarify. Thanks in advance for your help.
>
> Susan
>
> [1]
> http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic
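
For illustration, the routing decision inside the repartitioning job from the reply (look at the clientID, pick that client's partition) might be sketched in plain Java as below. This is only a sketch under assumptions: the class name ClientPartitionRouter, the fixed list of client IDs, and the rule "partition index = position in the list" are all hypothetical and do not come from Samza or Kafka. In an actual Samza job the chosen value would be handed to the output stream as the partition key of the OutgoingMessageEnvelope referenced in the reply; how the system producer turns that key into a partition depends on the Samza and Kafka versions in use, so that part is left out here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps each known clientID to exactly one partition,
// so that one downstream task only ever sees one client's events.
final class ClientPartitionRouter {
    private final Map<String, Integer> partitionByClient = new HashMap<>();

    // Assumption: the full set of client IDs is known up front (for example
    // from job configuration) and listed in the same order everywhere, so
    // every container computes the same assignment.
    ClientPartitionRouter(List<String> clientIds) {
        for (int i = 0; i < clientIds.size(); i++) {
            Integer previous = partitionByClient.put(clientIds.get(i), i);
            if (previous != null) {
                throw new IllegalArgumentException(
                        "duplicate clientID: " + clientIds.get(i));
            }
        }
    }

    // Number of partitions the output topic would need: one per client.
    int partitionCount() {
        return partitionByClient.size();
    }

    // Returns the partition reserved for this client, or fails loudly
    // instead of silently mixing an unknown client into another partition.
    int partitionFor(String clientId) {
        Integer partition = partitionByClient.get(clientId);
        if (partition == null) {
            throw new IllegalArgumentException("unknown clientID: " + clientId);
        }
        return partition;
    }

    public static void main(String[] args) {
        ClientPartitionRouter router =
                new ClientPartitionRouter(Arrays.asList("A", "B", "C"));
        for (String client : Arrays.asList("A", "C", "B", "A")) {
            System.out.println("clientID " + client
                    + " goes to partition " + router.partitionFor(client));
        }
    }
}
```

An explicit table is used here rather than hashing the clientID because a hash modulo the partition count can send two clients to the same partition, which would break the one-client-per-task property Susan asked about.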