Re: Questions about partitioning

Naveen S Fri, 24 Apr 2015 16:41:45 -0700

Hey Susan,
As far as I know, there is very little difference in performance between
the partition strategy and the topic strategy. In terms of how they are
allocated in memory they should be very similar, but I'll get some Kafka
experts to comment on that.


From Samza's perspective, if you choose to go with multiple partitions, you
can write a Samza job that repartitions the stream exactly as you
described: it reads the clientID from each stream event and sends the event
to the corresponding partition [1]. You can then have a second job in which
each Task processes one partition (which will correspond to events from one
clientID). In that implementation, there is a one-to-one mapping between
Tasks and Partitions.
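
To make the routing idea concrete, here is a minimal sketch in plain Java
with no Samza or Kafka dependency. The class and method names are
hypothetical, and the hash-modulo mapping is only an assumption for
illustration; it is not necessarily the partitioner your Kafka producer
uses. In a real job, the clientID would be supplied as the partition key of
the outgoing message [1].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: simulates what the repartitioning job does, in memory.
class ClientRepartitionSketch {

    // Assumed mapping: hash of the clientID modulo the partition count.
    // floorMod keeps the result non-negative even if hashCode() is negative.
    static int partitionFor(String clientId, int numPartitions) {
        return Math.floorMod(clientId.hashCode(), numPartitions);
    }

    // Each event is {clientId, payload}. Events with the same clientId
    // always end up in the same partition, in arrival order.
    static Map<Integer, List<String>> repartition(List<String[]> events,
                                                  int numPartitions) {
        Map<Integer, List<String>> partitions = new HashMap<>();
        for (String[] event : events) {
            int p = partitionFor(event[0], numPartitions);
            partitions.computeIfAbsent(p, k -> new ArrayList<>())
                      .add(event[0] + ":" + event[1]);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[] {"A", "x"},
            new String[] {"B", "y"},
            new String[] {"A", "z"});
        // The second job would then run one task per key of this map.
        repartition(events, 4).forEach((p, msgs) ->
            System.out.println("partition " + p + " -> " + msgs));
    }
}
```

Note that with a hash-based mapping, two clientIDs can land in the same
partition (always the case once there are more clients than partitions). If
a task must only ever see one client, the mapping from clientID to
partition would have to be explicit and collision-free.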


[1]
http://samza.apache.org/learn/documentation/0.9/api/javadocs/org/apache/samza/system/OutgoingMessageEnvelope.html

Thanks,
Naveen

On Fri, Apr 24, 2015 at 3:29 PM, Susan Luong <susan...@gmail.com> wrote:

> Hi there, I'm new to Samza/Kafka and we're evaluating Samza to see whether
> it would be a good fit for our application. I just had a few questions
> about how partitioning works.
>
> I understand there is a limitation on the number of topics we can create
> [1], and I was wondering: if we need more than, say, 10K topics, would it be
> a better idea to use partitioning instead? Or would the same limits apply?
> i.e. would having 1 topic with 10k partitions produce the same performance
> issues as having 10k topics with 1 partition each?
>
> If we can overcome the topics limitation by creating more partitions, we'd
> like to be able to divide up our stream messages by client ID. Is it
> possible to group partitions so that we have a set of partitions that
> contain data from a certain client and another set of partitions for
> another client, within the same topic?
>
> For example, we might have a stream partition 'A' (for clientID A) and a
> corresponding task 'a' that processes messages from partition 'A', and a
> partition B (for client B) and a corresponding task, 'b' that processes
> messages from stream partition 'B'. Our problem, though, is that we'd like
> for task 'a' to only process messages from stream A and never from stream
> B, since task 'a' may contain local state that applies specifically to
> stream A. Would this be possible?
>
> Maybe I'm not understanding how Samza works, but I'm hoping someone can
> help me clarify. Thanks in advance for your help.
>
> Susan
>
>
>
> [1]
> http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic
>
