Hi Kishore

In general I think it’s up to you to choose keys that keep related data 
together, but also give you reasonable load balancing. I’m afraid that I’m not 
sure I fully followed your explanation of how storm solves this problem more 
efficiently though.

I noticed you asked:  "How would this work for a multi-tenant stream processing 
where people want to write multiple stream jobs on the same set of data?” - I 
think this is simply that Consumer Group behaviour. Different applications 
would get different consumer groups (application ids in KStreams), giving them 
independent parallelism over the same data. 

In theory there is another knob to consider. Consumers (actually the leader 
consumer) can control which partitions they get assigned. KStreams already uses 
this feature to do things like create stand by replicas, but I don’t think (but 
I may be wrong) this helps you with your problem directly.  

All the best

B 

> On 21 Mar 2016, at 03:12, Kishore Senji <kse...@gmail.com> wrote:
> 
> I will scale back the question to get some replies :)
> 
> Suppose the use-case is to build a monitoring platform -
> For log aggregation from thousands of nodes, I believe that a Kafka topic
> should be partitioned n-ways and the data should be sprayed in a
> round-robin fashion to get a good even distribution of data (because we
> don't know upfront how the data is sliced by semantically and we don't know
> whether the key for semantic partitioning gives a even distribution of
> data). Later in stream processing, the appropriate group-bys would be done
> on the same source of data to support various ways of slicing.
> 
> 
> http://kafka.apache.org/documentation.html#design_loadbalancing - "This can
> be done at random, implementing a kind of random load balancing, or it can
> be done by some semantic partitioning function"
> http://kafka.apache.org/documentation.html#basic_ops_modify_topic - "Be
> aware that one use case for partitions is to semantically partition data,
> and adding partitions doesn't change the partitioning of existing data so
> this may disturb consumers if they rely on that partition"
> 
> The above docs caution the use of semantic partitioning as it can lead to
> uneven distribution (hotspots) if the semantic key does not give even
> distribution, plus on a flex up of partitions the data would now be in two
> partitions. For these reasons, I strongly believe the data should be pushed
> to Kafka in a round-robin fashion and later a Stream processing framework
> should use the appropriate group-bys (this also gives us the flexibility to
> slice in different ways as well at runtime)
> 
> KStreams let us do stream processing on a partition of data. So to do
> windowed aggregation, the data for the same key should be in the same
> partition. This means to use KStreams we have to use Semantic partitioning,
> which will have the above issues as shown in Kafka docs. So my question is -
> 
> If we want to use KStreams how should we deal with "Load balancing" (it can
> happen that the semantic partitioning can overload a single partition and
> so Kafka partition will be overloaded as well as the KStream task)  and
> "Flex up of partitions" (more than one partition will have data for a given
> key and so the windowed aggregations result in incorrect data)?
> 
> Thanks,
> Kishore.
> 
> On Thu, Mar 17, 2016 at 4:28 PM, Kishore Senji <kse...@gmail.com> wrote:
> 
>> Hi All,
>> 
>> I read through the doc on KStreams here:
>> http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
>> <http://www.google.com/url?q=http%3A%2F%2Fwww.confluent.io%2Fblog%2Fintroducing-kafka-streams-stream-processing-made-simple&sa=D&sntz=1&usg=AFQjCNGJu-bDlzStDwxPDIOKpG10Ts9xvA>
>> 
>> I was wondering about how an use-case that I have be solved with KStream?
>> 
>> Use-case: Logs are obtained from a service pool. It contains many nodes.
>> We want to alert if a particular consumer (identified by consumer id) is
>> making calls to the service more than X number of times in the last 1 min.
>> The data is available in logs similar to access logs for example. The
>> window is a sliding window. We have to check back 60s from the current
>> event and see if in total (with the current event) it would exceed the
>> threshold. Logs are pushed to Kafka using a random partitioner whose range
>> is [1 to n] where n is the total number of partitions.
>> 
>> One way of achieving this is to push data in to the first Kafka topic
>> (using random partitioning) and then a set of KStream tasks re-shuffling
>> the data on consumer_id in to the second topic. The next set of KStream
>> tasks operate on the second topic (1 task/partition) and do the
>> aggregation. If this is an acceptable solution, here are my questions on
>> scaling.
>> 
>> 
>>   - I can see that the second topic is prone to hotspots. If we get
>>   billions of requests for a given consumer_id and only few hundreds for
>>   another consumer_id, the second kafka topic partitions will become hotspots
>>   (and the partition getting lot of volume of logs can suffocate other
>>   partitions on the same broker). If we try to create more partitions and
>>   probably isolate the partition getting lot of volume, this wastes 
>> resources.
>>   - The max parallelism that we can get for a KStream task is the number
>>   of partitions - this may work for a single stream. How would this work for
>>   a multi-tenant stream processing where people want to write multiple stream
>>   jobs on the same set of data? If the parallelism does not work, they would
>>   have to copy and group the data in to another topic with more partitions. I
>>   think like we need two knobs one for scaling Kafka (number of partitions)
>>   and one for scaling stream. It sounds like with KStream it is only one knob
>>   for both.
>>   - How would we deal with organic growth of data? Let us say the
>>   partitions we chose for the second topic (where it is grouped by
>>   consumer_id) is not enough to deal with organic growth in volume. If we
>>   increase partitions, for a given consumer some data could be in one
>>   partition before the flex up and data could end up in a different topic
>>   after flex up. Since the KStream jobs are unique per partition and are
>>   stateless across them, the aggregated result would be incorrect, unless we
>>   have only one job to read all the data in which case it will become a
>>   bottleneck.
>> 
>> In storm (or any other streaming engine), the way to solve it would be to
>> have only one topic (partitioned n-ways) and data pushed in to using a
>> random partitioner (so no hotspots and scaling issues). We will have n
>> Spouts reading data from those partitions and we can then have m bolts
>> getting the data using fields grouping on consumer_id. Since all the data
>> for a given consumer_id ends up in a bolt we will do the sliding window and
>> the alert.
>> 
>> If we solve it the Storm way in KStream, we would only have one topic 
>> (partitioned
>> n-ways) and data pushed in to using a random partitioner (so no hotspots
>> and scaling issues). But we can only have one KStream task running reading
>> all the data and doing the windowing and aggregation. This will become a
>> bottleneck for scaling.
>> 
>> So it sounds like KStreams will either have "hotspots" in kafka topic (as
>> each partition needs to have the data that the KStream task needs and work
>> independently) or scaling issues in the KStream task for "aggregation".
>> 
>> How would one solve this kind of problems with KStream?
>> 
>> 
>> 
>> 
>> <https://lh3.googleusercontent.com/-uxCWSNe6nw8/Vusvqd5nLAI/AAAAAAAAD5s/aqY-DexRkn097M9egCLLb3D_ANyCKm60w/s1600/Screen%2BShot%2B2016-03-17%2Bat%2B3.30.17%2BPM.png>
>> 

Reply via email to