Re: [DISCUSS] New partitioning for better load balancing

Gianmarco De Francisci Morales Sun, 05 Apr 2015 00:20:47 -0700

Hi Jay,

Thanks, that sounds like a necessary step. I guess I expected something like
that to be already there, at least internally.
I created KAFKA-2092 to track the PKG integration.


Cheers,

--
Gianmarco

On 4 April 2015 at 23:50, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey guys,
>
> I think the first step here would be to expose a partitioner interface for
> the new producer that would make it easy to plug in these different
> strategies. I filed a JIRA for this:
> https://issues.apache.org/jira/browse/KAFKA-2091
>
> -Jay
>
> On Fri, Apr 3, 2015 at 9:36 AM, Harsha <ka...@harsha.io> wrote:
>
>> Gianmarco,
>> I am coming from the Storm community. I think PKG is very interesting,
>> and we can provide an implementation of Partitioner for PKG.
>> Can you open a JIRA for this?
>>
>> --
>> Harsha
>> Sent with Airmail
>>
>> On April 3, 2015 at 4:49:15 AM, Gianmarco De Francisci Morales (
>> g...@apache.org) wrote:
>>
>> Hi,
>>
>> We have recently studied the problem of load balancing in distributed
>> stream processing systems such as Samza [1].
>> In particular, we focused on what happens when the key distribution of the
>> stream is skewed when using key grouping.
>> We developed a new stream partitioning scheme (which we call Partial Key
>> Grouping). It achieves better load balancing than hashing while being more
>> scalable than round robin in terms of memory.
>>
>> In the paper we show a number of mining algorithms that are easy to
>> implement with partial key grouping, and whose performance can benefit
>> from it. We think that it might also be useful for a larger class of
>> algorithms.
>>
>> PKG has already been integrated in Storm [2], and I would like to be able
>> to use it in Samza as well. As far as I understand, Kafka producers are
>> the ones that decide how to partition the stream (or Kafka topic). Even
>> after doing a bit of reading, I am still not sure if I should be writing
>> this email here or on the Samza dev list. Anyway, my first guess is Kafka.
>>
>> I do not have experience with Kafka; however, partial key grouping is very
>> easy to implement: it requires just a few lines of code in Java when
>> implemented as a custom grouping in Storm [3].
>> I believe it should be very easy to integrate.
>>
>> For all these reasons, I believe it will be a nice addition to
>> Kafka/Samza. If the community thinks it's a good idea, I will be happy
>> to offer support in the porting.
>>
>> References:
>> [1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>> [2] https://issues.apache.org/jira/browse/STORM-632
>> [3] https://github.com/gdfm/partial-key-grouping
>> --
>> Gianmarco
>>
>
>
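
For readers following the thread, here is a minimal sketch of the idea
behind Partial Key Grouping as described in the quoted message: each key
gets two candidate partitions (one per hash function), and every message
goes to whichever candidate the sender has loaded less so far. This is
only an illustration, not the Storm implementation from [2] or [3] and
not a Kafka API. The class name PartialKeyGrouping, the seeded FNV-1a
style hash, and the two seed constants are all assumptions made up for
this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of partial key grouping; names and hash choice are illustrative.
class PartialKeyGrouping {
    // Arbitrary, distinct seeds so the two hashes behave like two different functions.
    private static final int SEED_A = 0x9747b28c;
    private static final int SEED_B = 0x2545f491;

    // Number of messages this sender has routed to each partition (a local load estimate).
    private final long[] sentCounts;

    PartialKeyGrouping(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.sentCounts = new long[numPartitions];
    }

    // Seeded FNV-1a style hash; any pair of reasonably independent hashes would do.
    private static int hash(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    private int candidate(byte[] key, int seed) {
        // floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(hash(key, seed), sentCounts.length);
    }

    // The two candidate partitions for a key; they may coincide.
    int[] candidates(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        return new int[] { candidate(bytes, SEED_A), candidate(bytes, SEED_B) };
    }

    // Route one message: pick the less loaded candidate and record the choice.
    int partition(String key) {
        int[] c = candidates(key);
        int chosen = sentCounts[c[0]] <= sentCounts[c[1]] ? c[0] : c[1];
        sentCounts[chosen]++;
        return chosen;
    }

    // Copy of the per-partition counts, for inspection.
    long[] loads() {
        return sentCounts.clone();
    }
}
```

With this sketch, repeated calls such as partition("hot") alternate
between that key's two candidates instead of sending every message to a
single partition, which is how a skewed key ends up split across two
workers. The sender keeps only one counter per partition and needs no
per-key routing table. Because a key can land on either candidate, any
per-key state downstream has to be merged across the two.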
