Hi Jay,

Thanks, that sounds like a necessary step. I guess I expected something like that to already be there, at least internally. I created KAFKA-2092 to track the PKG integration.
Cheers,

--
Gianmarco

On 4 April 2015 at 23:50, Jay Kreps <jay.kr...@gmail.com> wrote:

> Hey guys,
>
> I think the first step here would be to expose a partitioner interface for
> the new producer that would make it easy to plug in these different
> strategies. I filed a JIRA for this:
> https://issues.apache.org/jira/browse/KAFKA-2091
>
> -Jay
>
> On Fri, Apr 3, 2015 at 9:36 AM, Harsha <ka...@harsha.io> wrote:
>
>> Gianmarco,
>> I am coming from the Storm community. I think PKG is very
>> interesting and we can provide an implementation of Partitioner for PKG.
>> Can you open a JIRA for this?
>>
>> --
>> Harsha
>> Sent with Airmail
>>
>> On April 3, 2015 at 4:49:15 AM, Gianmarco De Francisci Morales (
>> g...@apache.org) wrote:
>>
>> Hi,
>>
>> We have recently studied the problem of load balancing in distributed
>> stream processing systems such as Samza [1].
>> In particular, we focused on what happens when the key distribution of the
>> stream is skewed when using key grouping.
>> We developed a new stream partitioning scheme (which we call Partial Key
>> Grouping). It achieves better load balancing than hashing while being more
>> scalable than round robin in terms of memory.
>>
>> In the paper we show a number of mining algorithms that are easy to
>> implement with partial key grouping, and whose performance can benefit
>> from it. We think that it might also be useful for a larger class of
>> algorithms.
>>
>> PKG has already been integrated in Storm [2], and I would like to be able
>> to use it in Samza as well. As far as I understand, Kafka producers are
>> the ones that decide how to partition the stream (or Kafka topic). Even
>> after doing a bit of reading, I am still not sure if I should be writing
>> this email here or on the Samza dev list. Anyway, my first guess is Kafka.
>>
>> I do not have experience with Kafka; however, partial key grouping is very
>> easy to implement: it requires just a few lines of code in Java when
>> implemented as a custom grouping in Storm [3].
>> I believe it should be very easy to integrate.
>>
>> For all these reasons, I believe it will be a nice addition to
>> Kafka/Samza.
>> If the community thinks it's a good idea, I will be happy to offer support
>> in the porting.
>>
>> References:
>> [1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>> [2] https://issues.apache.org/jira/browse/STORM-632
>> [3] https://github.com/gdfm/partial-key-grouping
>> --
>> Gianmarco
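
For readers who have not seen partial key grouping before, here is a rough sketch of the idea discussed in this thread, written in Java. It is not the code from [3] and it is not a Kafka Partitioner implementation. The class name PartialKeyGrouper, the methods choose and loads, the two seed values, and the hash mixing are all made up for illustration. The sketch assumes each sender keeps only a local count of the messages it has routed to each partition, hashes every key to two candidate partitions, and picks the candidate with the smaller count.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Kafka, Samza, or Storm. */
final class PartialKeyGrouper {
    // Local load estimate: messages this sender has routed to each partition.
    private final long[] sent;

    PartialKeyGrouper(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.sent = new long[numPartitions];
    }

    /** Picks the less loaded of the key's two candidates and records the send. */
    int choose(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int first = candidate(bytes, 17);   // seed values are arbitrary assumptions
        int second = candidate(bytes, 31);  // the two candidates may coincide
        int chosen = sent[first] <= sent[second] ? first : second;
        sent[chosen]++;
        return chosen;
    }

    /** Seeded hash of the key bytes, mapped to a partition index. */
    private int candidate(byte[] bytes, int seed) {
        int h = Arrays.hashCode(bytes) ^ seed;
        // Simple integer mixing so the two seeds give different-looking results.
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, sent.length);
    }

    /** Returns a copy of the per-partition counts seen by this sender. */
    long[] loads() {
        return sent.clone();
    }
}
```

With this sketch a single hot key is spread over at most two partitions instead of one, which is where the improvement over plain hashing comes from, while the sender only needs one counter per partition. A consumer that aggregates per key then has to merge at most two partial results for each key.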