Hey guys, I think the first step here would be to expose a partitioner interface for the new producer that would make it easy to plug in these different strategies. I filed a JIRA for this: https://issues.apache.org/jira/browse/KAFKA-2091
-Jay

On Fri, Apr 3, 2015 at 9:36 AM, Harsha <ka...@harsha.io> wrote:

> Gianmarco,
> I am coming from the Storm community. I think PKG is very interesting,
> and we can provide an implementation of Partitioner for PKG.
> Can you open a JIRA for this?
>
> --
> Harsha
> Sent with Airmail
>
> On April 3, 2015 at 4:49:15 AM, Gianmarco De Francisci Morales (
> g...@apache.org) wrote:
>
> Hi,
>
> We have recently studied the problem of load balancing in distributed
> stream processing systems such as Samza [1].
> In particular, we focused on what happens when the key distribution of the
> stream is skewed when using key grouping.
> We developed a new stream partitioning scheme (which we call Partial Key
> Grouping). It achieves better load balancing than hashing while being more
> scalable than round robin in terms of memory.
>
> In the paper we show a number of mining algorithms that are easy to
> implement with partial key grouping, and whose performance can benefit from
> it. We think that it might also be useful for a larger class of algorithms.
>
> PKG has already been integrated into Storm [2], and I would like to be able
> to use it in Samza as well. As far as I understand, Kafka producers are the
> ones that decide how to partition the stream (or Kafka topic). Even after
> doing a bit of reading, I am still not sure if I should be writing this
> email here or on the Samza dev list. Anyway, my first guess is Kafka.
>
> I do not have experience with Kafka; however, partial key grouping is very
> easy to implement: it requires just a few lines of code in Java when
> implemented as a custom grouping in Storm [3].
> I believe it should be very easy to integrate.
>
> For all these reasons, I believe it will be a nice addition to Kafka/Samza.
> If the community thinks it's a good idea, I will be happy to offer support
> in the porting.
>
> References:
> [1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> [2] https://issues.apache.org/jira/browse/STORM-632
> [3] https://github.com/gdfm/partial-key-grouping
> --
> Gianmarco
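
For anyone following the thread, here is a minimal sketch of the idea behind partial key grouping as I read it from the description above and the paper's title ("the power of both choices"): each key gets two candidate partitions from two different hash functions, and the sender picks whichever of the two it has locally sent fewer messages to. This is not the code from [3] and it does not implement any Kafka interface. The class name, method names, seeds, and the hash mixer are all hypothetical choices made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical standalone sketch of partial key grouping; not Kafka or Storm code. */
final class PartialKeyGrouping {
    // Arbitrary seeds so the two hash functions differ (illustrative values only).
    private static final int SEED_A = 0x9E3779B9;
    private static final int SEED_B = 0x7F4A7C15;

    // Number of messages this sender has assigned to each partition so far.
    // This is a purely local estimate; senders do not coordinate.
    private final long[] load;

    PartialKeyGrouping(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.load = new long[numPartitions];
    }

    /** Picks the less-loaded of the key's two candidate partitions and records the send. */
    int choosePartition(byte[] keyBytes) {
        int h = Arrays.hashCode(keyBytes);
        int first = Math.floorMod(mix(h ^ SEED_A), load.length);
        int second = Math.floorMod(mix(h ^ SEED_B), load.length);
        int chosen = load[first] <= load[second] ? first : second;
        load[chosen]++;
        return chosen;
    }

    /** Local message count for one partition. */
    long loadOf(int partition) {
        return load[partition];
    }

    // Integer mixer using the constants of MurmurHash3's 32-bit finalizer.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    // Small demo: a skewed stream where one key accounts for most messages.
    public static void main(String[] args) {
        PartialKeyGrouping pkg = new PartialKeyGrouping(4);
        for (int i = 0; i < 10_000; i++) {
            String key = (i % 10 < 8) ? "hot-key" : "key-" + (i % 50);
            pkg.choosePartition(key.getBytes(StandardCharsets.UTF_8));
        }
        for (int p = 0; p < 4; p++) {
            System.out.println("partition " + p + ": " + pkg.loadOf(p));
        }
    }
}
```

With plain hashing every message for a hot key lands on one partition; here it is split across at most two. The trade-off is that whatever consumes the topic has to merge per-key results from two partitions. Once a partitioner interface along the lines of KAFKA-2091 exists, a method like `choosePartition` could presumably be called from it, but that wiring is an assumption on my part.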