I see, thanks for the clarification.

Guozhang
On Tue, Apr 7, 2015 at 1:50 AM, Gianmarco De Francisci Morales <g...@apache.org> wrote:

> Hi Guozhang,
>
> Thanks for your comments.
>
> 1) Yes, ordering cannot be guaranteed in PKG. In general, algorithms that
> use PKG should compute commutative and associative functions of the input.
> If you need strict ordering (i.e., the function is not commutative) within
> a partition, use KG.
>
> 2) I am not sure I understand the issue. PKG does not deal with inter-topic
> load balancing. Topic A and topic B are completely independent in our
> framework.
>
> Cheers,
>
> --
> Gianmarco
>
> On 7 April 2015 at 02:56, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Gianmarco,
> >
> > I browsed through your paper (congrats on the ICDE publication BTW!), and
> > here are some questions / comments on the algorithm:
> >
> > 1. One motivation for enabling key-based partitioning in Kafka is to
> > achieve per-key ordering, i.e., with all messages with the same key sent
> > to the same partition, their ordering is preserved. However, "key-splitting"
> > seems to break this guarantee, and now messages with the same key may be
> > sent to 2 (or generally speaking many) partitions.
> >
> > 2. As for the local load estimation, there is a second mapping from
> > partitions (workers in your paper) to broker hosts besides the mapping from
> > keys to partitions, and not all broker hosts maintain each of the
> > partitions. For example, there are 4 brokers, and broker-1/2 each takes one
> > of the two partitions of topic A, while broker-3/4 each takes one of the
> > two partitions of topic B, etc.
> >
> > I am wondering if those two issues can be resolved with the PKG framework?
> >
> > Guozhang
> >
> > On Sun, Apr 5, 2015 at 12:19 AM, Gianmarco De Francisci Morales <
> > g...@apache.org> wrote:
> >
> > > Hi Jay,
> > >
> > > Thanks, that sounds like a necessary step. I guess I expected something
> > > like that to be already there, at least internally.
> > > I created KAFKA-2092 to track the PKG integration.
> > >
> > > Cheers,
> > >
> > > --
> > > Gianmarco
> > >
> > > On 4 April 2015 at 23:50, Jay Kreps <jay.kr...@gmail.com> wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I think the first step here would be to expose a partitioner interface
> > > > for the new producer that would make it easy to plug in these different
> > > > strategies. I filed a JIRA for this:
> > > > https://issues.apache.org/jira/browse/KAFKA-2091
> > > >
> > > > -Jay
> > > >
> > > > On Fri, Apr 3, 2015 at 9:36 AM, Harsha <ka...@harsha.io> wrote:
> > > >
> > > >> Gianmarco,
> > > >> I am coming from the Storm community. I think PKG is very
> > > >> interesting and we can provide an implementation of Partitioner for
> > > >> PKG. Can you open a JIRA for this?
> > > >>
> > > >> --
> > > >> Harsha
> > > >> Sent with Airmail
> > > >>
> > > >> On April 3, 2015 at 4:49:15 AM, Gianmarco De Francisci Morales (
> > > >> g...@apache.org) wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> We have recently studied the problem of load balancing in distributed
> > > >> stream processing systems such as Samza [1].
> > > >> In particular, we focused on what happens when the key distribution of
> > > >> the stream is skewed when using key grouping.
> > > >> We developed a new stream partitioning scheme (which we call Partial
> > > >> Key Grouping). It achieves better load balancing than hashing while
> > > >> being more scalable than round robin in terms of memory.
> > > >>
> > > >> In the paper we show a number of mining algorithms that are easy to
> > > >> implement with partial key grouping, and whose performance can benefit
> > > >> from it. We think that it might also be useful for a larger class of
> > > >> algorithms.
> > > >>
> > > >> PKG has already been integrated in Storm [2], and I would like to be
> > > >> able to use it in Samza as well. As far as I understand, Kafka
> > > >> producers are the ones that decide how to partition the stream (or
> > > >> Kafka topic). Even after doing a bit of reading, I am still not sure
> > > >> if I should be writing this email here or on the Samza dev list.
> > > >> Anyway, my first guess is Kafka.
> > > >>
> > > >> I do not have experience with Kafka; however, partial key grouping is
> > > >> very easy to implement: it requires just a few lines of code in Java
> > > >> when implemented as a custom grouping in Storm [3].
> > > >> I believe it should be very easy to integrate.
> > > >>
> > > >> For all these reasons, I believe it will be a nice addition to
> > > >> Kafka/Samza.
> > > >> If the community thinks it's a good idea, I will be happy to offer
> > > >> support in the porting.
> > > >>
> > > >> References:
> > > >> [1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> > > >> [2] https://issues.apache.org/jira/browse/STORM-632
> > > >> [3] https://github.com/gdfm/partial-key-grouping
> > > >> --
> > > >> Gianmarco
> >
> > --
> > -- Guozhang

--
-- Guozhang
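For readers following the thread, the key-splitting idea discussed above can be sketched roughly as follows: each key hashes to two candidate partitions, and the sender picks whichever candidate it has locally sent fewer messages to. This is only an illustrative sketch, not the code from the Storm integration or from any Kafka Partitioner; the class name PartialKeyGrouper, the method names, and the seeded hash mixing are all assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of partial key grouping (PKG); names and hashing are
// assumptions for illustration, not the Storm or Kafka implementation.
final class PartialKeyGrouper {
    // Local load estimate: how many messages THIS sender routed to each partition.
    private final long[] sentCounts;

    PartialKeyGrouper(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.sentCounts = new long[numPartitions];
    }

    // Assumption: a simple seeded bit-mixer stands in for two independent hash functions.
    private static int seededHash(Object key, int seed) {
        int h = key.hashCode() ^ seed;
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        h *= 0xc2b2ae35;
        h ^= (h >>> 16);
        return h;
    }

    // Picks the less loaded of the key's two candidate partitions and records the send.
    int choosePartition(Object key) {
        int n = sentCounts.length;
        int first = Math.floorMod(seededHash(key, 0x9747b28c), n);
        int second = Math.floorMod(seededHash(key, 0x3c6ef372), n);
        int chosen = sentCounts[first] <= sentCounts[second] ? first : second;
        sentCounts[chosen]++;
        return chosen;
    }

    // Returns a copy of the local per-partition counts.
    long[] loads() {
        return sentCounts.clone();
    }

    public static void main(String[] args) {
        PartialKeyGrouper grouper = new PartialKeyGrouper(4);
        String[] keys = {"a", "a", "a", "a", "b", "c", "a", "a"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + grouper.choosePartition(key));
        }
        System.out.println("loads = " + Arrays.toString(grouper.loads()));
    }
}
```

Because a given key can land on either of its two candidates, per-key ordering is not preserved, which is the trade-off raised in point 1 of the thread; downstream consumers would need to merge partial results with a commutative and associative function.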