Re: [DISCUSS] New partitioning for better load balancing

Guozhang Wang Tue, 07 Apr 2015 09:54:57 -0700

I see, thanks for the clarification.

Guozhang


On Tue, Apr 7, 2015 at 1:50 AM, Gianmarco De Francisci Morales <
g...@apache.org> wrote:

> Hi Guozhang,
>
> Thanks for your comments.
>
> 1) Yes, ordering cannot be guaranteed in PKG. In general, algorithms that
> use PKG should compute commutative and associative functions of the input.
> If you need strict ordering (i.e., the function is not commutative) within
> a partition, use KG.
>
> 2) I am not sure I understand the issue. PKG does not deal with inter-topic
> load balancing. Topic A and topic B are completely independent in our
> framework.
>
> Cheers,
>
> --
> Gianmarco
>
> On 7 April 2015 at 02:56, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Gianmarco,
> >
> > I browsed through your paper (congrats on the ICDE publication BTW!), and
> > here are some questions / comments on the algorithm:
> >
> > 1. One motivation for enabling key-based partitioning in Kafka is to
> > achieve per-key ordering, i.e. when all messages with the same key are sent
> > to the same partition, their ordering is preserved. However, "key-splitting"
> > seems to break this guarantee, since messages with the same key may now be
> > sent to 2 (or generally speaking many) partitions.
> >
> > 2. As for the local load estimation, there is a second mapping from
> > partitions (workers in your paper) to broker hosts besides the mapping
> > from keys to partitions, and not all broker hosts maintain each of the
> > partitions. For example, there are 4 brokers, and broker-1/2 each takes
> > one of the two partitions of topic A, while broker-3/4 each takes one of
> > the two partitions of topic B, etc.
> >
> > I am wondering if those two issues can be resolved with the PKG
> > framework?
> >
> > Guozhang
> >
> > On Sun, Apr 5, 2015 at 12:19 AM, Gianmarco De Francisci Morales <
> > g...@apache.org> wrote:
> >
> > > Hi Jay,
> > >
> > > Thanks, that sounds like a necessary step. I guess I expected something
> > > like that to be already there, at least internally.
> > > I created KAFKA-2092 to track the PKG integration.
> > >
> > > Cheers,
> > >
> > > --
> > > Gianmarco
> > >
> > > On 4 April 2015 at 23:50, Jay Kreps <jay.kr...@gmail.com> wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I think the first step here would be to expose a partitioner interface
> > > > for the new producer that would make it easy to plug in these different
> > > > strategies. I filed a JIRA for this:
> > > > https://issues.apache.org/jira/browse/KAFKA-2091
> > > >
> > > > -Jay
> > > >
> > > > On Fri, Apr 3, 2015 at 9:36 AM, Harsha <ka...@harsha.io> wrote:
> > > >
> > > >> Gianmarco,
> > > >>                  I am coming from storm community. I think PKG is a
> > very
> > > >> interesting and we can provide an implementation of Partitioner for
> > PKG.
> > > >> Can you open a JIRA for this.
> > > >>
> > > >> --
> > > >> Harsha
> > > >> Sent with Airmail
> > > >>
> > > >> On April 3, 2015 at 4:49:15 AM, Gianmarco De Francisci Morales (
> > > >> g...@apache.org) wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > > >> We have recently studied the problem of load balancing in distributed
> > > > >> stream processing systems such as Samza [1].
> > > > >> In particular, we focused on what happens when the key distribution of
> > > > >> the stream is skewed when using key grouping.
> > > > >> We developed a new stream partitioning scheme (which we call Partial
> > > > >> Key Grouping). It achieves better load balancing than hashing while
> > > > >> being more scalable than round robin in terms of memory.
> > > >>
> > > > >> In the paper we show a number of mining algorithms that are easy to
> > > > >> implement with partial key grouping, and whose performance can benefit
> > > > >> from it. We think that it might also be useful for a larger class of
> > > > >> algorithms.
> > > >>
> > > > >> PKG has already been integrated in Storm [2], and I would like to be
> > > > >> able to use it in Samza as well. As far as I understand, Kafka producers
> > > > >> are the ones that decide how to partition the stream (or Kafka topic).
> > > > >> Even after doing a bit of reading, I am still not sure if I should be
> > > > >> writing this email here or on the Samza dev list. Anyway, my first guess
> > > > >> is Kafka.
> > > >>
> > > > >> I do not have experience with Kafka; however, partial key grouping is
> > > > >> very easy to implement: it requires just a few lines of code in Java
> > > > >> when implemented as a custom grouping in Storm [3].
> > > > >> I believe it should be very easy to integrate.
> > > >>
> > > > >> For all these reasons, I believe it will be a nice addition to
> > > > >> Kafka/Samza.
> > > > >> If the community thinks it's a good idea, I will be happy to offer
> > > > >> support in the porting.
> > > >>
> > > >> References:
> > > >> [1]
> > > >>
> > > >>
> > >
> >
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> > > >> [2] https://issues.apache.org/jira/browse/STORM-632
> > > >> [3] https://github.com/gdfm/partial-key-grouping
> > > >> --
> > > >> Gianmarco
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang
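
The partial key grouping idea discussed in this thread (each key is hashed to
two candidate partitions, and the sender picks the candidate it has locally
sent fewer messages to) can be sketched roughly as below. This is only a
minimal illustration under stated assumptions, not the Storm grouping from [3]
and not a Kafka Partitioner: the class name PartialKeyGrouping, its methods,
and the seeded hash function are all hypothetical and chosen for the example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of partial key grouping (PKG); names are illustrative only. */
class PartialKeyGrouping {
    // Local load estimate: how many messages THIS sender has routed to each partition.
    private final long[] sentCounts;

    PartialKeyGrouping(int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        this.sentCounts = new long[numPartitions];
    }

    /** Returns the two candidate partitions for a key (they may coincide). */
    int[] candidates(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int first = Math.floorMod(seededHash(bytes, 1), sentCounts.length);
        int second = Math.floorMod(seededHash(bytes, 2), sentCounts.length);
        return new int[] {first, second};
    }

    /** Routes one message: picks the candidate with the smaller local count. */
    int partition(String key) {
        int[] c = candidates(key);
        int chosen = sentCounts[c[0]] <= sentCounts[c[1]] ? c[0] : c[1];
        sentCounts[chosen]++;
        return chosen;
    }

    /** Number of messages this sender has routed to the given partition. */
    long sent(int partition) {
        return sentCounts[partition];
    }

    // Assumption: a simple FNV-1a style hash with a seed and a final mixing step.
    // Any pair of reasonably independent hash functions would serve the same role.
    private static int seededHash(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }
}
```

In this sketch a given key only ever lands on its two candidate partitions,
which is why per-key ordering is lost (point 1 above) and why downstream
consumers would need to merge partial results with a commutative and
associative function.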
