I think skewed graphs can be considered quite common, i.e., not a corner
case.
So if there is code to significantly speed up computations on such graphs,
this would definitely be interesting for Gelly, IMO.

Would it be possible to integrate your approach with existing library
algorithms and offer a switch (boolean parameter) to execute the algorithm
either with the regular implementation or the skewed-optimized
implementation? That requires of course that both methods are semantically
equivalent, i.e., produce exactly the same result. If that is given, this
would be similar to choosing between a hash- and a sort-based join
algorithm.

2015-08-12 16:07 GMT+02:00 Stephan Ewen <se...@apache.org>:

> Same here as for Max, I am not familiar enough any more to make really good
> comments.
>
> Some generic comments, though:
>
>   - Check whether you really need a technique. Techniques that improve
> corner cases, but make the code much more complex and make the behavior of
> algorithms less robust are often better to leave out.
>
>   - It is usually easier to something as an optional extension, rather then
> adding it to a core operation. Easier to add, easier to remove if one finds
> that it does not work out as anticipated in hind sight
>
>   - If it works as a Gelly extension, you can always put it on your own
> github repository. The Gelly docs could list a set of known
> algorithms/operations on top of Gelly. If the interest in the extension is
> high, you can still reconsider to move it into the core codebase.
>
> Stephan
>
>
>
> On Wed, Aug 12, 2015 at 2:35 PM, Maximilian Michels <m...@apache.org>
> wrote:
>
> > I think this is a decision to be made by the people involved in the Gelly
> > library. I'm not very familiar with graph processing libraries. Thus, it
> is
> > hard for me to asses the value of this contribution.
> >
> > However, you outlined pretty well that for highly skewed graphs your
> > technique results in a much better performance. So even if your addition
> > might not be relevant to all users of the Gelly API, there are users that
> > could demand such processing techniques. Since you want to develop,
> > maintain, and document the code, I think there is absolutely nothing to
> say
> > against opening a pull request :)
> >
> > Cheers,
> > Max
> >
> >
> >
> > On Wed, Aug 12, 2015 at 12:16 PM, Andra Lungu <lungu.an...@gmail.com>
> > wrote:
> >
> > > I would love to get some feedback from the guys at data Artisans about
> > this
> > > one.
> > > So far, the comments originated and spread in the Stockholm area :)
> > >
> > > On Tue, Aug 11, 2015 at 6:33 PM, Andra Lungu <lungu.an...@gmail.com>
> > > wrote:
> > >
> > > > Hi Samia,
> > > >
> > > > A good method to statistically determine skewed vertices was beyond
> the
> > > > purpose of my thesis. Unfortunately, the statistical methods that
> fit a
> > > > power law distribution don't do a good job. So what I do is that I
> plot
> > > the
> > > > degree distribution and then visually determine the threshold. That
> is
> > > also
> > > > how other state of the art systems, such as PowerLyra, fix the issue.
> > > >
> > > > A way to automate this would be a nice future addition :).
> > > >
> > > >
> > > > On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <samk.3...@gmail.com>
> > > wrote:
> > > >
> > > >> Dear Andra,
> > > >>
> > > >> The idea seems pretty nice.
> > > >>
> > > >> I wonder how you decide the threshold to separate the high degree
> > > vertices
> > > >> from the low degree vertices.
> > > >>
> > > >> Regards,
> > > >> Samia
> > > >>
> > > >> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <lungu.an...@gmail.com
> >
> > > >> wrote:
> > > >>
> > > >> > Hi Paris,
> > > >> >
> > > >> > Nice to virtually meet you too :)
> > > >> >
> > > >> > Maybe it makes sense to share my freshest chart:
> > > >> >
> > > >> >
> > > >>
> > >
> >
> https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing
> > > >> >
> > > >> > This is for the Community Detection algorithm [1] in which you
> > > basically
> > > >> > find communities by continuously rescoring vertices. In a
> > > vertex-centric
> > > >> > scenario (for a skewed graph, in this case, the Twitter follower
> > > >> network),
> > > >> > since all vertices need to sync after an iteration superstep, they
> > > will
> > > >> > remain idle until a high degree vertex terminates (computing an
> > > >> aggregate
> > > >> > over my followers takes significantly less time than computing one
> > > over
> > > >> > Obama's followers).
> > > >> >
> > > >> > What you have on the right is the time elapsed in the regular
> vertex
> > > >> > centric Community Detection (more than two hours).
> > > >> > And since we split the node in aggregation trees, in the left
> there
> > > is a
> > > >> > chart with the various levels. Level 3 produces the best result.
> > > >> >
> > > >> > It's a bit cumbersome to put months of work in a mail, I am still
> > > >> polishing
> > > >> > the documentation, but feel free to ask questions if this is still
> > > >> unclear.
> > > >> >
> > > >> >
> > > >> > All in all, the technique is best for linear algorithms that work
> > just
> > > >> in
> > > >> > Pregel, for a skewed graph. On a high level, it's a generalisation
> > of
> > > >> > PowerGraph's mirroring, that can work on any system, and that does
> > not
> > > >> > restrict itself to level = 1 (like PowerGraph does) which in the
> > case
> > > of
> > > >> > out chart there did not bring us too far.
> > > >> >
> > > >> > Hope that clarified things a bit more!
> > > >> > Andra
> > > >> >
> > > >> > [1] http://arxiv.org/pdf/0808.2633.pdf
> > > >> >
> > > >> >
> > > >> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <par...@kth.se>
> > wrote:
> > > >> >
> > > >> > > Hi Andra and nice to meet you btw :)
> > > >> > >
> > > >> > > It sounds like very fancy way to deal with skew, I like the idea
> > > even
> > > >> > > though I am not a graph analytics expert.
> > > >> > > Have you ran any experiments or benchmarks to see when this
> > > >> preferable ?
> > > >> > > Users should be aware when they will get benefits by using it
> > since
> > > >> node
> > > >> > > splitting doesn’t come with no cost I guess.
> > > >> > > I am really eager to see how this will evolve, I think it’s good
> > > >> effort.
> > > >> > >
> > > >> > > cheers
> > > >> > > Paris
> > > >> > >
> > > >> > >
> > > >> > > > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.an...@gmail.com>
> > > >> wrote:
> > > >> > > >
> > > >> > > > Hi Vasia,
> > > >> > > >
> > > >> > > > I shall polish the functions a bit, but this is more or less
> > what
> > > I
> > > >> had
> > > >> > > in
> > > >> > > > mind:
> > > >> > > > GSA Jaccard [what we have in Gelly right now]:
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java
> > > >> > > > The same version with node split:
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java
> > > >> > > >
> > > >> > > > Yes, sure I can also share some charts with the results. I
> could
> > > say
> > > >> > for
> > > >> > > > which data sets (power law) and types of algorithms this can
> be
> > > >> used.
> > > >> > If
> > > >> > > > it's used appropriately, the overhead is 0.
> > > >> > > >
> > > >> > > > I'm open to suggestions!
> > > >> > > > Thanks!
> > > >> > > > Andra
> > > >> > > >
> > > >> > > >
> > > >> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <
> > > >> > > vasilikikala...@gmail.com
> > > >> > > >> wrote:
> > > >> > > >
> > > >> > > >> Hi Andra,
> > > >> > > >>
> > > >> > > >> thanks for offering to add this work to Gelly and for
> starting
> > > the
> > > >> > > >> discussion!
> > > >> > > >>
> > > >> > > >> How do you think this would look like from an API point of
> > view?
> > > >> Is it
> > > >> > > easy
> > > >> > > >> to make it transparent to the application? Could you give us
> a
> > > >> simple
> > > >> > > >> example of what you have in mind?
> > > >> > > >>
> > > >> > > >> Apart from usability, we should also consider documenting
> when
> > to
> > > >> use
> > > >> > > this
> > > >> > > >> feature, i.e. for which algorithms, what kind degree
> > distribution
> > > >> and
> > > >> > > what
> > > >> > > >> overhead to expect (if any).
> > > >> > > >>
> > > >> > > >> Cheers,
> > > >> > > >> Vasia.
> > > >> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <
> lungu.an...@gmail.com
> > >
> > > >> > wrote:
> > > >> > > >>
> > > >> > > >>> Hey,
> > > >> > > >>>
> > > >> > > >>> Before actually opening a PR, I wanted to hear your opinion.
> > So,
> > > >> here
> > > >> > > >> goes
> > > >> > > >>> nothing :).
> > > >> > > >>>
> > > >> > > >>> I'd like to add the core of my master thesis to Gelly. That
> > is,
> > > a
> > > >> > > series
> > > >> > > >> of
> > > >> > > >>> operators that take a skewed graph, split its high degree
> > > vertices
> > > >> > into
> > > >> > > >>> subvertices and redistribute the edges accordingly (thus
> > > producing
> > > >> > the
> > > >> > > >>> result faster due to the increased parallelism); then merge
> > the
> > > >> > > >> subvertices
> > > >> > > >>> into the initial vertex.
> > > >> > > >>>
> > > >> > > >>> Here's part of my -unrevised- abstract:
> > > >> > > >>> "The Twitter follower graph, the citation network and
> general
> > > >> purpose
> > > >> > > >>> social networks follow a power law degree distribution.
> These
> > > >> highly
> > > >> > > >> skewed
> > > >> > > >>> graphs raise challenges to current computation models which
> > > >> uniformly
> > > >> > > >>> process vertices, leading to communication overhead or large
> > > >> > execution
> > > >> > > >> time
> > > >> > > >>> stalls.
> > > >> > > >>>
> > > >> > > >>> In spite of efforts made to scale up computation on natural
> > > graphs
> > > >> > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many
> > performance
> > > >> > > problems
> > > >> > > >>> raise from a subset of vertices that have high degrees. In
> > this
> > > >> > thesis,
> > > >> > > >> we
> > > >> > > >>> outline the main processing issues that arise in current
> graph
> > > >> > systems.
> > > >> > > >> We
> > > >> > > >>> then propose a novel node splitting technique meant to
> > > >> differentiate
> > > >> > > the
> > > >> > > >>> computation as well as partitioning strategies for low and
> > high
> > > >> > degree
> > > >> > > >>> nodes.
> > > >> > > >>>
> > > >> > > >>> The “Node Splitting” abstraction is implemented as a
> separate
> > > set
> > > >> of
> > > >> > > >>> operators that can be seamlessly combined with current
> > > >> > degree-oblivious
> > > >> > > >>> models such as Pregel and PowerGraph, but also with
> > degree-aware
> > > >> > > engines
> > > >> > > >>> such as Pregel+ and PowerLyra. Our solution introduces
> minimal
> > > >> > overhead
> > > >> > > >> in
> > > >> > > >>> the absence of skew and improves the naive implementation
> of a
> > > >> subset
> > > >> > > of
> > > >> > > >>> computational intensive algorithms by a factor of two.
> > > >> > > >>>
> > > >> > > >>> These results have been proven on a theoretical and
> practical
> > > >> level
> > > >> > on
> > > >> > > >> both
> > > >> > > >>> real world and synthetic graphs."
> > > >> > > >>>
> > > >> > > >>> What I would add:
> > > >> > > >>> 1) the operators per se in a separate package
> > > >> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity
> > and
> > > >> > apply
> > > >> > > >>> these operators on it)
> > > >> > > >>> 3). tests
> > > >> > > >>> 4). a separate section in the gelly docs that explains how
> to
> > > >> modify
> > > >> > > your
> > > >> > > >>> code to use this solution.
> > > >> > > >>>
> > > >> > > >>> Needless to say I will maintain the code and, if, as we get
> > more
> > > >> > users,
> > > >> > > >>> they deem this sub-API useless, we can always deprecate and
> > > >> delete it
> > > >> > > :).
> > > >> > > >>>
> > > >> > > >>> So, what do you think?
> > > >> > > >>> Andra
> > > >> > > >>>
> > > >> > > >>
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Reply via email to