Re: [Proposal] Addition to Gelly

Samia Khalid Tue, 11 Aug 2015 09:23:51 -0700

Dear Andra,

The idea seems pretty nice.


I wonder how you decide the threshold to separate the high degree vertices
from the low degree vertices.

Regards,
Samia

On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <[email protected]> wrote:

> Hi Paris,
>
> Nice to virtually meet you too :)
>
> Maybe it makes sense to share my freshest chart:
>
> https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing
>
> This is for the Community Detection algorithm [1] in which you basically
> find communities by continuously rescoring vertices. In a vertex-centric
> scenario (for a skewed graph, in this case, the Twitter follower network),
> since all vertices need to sync after an iteration superstep, they will
> remain idle until a high degree vertex terminates (computing an aggregate
> over my followers takes significantly less time than computing one over
> Obama's followers).
>
> What you have on the right is the time elapsed in the regular vertex
> centric Community Detection (more than two hours).
> And since we split the node in aggregation trees, in the left there is a
> chart with the various levels. Level 3 produces the best result.
>
> It's a bit cumbersome to put months of work in a mail, I am still polishing
> the documentation, but feel free to ask questions if this is still unclear.
>
>
> All in all, the technique is best for linear algorithms that work just in
> Pregel, for a skewed graph. On a high level, it's a generalisation of
> PowerGraph's mirroring, that can work on any system, and that does not
> restrict itself to level = 1 (like PowerGraph does) which in the case of
> out chart there did not bring us too far.
>
> Hope that clarified things a bit more!
> Andra
>
> [1] http://arxiv.org/pdf/0808.2633.pdf
>
>
> On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <[email protected]> wrote:
>
> > Hi Andra and nice to meet you btw :)
> >
> > It sounds like very fancy way to deal with skew, I like the idea even
> > though I am not a graph analytics expert.
> > Have you ran any experiments or benchmarks to see when this preferable ?
> > Users should be aware when they will get benefits by using it since node
> > splitting doesn’t come with no cost I guess.
> > I am really eager to see how this will evolve, I think it’s good effort.
> >
> > cheers
> > Paris
> >
> >
> > > On 11 Aug 2015, at 14:58, Andra Lungu <[email protected]> wrote:
> > >
> > > Hi Vasia,
> > >
> > > I shall polish the functions a bit, but this is more or less what I had
> > in
> > > mind:
> > > GSA Jaccard [what we have in Gelly right now]:
> > >
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java
> > > The same version with node split:
> > >
> >
> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java
> > >
> > > Yes, sure I can also share some charts with the results. I could say
> for
> > > which data sets (power law) and types of algorithms this can be used.
> If
> > > it's used appropriately, the overhead is 0.
> > >
> > > I'm open to suggestions!
> > > Thanks!
> > > Andra
> > >
> > >
> > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <
> > [email protected]
> > >> wrote:
> > >
> > >> Hi Andra,
> > >>
> > >> thanks for offering to add this work to Gelly and for starting the
> > >> discussion!
> > >>
> > >> How do you think this would look like from an API point of view? Is it
> > easy
> > >> to make it transparent to the application? Could you give us a simple
> > >> example of what you have in mind?
> > >>
> > >> Apart from usability, we should also consider documenting when to use
> > this
> > >> feature, i.e. for which algorithms, what kind degree distribution and
> > what
> > >> overhead to expect (if any).
> > >>
> > >> Cheers,
> > >> Vasia.
> > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <[email protected]>
> wrote:
> > >>
> > >>> Hey,
> > >>>
> > >>> Before actually opening a PR, I wanted to hear your opinion. So, here
> > >> goes
> > >>> nothing :).
> > >>>
> > >>> I'd like to add the core of my master thesis to Gelly. That is, a
> > series
> > >> of
> > >>> operators that take a skewed graph, split its high degree vertices
> into
> > >>> subvertices and redistribute the edges accordingly (thus producing
> the
> > >>> result faster due to the increased parallelism); then merge the
> > >> subvertices
> > >>> into the initial vertex.
> > >>>
> > >>> Here's part of my -unrevised- abstract:
> > >>> "The Twitter follower graph, the citation network and general purpose
> > >>> social networks follow a power law degree distribution. These highly
> > >> skewed
> > >>> graphs raise challenges to current computation models which uniformly
> > >>> process vertices, leading to communication overhead or large
> execution
> > >> time
> > >>> stalls.
> > >>>
> > >>> In spite of efforts made to scale up computation on natural graphs
> > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance
> > problems
> > >>> raise from a subset of vertices that have high degrees. In this
> thesis,
> > >> we
> > >>> outline the main processing issues that arise in current graph
> systems.
> > >> We
> > >>> then propose a novel node splitting technique meant to differentiate
> > the
> > >>> computation as well as partitioning strategies for low and high
> degree
> > >>> nodes.
> > >>>
> > >>> The “Node Splitting” abstraction is implemented as a separate set of
> > >>> operators that can be seamlessly combined with current
> degree-oblivious
> > >>> models such as Pregel and PowerGraph, but also with degree-aware
> > engines
> > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal
> overhead
> > >> in
> > >>> the absence of skew and improves the naive implementation of a subset
> > of
> > >>> computational intensive algorithms by a factor of two.
> > >>>
> > >>> These results have been proven on a theoretical and practical level
> on
> > >> both
> > >>> real world and synthetic graphs."
> > >>>
> > >>> What I would add:
> > >>> 1) the operators per se in a separate package
> > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and
> apply
> > >>> these operators on it)
> > >>> 3). tests
> > >>> 4). a separate section in the gelly docs that explains how to modify
> > your
> > >>> code to use this solution.
> > >>>
> > >>> Needless to say I will maintain the code and, if, as we get more
> users,
> > >>> they deem this sub-API useless, we can always deprecate and delete it
> > :).
> > >>>
> > >>> So, what do you think?
> > >>> Andra
> > >>>
> > >>
> >
> >
>

Re: [Proposal] Addition to Gelly

Reply via email to