Dear Andra, The idea seems pretty nice.
I wonder how you decide the threshold to separate the high degree vertices from the low degree vertices. Regards, Samia On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <lungu.an...@gmail.com> wrote: > Hi Paris, > > Nice to virtually meet you too :) > > Maybe it makes sense to share my freshest chart: > > https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing > > This is for the Community Detection algorithm [1] in which you basically > find communities by continuously rescoring vertices. In a vertex-centric > scenario (for a skewed graph, in this case, the Twitter follower network), > since all vertices need to sync after an iteration superstep, they will > remain idle until a high degree vertex terminates (computing an aggregate > over my followers takes significantly less time than computing one over > Obama's followers). > > What you have on the right is the time elapsed in the regular vertex > centric Community Detection (more than two hours). > And since we split the node in aggregation trees, in the left there is a > chart with the various levels. Level 3 produces the best result. > > It's a bit cumbersome to put months of work in a mail, I am still polishing > the documentation, but feel free to ask questions if this is still unclear. > > > All in all, the technique is best for linear algorithms that work just in > Pregel, for a skewed graph. On a high level, it's a generalisation of > PowerGraph's mirroring, that can work on any system, and that does not > restrict itself to level = 1 (like PowerGraph does) which in the case of > out chart there did not bring us too far. > > Hope that clarified things a bit more! > Andra > > [1] http://arxiv.org/pdf/0808.2633.pdf > > > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <par...@kth.se> wrote: > > > Hi Andra and nice to meet you btw :) > > > > It sounds like very fancy way to deal with skew, I like the idea even > > though I am not a graph analytics expert. > > Have you ran any experiments or benchmarks to see when this preferable ? > > Users should be aware when they will get benefits by using it since node > > splitting doesn’t come with no cost I guess. > > I am really eager to see how this will evolve, I think it’s good effort. > > > > cheers > > Paris > > > > > > > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.an...@gmail.com> wrote: > > > > > > Hi Vasia, > > > > > > I shall polish the functions a bit, but this is more or less what I had > > in > > > mind: > > > GSA Jaccard [what we have in Gelly right now]: > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > > > The same version with node split: > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > > > > > Yes, sure I can also share some charts with the results. I could say > for > > > which data sets (power law) and types of algorithms this can be used. > If > > > it's used appropriately, the overhead is 0. > > > > > > I'm open to suggestions! > > > Thanks! > > > Andra > > > > > > > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > > vasilikikala...@gmail.com > > >> wrote: > > > > > >> Hi Andra, > > >> > > >> thanks for offering to add this work to Gelly and for starting the > > >> discussion! > > >> > > >> How do you think this would look like from an API point of view? Is it > > easy > > >> to make it transparent to the application? Could you give us a simple > > >> example of what you have in mind? > > >> > > >> Apart from usability, we should also consider documenting when to use > > this > > >> feature, i.e. for which algorithms, what kind degree distribution and > > what > > >> overhead to expect (if any). > > >> > > >> Cheers, > > >> Vasia. > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <lungu.an...@gmail.com> > wrote: > > >> > > >>> Hey, > > >>> > > >>> Before actually opening a PR, I wanted to hear your opinion. So, here > > >> goes > > >>> nothing :). > > >>> > > >>> I'd like to add the core of my master thesis to Gelly. That is, a > > series > > >> of > > >>> operators that take a skewed graph, split its high degree vertices > into > > >>> subvertices and redistribute the edges accordingly (thus producing > the > > >>> result faster due to the increased parallelism); then merge the > > >> subvertices > > >>> into the initial vertex. > > >>> > > >>> Here's part of my -unrevised- abstract: > > >>> "The Twitter follower graph, the citation network and general purpose > > >>> social networks follow a power law degree distribution. These highly > > >> skewed > > >>> graphs raise challenges to current computation models which uniformly > > >>> process vertices, leading to communication overhead or large > execution > > >> time > > >>> stalls. > > >>> > > >>> In spite of efforts made to scale up computation on natural graphs > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance > > problems > > >>> raise from a subset of vertices that have high degrees. In this > thesis, > > >> we > > >>> outline the main processing issues that arise in current graph > systems. > > >> We > > >>> then propose a novel node splitting technique meant to differentiate > > the > > >>> computation as well as partitioning strategies for low and high > degree > > >>> nodes. > > >>> > > >>> The “Node Splitting” abstraction is implemented as a separate set of > > >>> operators that can be seamlessly combined with current > degree-oblivious > > >>> models such as Pregel and PowerGraph, but also with degree-aware > > engines > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal > overhead > > >> in > > >>> the absence of skew and improves the naive implementation of a subset > > of > > >>> computational intensive algorithms by a factor of two. > > >>> > > >>> These results have been proven on a theoretical and practical level > on > > >> both > > >>> real world and synthetic graphs." > > >>> > > >>> What I would add: > > >>> 1) the operators per se in a separate package > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and > apply > > >>> these operators on it) > > >>> 3). tests > > >>> 4). a separate section in the gelly docs that explains how to modify > > your > > >>> code to use this solution. > > >>> > > >>> Needless to say I will maintain the code and, if, as we get more > users, > > >>> they deem this sub-API useless, we can always deprecate and delete it > > :). > > >>> > > >>> So, what do you think? > > >>> Andra > > >>> > > >> > > > > >