Hi Samia, A good method to statistically determine skewed vertices was beyond the purpose of my thesis. Unfortunately, the statistical methods that fit a power law distribution don't do a good job. So what I do is that I plot the degree distribution and then visually determine the threshold. That is also how other state of the art systems, such as PowerLyra, fix the issue.
A way to automate this would be a nice future addition :). On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <samk.3...@gmail.com> wrote: > Dear Andra, > > The idea seems pretty nice. > > I wonder how you decide the threshold to separate the high degree vertices > from the low degree vertices. > > Regards, > Samia > > On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <lungu.an...@gmail.com> > wrote: > > > Hi Paris, > > > > Nice to virtually meet you too :) > > > > Maybe it makes sense to share my freshest chart: > > > > > https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing > > > > This is for the Community Detection algorithm [1] in which you basically > > find communities by continuously rescoring vertices. In a vertex-centric > > scenario (for a skewed graph, in this case, the Twitter follower > network), > > since all vertices need to sync after an iteration superstep, they will > > remain idle until a high degree vertex terminates (computing an aggregate > > over my followers takes significantly less time than computing one over > > Obama's followers). > > > > What you have on the right is the time elapsed in the regular vertex > > centric Community Detection (more than two hours). > > And since we split the node in aggregation trees, in the left there is a > > chart with the various levels. Level 3 produces the best result. > > > > It's a bit cumbersome to put months of work in a mail, I am still > polishing > > the documentation, but feel free to ask questions if this is still > unclear. > > > > > > All in all, the technique is best for linear algorithms that work just in > > Pregel, for a skewed graph. On a high level, it's a generalisation of > > PowerGraph's mirroring, that can work on any system, and that does not > > restrict itself to level = 1 (like PowerGraph does) which in the case of > > out chart there did not bring us too far. > > > > Hope that clarified things a bit more! > > Andra > > > > [1] http://arxiv.org/pdf/0808.2633.pdf > > > > > > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <par...@kth.se> wrote: > > > > > Hi Andra and nice to meet you btw :) > > > > > > It sounds like very fancy way to deal with skew, I like the idea even > > > though I am not a graph analytics expert. > > > Have you ran any experiments or benchmarks to see when this preferable > ? > > > Users should be aware when they will get benefits by using it since > node > > > splitting doesn’t come with no cost I guess. > > > I am really eager to see how this will evolve, I think it’s good > effort. > > > > > > cheers > > > Paris > > > > > > > > > > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.an...@gmail.com> wrote: > > > > > > > > Hi Vasia, > > > > > > > > I shall polish the functions a bit, but this is more or less what I > had > > > in > > > > mind: > > > > GSA Jaccard [what we have in Gelly right now]: > > > > > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > > > > The same version with node split: > > > > > > > > > > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > > > > > > > Yes, sure I can also share some charts with the results. I could say > > for > > > > which data sets (power law) and types of algorithms this can be used. > > If > > > > it's used appropriately, the overhead is 0. > > > > > > > > I'm open to suggestions! > > > > Thanks! > > > > Andra > > > > > > > > > > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < > > > vasilikikala...@gmail.com > > > >> wrote: > > > > > > > >> Hi Andra, > > > >> > > > >> thanks for offering to add this work to Gelly and for starting the > > > >> discussion! > > > >> > > > >> How do you think this would look like from an API point of view? Is > it > > > easy > > > >> to make it transparent to the application? Could you give us a > simple > > > >> example of what you have in mind? > > > >> > > > >> Apart from usability, we should also consider documenting when to > use > > > this > > > >> feature, i.e. for which algorithms, what kind degree distribution > and > > > what > > > >> overhead to expect (if any). > > > >> > > > >> Cheers, > > > >> Vasia. > > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <lungu.an...@gmail.com> > > wrote: > > > >> > > > >>> Hey, > > > >>> > > > >>> Before actually opening a PR, I wanted to hear your opinion. So, > here > > > >> goes > > > >>> nothing :). > > > >>> > > > >>> I'd like to add the core of my master thesis to Gelly. That is, a > > > series > > > >> of > > > >>> operators that take a skewed graph, split its high degree vertices > > into > > > >>> subvertices and redistribute the edges accordingly (thus producing > > the > > > >>> result faster due to the increased parallelism); then merge the > > > >> subvertices > > > >>> into the initial vertex. > > > >>> > > > >>> Here's part of my -unrevised- abstract: > > > >>> "The Twitter follower graph, the citation network and general > purpose > > > >>> social networks follow a power law degree distribution. These > highly > > > >> skewed > > > >>> graphs raise challenges to current computation models which > uniformly > > > >>> process vertices, leading to communication overhead or large > > execution > > > >> time > > > >>> stalls. > > > >>> > > > >>> In spite of efforts made to scale up computation on natural graphs > > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance > > > problems > > > >>> raise from a subset of vertices that have high degrees. In this > > thesis, > > > >> we > > > >>> outline the main processing issues that arise in current graph > > systems. > > > >> We > > > >>> then propose a novel node splitting technique meant to > differentiate > > > the > > > >>> computation as well as partitioning strategies for low and high > > degree > > > >>> nodes. > > > >>> > > > >>> The “Node Splitting” abstraction is implemented as a separate set > of > > > >>> operators that can be seamlessly combined with current > > degree-oblivious > > > >>> models such as Pregel and PowerGraph, but also with degree-aware > > > engines > > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal > > overhead > > > >> in > > > >>> the absence of skew and improves the naive implementation of a > subset > > > of > > > >>> computational intensive algorithms by a factor of two. > > > >>> > > > >>> These results have been proven on a theoretical and practical level > > on > > > >> both > > > >>> real world and synthetic graphs." > > > >>> > > > >>> What I would add: > > > >>> 1) the operators per se in a separate package > > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and > > apply > > > >>> these operators on it) > > > >>> 3). tests > > > >>> 4). a separate section in the gelly docs that explains how to > modify > > > your > > > >>> code to use this solution. > > > >>> > > > >>> Needless to say I will maintain the code and, if, as we get more > > users, > > > >>> they deem this sub-API useless, we can always deprecate and delete > it > > > :). > > > >>> > > > >>> So, what do you think? > > > >>> Andra > > > >>> > > > >> > > > > > > > > >