I would love to get some feedback from the guys at data Artisans about this one. So far, the comments originated and spread in the Stockholm area :)
On Tue, Aug 11, 2015 at 6:33 PM, Andra Lungu <lungu.an...@gmail.com> wrote: > Hi Samia, > > A good method to statistically determine skewed vertices was beyond the > purpose of my thesis. Unfortunately, the statistical methods that fit a > power law distribution don't do a good job. So what I do is that I plot the > degree distribution and then visually determine the threshold. That is also > how other state of the art systems, such as PowerLyra, fix the issue. > > A way to automate this would be a nice future addition :). > > > On Tue, Aug 11, 2015 at 6:23 PM, Samia Khalid <samk.3...@gmail.com> wrote: > >> Dear Andra, >> >> The idea seems pretty nice. >> >> I wonder how you decide the threshold to separate the high degree vertices >> from the low degree vertices. >> >> Regards, >> Samia >> >> On Tue, Aug 11, 2015 at 3:41 PM, Andra Lungu <lungu.an...@gmail.com> >> wrote: >> >> > Hi Paris, >> > >> > Nice to virtually meet you too :) >> > >> > Maybe it makes sense to share my freshest chart: >> > >> > >> https://drive.google.com/file/d/0BwnaKJcSLc43Qm9fZV9RUE5zT1E/view?usp=sharing >> > >> > This is for the Community Detection algorithm [1] in which you basically >> > find communities by continuously rescoring vertices. In a vertex-centric >> > scenario (for a skewed graph, in this case, the Twitter follower >> network), >> > since all vertices need to sync after an iteration superstep, they will >> > remain idle until a high degree vertex terminates (computing an >> aggregate >> > over my followers takes significantly less time than computing one over >> > Obama's followers). >> > >> > What you have on the right is the time elapsed in the regular vertex >> > centric Community Detection (more than two hours). >> > And since we split the node in aggregation trees, in the left there is a >> > chart with the various levels. Level 3 produces the best result. >> > >> > It's a bit cumbersome to put months of work in a mail, I am still >> polishing >> > the documentation, but feel free to ask questions if this is still >> unclear. >> > >> > >> > All in all, the technique is best for linear algorithms that work just >> in >> > Pregel, for a skewed graph. On a high level, it's a generalisation of >> > PowerGraph's mirroring, that can work on any system, and that does not >> > restrict itself to level = 1 (like PowerGraph does) which in the case of >> > out chart there did not bring us too far. >> > >> > Hope that clarified things a bit more! >> > Andra >> > >> > [1] http://arxiv.org/pdf/0808.2633.pdf >> > >> > >> > On Tue, Aug 11, 2015 at 3:20 PM, Paris Carbone <par...@kth.se> wrote: >> > >> > > Hi Andra and nice to meet you btw :) >> > > >> > > It sounds like very fancy way to deal with skew, I like the idea even >> > > though I am not a graph analytics expert. >> > > Have you ran any experiments or benchmarks to see when this >> preferable ? >> > > Users should be aware when they will get benefits by using it since >> node >> > > splitting doesn’t come with no cost I guess. >> > > I am really eager to see how this will evolve, I think it’s good >> effort. >> > > >> > > cheers >> > > Paris >> > > >> > > >> > > > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.an...@gmail.com> >> wrote: >> > > > >> > > > Hi Vasia, >> > > > >> > > > I shall polish the functions a bit, but this is more or less what I >> had >> > > in >> > > > mind: >> > > > GSA Jaccard [what we have in Gelly right now]: >> > > > >> > > >> > >> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java >> > > > The same version with node split: >> > > > >> > > >> > >> https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java >> > > > >> > > > Yes, sure I can also share some charts with the results. I could say >> > for >> > > > which data sets (power law) and types of algorithms this can be >> used. >> > If >> > > > it's used appropriately, the overhead is 0. >> > > > >> > > > I'm open to suggestions! >> > > > Thanks! >> > > > Andra >> > > > >> > > > >> > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri < >> > > vasilikikala...@gmail.com >> > > >> wrote: >> > > > >> > > >> Hi Andra, >> > > >> >> > > >> thanks for offering to add this work to Gelly and for starting the >> > > >> discussion! >> > > >> >> > > >> How do you think this would look like from an API point of view? >> Is it >> > > easy >> > > >> to make it transparent to the application? Could you give us a >> simple >> > > >> example of what you have in mind? >> > > >> >> > > >> Apart from usability, we should also consider documenting when to >> use >> > > this >> > > >> feature, i.e. for which algorithms, what kind degree distribution >> and >> > > what >> > > >> overhead to expect (if any). >> > > >> >> > > >> Cheers, >> > > >> Vasia. >> > > >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <lungu.an...@gmail.com> >> > wrote: >> > > >> >> > > >>> Hey, >> > > >>> >> > > >>> Before actually opening a PR, I wanted to hear your opinion. So, >> here >> > > >> goes >> > > >>> nothing :). >> > > >>> >> > > >>> I'd like to add the core of my master thesis to Gelly. That is, a >> > > series >> > > >> of >> > > >>> operators that take a skewed graph, split its high degree vertices >> > into >> > > >>> subvertices and redistribute the edges accordingly (thus producing >> > the >> > > >>> result faster due to the increased parallelism); then merge the >> > > >> subvertices >> > > >>> into the initial vertex. >> > > >>> >> > > >>> Here's part of my -unrevised- abstract: >> > > >>> "The Twitter follower graph, the citation network and general >> purpose >> > > >>> social networks follow a power law degree distribution. These >> highly >> > > >> skewed >> > > >>> graphs raise challenges to current computation models which >> uniformly >> > > >>> process vertices, leading to communication overhead or large >> > execution >> > > >> time >> > > >>> stalls. >> > > >>> >> > > >>> In spite of efforts made to scale up computation on natural graphs >> > > >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance >> > > problems >> > > >>> raise from a subset of vertices that have high degrees. In this >> > thesis, >> > > >> we >> > > >>> outline the main processing issues that arise in current graph >> > systems. >> > > >> We >> > > >>> then propose a novel node splitting technique meant to >> differentiate >> > > the >> > > >>> computation as well as partitioning strategies for low and high >> > degree >> > > >>> nodes. >> > > >>> >> > > >>> The “Node Splitting” abstraction is implemented as a separate set >> of >> > > >>> operators that can be seamlessly combined with current >> > degree-oblivious >> > > >>> models such as Pregel and PowerGraph, but also with degree-aware >> > > engines >> > > >>> such as Pregel+ and PowerLyra. Our solution introduces minimal >> > overhead >> > > >> in >> > > >>> the absence of skew and improves the naive implementation of a >> subset >> > > of >> > > >>> computational intensive algorithms by a factor of two. >> > > >>> >> > > >>> These results have been proven on a theoretical and practical >> level >> > on >> > > >> both >> > > >>> real world and synthetic graphs." >> > > >>> >> > > >>> What I would add: >> > > >>> 1) the operators per se in a separate package >> > > >>> 2) a slightly modified example (e.g. take Jaccard Similarity and >> > apply >> > > >>> these operators on it) >> > > >>> 3). tests >> > > >>> 4). a separate section in the gelly docs that explains how to >> modify >> > > your >> > > >>> code to use this solution. >> > > >>> >> > > >>> Needless to say I will maintain the code and, if, as we get more >> > users, >> > > >>> they deem this sub-API useless, we can always deprecate and >> delete it >> > > :). >> > > >>> >> > > >>> So, what do you think? >> > > >>> Andra >> > > >>> >> > > >> >> > > >> > > >> > >> > >