Hi Andra and nice to meet you btw :) It sounds like very fancy way to deal with skew, I like the idea even though I am not a graph analytics expert. Have you ran any experiments or benchmarks to see when this preferable ? Users should be aware when they will get benefits by using it since node splitting doesn’t come with no cost I guess. I am really eager to see how this will evolve, I think it’s good effort.
cheers Paris > On 11 Aug 2015, at 14:58, Andra Lungu <lungu.an...@gmail.com> wrote: > > Hi Vasia, > > I shall polish the functions a bit, but this is more or less what I had in > mind: > GSA Jaccard [what we have in Gelly right now]: > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSAJaccardSimilarityMeasure.java > The same version with node split: > https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java > > Yes, sure I can also share some charts with the results. I could say for > which data sets (power law) and types of algorithms this can be used. If > it's used appropriately, the overhead is 0. > > I'm open to suggestions! > Thanks! > Andra > > > On Tue, Aug 11, 2015 at 2:45 PM, Vasiliki Kalavri <vasilikikala...@gmail.com >> wrote: > >> Hi Andra, >> >> thanks for offering to add this work to Gelly and for starting the >> discussion! >> >> How do you think this would look like from an API point of view? Is it easy >> to make it transparent to the application? Could you give us a simple >> example of what you have in mind? >> >> Apart from usability, we should also consider documenting when to use this >> feature, i.e. for which algorithms, what kind degree distribution and what >> overhead to expect (if any). >> >> Cheers, >> Vasia. >> On Aug 10, 2015 10:47 AM, "Andra Lungu" <lungu.an...@gmail.com> wrote: >> >>> Hey, >>> >>> Before actually opening a PR, I wanted to hear your opinion. So, here >> goes >>> nothing :). >>> >>> I'd like to add the core of my master thesis to Gelly. That is, a series >> of >>> operators that take a skewed graph, split its high degree vertices into >>> subvertices and redistribute the edges accordingly (thus producing the >>> result faster due to the increased parallelism); then merge the >> subvertices >>> into the initial vertex. >>> >>> Here's part of my -unrevised- abstract: >>> "The Twitter follower graph, the citation network and general purpose >>> social networks follow a power law degree distribution. These highly >> skewed >>> graphs raise challenges to current computation models which uniformly >>> process vertices, leading to communication overhead or large execution >> time >>> stalls. >>> >>> In spite of efforts made to scale up computation on natural graphs >>> (PowerGraph ’s Gather - Apply - Scatter model), many performance problems >>> raise from a subset of vertices that have high degrees. In this thesis, >> we >>> outline the main processing issues that arise in current graph systems. >> We >>> then propose a novel node splitting technique meant to differentiate the >>> computation as well as partitioning strategies for low and high degree >>> nodes. >>> >>> The “Node Splitting” abstraction is implemented as a separate set of >>> operators that can be seamlessly combined with current degree-oblivious >>> models such as Pregel and PowerGraph, but also with degree-aware engines >>> such as Pregel+ and PowerLyra. Our solution introduces minimal overhead >> in >>> the absence of skew and improves the naive implementation of a subset of >>> computational intensive algorithms by a factor of two. >>> >>> These results have been proven on a theoretical and practical level on >> both >>> real world and synthetic graphs." >>> >>> What I would add: >>> 1) the operators per se in a separate package >>> 2) a slightly modified example (e.g. take Jaccard Similarity and apply >>> these operators on it) >>> 3). tests >>> 4). a separate section in the gelly docs that explains how to modify your >>> code to use this solution. >>> >>> Needless to say I will maintain the code and, if, as we get more users, >>> they deem this sub-API useless, we can always deprecate and delete it :). >>> >>> So, what do you think? >>> Andra >>> >>