Hi Ryan, If you can get past the paperwork, I'm sure this can make a great Spark Package (http://spark-packages.org). People then can use it for benchmarking purposes, and I'm sure people will be looking for graph generators!
Best, Burak On Wed, Jun 24, 2015 at 7:55 AM, Carr, J. Ryan <[email protected]> wrote: > Hi Spark Devs, > > As part of a project at work, I have written a graph generator for > RMAT graphs consistent with the specifications in the Graph 500 benchmark ( > http://www.graph500.org/specifications). We had originally planned to use > the rmatGenerator function in GraphGenerators, but found that it wasn’t > suitable for generating graphs with billions of edges; the edges are > generated in a single thread and stored in a Set, meaning it can’t generate > a graph larger than memory on a single JVM (and I think Sets are limited to > Int.MaxValue > elements anyway). > > The generator I have is essentially a more scalable version of > rmatGenerator. We have used it to generate a graph with 2^32 vertices and > 2^36 edges on our modestly-specced cluster of 16 machines. It seems like > other people interested in Spark might want to play with some large RMAT > graphs (or run the Graph 500 benchmark), so I would like to contribute my > generator. It does have some minor differences from the current > generator, though: > > 1. Vertex IDs are shuffled after the graph structure is generated, so > the degree of a vertex cannot be predicted from its ID (without this step > vertex 0 would always have the largest degree, followed by vertices > 1,2,4,8, etc.). This is per the Graph 500 spec. It could be easily made > optional. > 2. Duplicate edges are not removed from the resulting graph. This > could easily be done with a call to distinct() on the resulting edge list, > but then there would be slightly fewer edges than one generated by the > current rmatGenerator. Also this process would be very slow on large graphs > due to skew. > 3. Doesn’t set the out degree as the vertex attribute. Again this > would be simple to add, but it could be slow on the super vertices. > > My question for the Spark Devs is: Is this something you would want as > part of GraphX (either as a replacement for the current rmatGenerator or a > separate function in GraphGenerators)? Since it was developed at work I > need to go through our legal department and QA processes to open-source it, > and to fill out the paperwork I need to know whether I’ll be submitting a > pull request or standing it up as a separate project on GitHub. > > Thanks! > > -Ryan > > -- > J. Ryan Carr, Ph. D. > > The Johns Hopkins University, Applied Physics Laboratory > 11100 Johns Hopkins Rd., Laurel, MD 20723 > Office: 240-228-9157 > Cell: 443-744-1004 > Email: *[email protected] <[email protected]>* or *[email protected] > <[email protected]>* > >
