Thanks for your interest. I should point out that the numbers in the arXiv paper are from GraphX running on top of a custom version of Spark with an experimental in-memory shuffle prototype. As a result, if you benchmark GraphX at the current master, expect it to be 2-3x slower than GraphLab.
The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're interested in benchmarking it.

On Fri, Jul 18, 2014 at 7:14 PM, ShreyanshB <shreyanshpbh...@gmail.com> wrote:

> Should I use the pagerank application already available in graphx for
> this purpose, or do I need to modify it or write my own?

You should use the built-in PageRank. If your graph is available in edge list format, you can run it using the Analytics driver as follows:

~/spark/bin/spark-submit --master spark://$MASTER_URL:7077 \
  --class org.apache.spark.graphx.lib.Analytics \
  ~/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar \
  pagerank $EDGE_FILE --numEPart=$NUM_PARTITIONS --numIter=$NUM_ITERATIONS \
  [--partStrategy=$PARTITION_STRATEGY]

> What should be the executor_memory, i.e. maximum or according to graph
> size?

As much memory as possible while leaving room for the operating system.

> Is there any other configuration I should do to have the best
> performance?

I think the parameters to Analytics above should be sufficient:

- numEPart - should be equal to or a small integer multiple of the number of cores. More partitions improve work balance but also increase memory usage and communication, so in some cases it can even be faster to use fewer partitions than cores.

- partStrategy - If your edges are already sorted, you can skip this option, because GraphX will leave them as-is by default, and that may be close to optimal. Otherwise, EdgePartition2D and RandomVertexCut are both worth trying.

CC'ing Joey and Dan, who may have other suggestions.

Ankur <http://www.ankurdave.com/>

On Fri, Jul 18, 2014 at 7:14 PM, ShreyanshB <shreyanshpbh...@gmail.com> wrote:

> Hi,
>
> I am trying to compare GraphX and other distributed graph processing
> systems (GraphLab) on my cluster of 64 nodes, each node having 32 cores
> and connected with InfiniBand.
>
> I looked at http://arxiv.org/pdf/1402.2394.pdf and the stats provided
> there. I had a few questions regarding configuration and achieving the
> best performance.
>
> * Should I use the pagerank application already available in graphx for
>   this purpose, or do I need to modify it or write my own?
>   - If I shouldn't use the inbuilt pagerank, can you share your
>     pagerank application?
>
> * What should be the executor_memory, i.e. maximum or according to
>   graph size?
>
> * Other than number of cores, executor_memory, and partition strategy,
>   is there any other configuration I should do to have the best
>   performance?
>
> I am using the following script:
>
> import org.apache.spark._
> import org.apache.spark.graphx._
> import org.apache.spark.rdd.RDD
>
> // Time the graph load in milliseconds.
> val startgraphloading = System.currentTimeMillis
> val graph = GraphLoader.edgeListFile(sc, "filepath", true, 32)
> val endgraphloading = System.currentTimeMillis
>
> Thanks in advance :)
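P.S. In case a spark-shell version is useful for your timing runs, here is a rough sketch (untested; the HDFS path, iteration count, and partition count are placeholders you would adjust for your 64-node cluster) that ties the knobs above together: edge partitions at load time, an explicit partition strategy, and timing that excludes graph loading:

import org.apache.spark.graphx._

// One edge partition per core (64 nodes x 32 cores); tune as discussed above.
val numEPart = 64 * 32
val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt",
    canonicalOrientation = true, minEdgePartitions = numEPart)
  .partitionBy(PartitionStrategy.EdgePartition2D) // skip if edges are pre-sorted
  .cache()

// Materialize the graph first so the timer below measures only PageRank.
graph.edges.count()

val start = System.currentTimeMillis
val ranks = graph.staticPageRank(numIter = 10).vertices
ranks.count() // force evaluation
println(s"PageRank took ${System.currentTimeMillis - start} ms")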