At 2014-08-04 20:52:26 +0800, Bin <wubin_phi...@126.com> wrote:
> I wonder how Spark parameters, e.g., number of parallelism, affect Pregel
> performance? Specifically, sendmessage, mergemessage, and vertexprogram?
>
> I have tried label propagation on a 300,000 edges graph, and I found that no
> parallelism is much faster than 5 or 500 parallelism.

Increasing the level of parallelism will increase storage overhead (because each vertex will need to be replicated to more edge partitions to form the triplets) and will also increase communication. Unless there's something to be gained from higher parallelism, this will worsen performance. Additionally, going from no parallelism to some parallelism will incur the extra cost of task communication via shuffles.

Parallelism has two benefits: it allows edge scans and aggregations to proceed in parallel, and it enables the graph to be stored across many machines.

For small graphs, the slight performance gain due to parallelism is vastly outweighed by the cost of inter-process communication and shuffling to disk, and distributed storage is not necessary since the graph fits on a single machine. There are single-machine graph processing systems such as X-Stream [1] and GraphChi [2] that optimize performance for these kinds of graphs.

However, parallelism becomes necessary for larger graphs with hundreds of millions of edges or large amounts of associated vertex and edge data. GraphX is designed for this scale of data.

Ankur

[1] http://infoscience.epfl.ch/record/188535/files/paper.pdf
[2] http://graphlab.org/projects/graphchi.html
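
To make the replication cost mentioned above concrete, here is a rough sketch in plain Python (not GraphX code) that counts, for a given number of edge partitions, how many partitions each vertex ends up in on average. The function names, the random graph, and the round-robin edge placement are all assumptions made for illustration; GraphX's actual placement depends on the partition strategy in use.

```python
import random


def replication_factor(edges, num_partitions):
    """Average number of edge partitions in which a vertex appears."""
    partitions_of = {}
    for index, (src, dst) in enumerate(edges):
        # Assumption: edges are dealt out round-robin. This is a stand-in
        # for illustration, not the placement GraphX itself uses.
        part = index % num_partitions
        partitions_of.setdefault(src, set()).add(part)
        partitions_of.setdefault(dst, set()).add(part)
    return sum(len(parts) for parts in partitions_of.values()) / len(partitions_of)


def random_graph(num_vertices, num_edges, seed=0):
    """Hypothetical test graph with uniformly random endpoints."""
    rng = random.Random(seed)
    return [(rng.randrange(num_vertices), rng.randrange(num_vertices))
            for _ in range(num_edges)]


if __name__ == "__main__":
    # 300,000 edges as in the question; the vertex count is made up.
    edges = random_graph(50_000, 300_000)
    for parts in (1, 5, 500):
        print(parts, round(replication_factor(edges, parts), 2))
```

With a single partition every vertex is stored exactly once, so the factor is 1.0. As the partition count grows, each vertex is copied into more partitions, and every copy has to receive updated vertex data on each Pregel iteration, which is where the extra storage and communication come from.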