GraphX EdgePartition format

Daniel Margo Fri, 06 Nov 2015 03:35:37 -0800

I was looking through the GraphX source and noticed that the topology of an
EdgePartition is a triplet of source, destination, and data columns --
essentially a COO sparse matrix -- sorted by source, and equipped with an
index from each (global) vertex ID to the start of its (local) source
cluster. This index provides efficient local neighborhood lookup.
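
To make the layout described above concrete, here is a minimal sketch in Python. It is not GraphX's actual code: the function names, the column names, and the use of a plain dict for the index are all assumptions made for illustration, and the global-to-local ID translation is left out for brevity.

```python
# Illustrative sketch only; all names here are hypothetical, not GraphX fields.

def build_coo(edges):
    """Build source-sorted COO columns from (src, dst, data) triples, plus an
    index mapping each source ID to the start of its cluster."""
    ordered = sorted(edges, key=lambda e: e[0])  # sort by source (stable)
    src = [e[0] for e in ordered]
    dst = [e[1] for e in ordered]
    data = [e[2] for e in ordered]
    index = {}
    for pos, s in enumerate(src):
        index.setdefault(s, pos)  # record only the first position per source
    return src, dst, data, index


def out_edges(src, dst, data, index, vertex):
    """Jump to the vertex's cluster, then scan until the source value changes."""
    pos = index.get(vertex)
    result = []
    while pos is not None and pos < len(src) and src[pos] == vertex:
        result.append((dst[pos], data[pos]))
        pos += 1
    return result


edges = [(7, 2, "a"), (3, 7, "b"), (7, 9, "c"), (3, 1, "d"), (5, 3, "e")]
src, dst, data, index = build_coo(edges)
```

Note that in this sketch the scan still reads the source column to find where a cluster ends, and every edge stores its own copy of the source ID.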


Given that the columns are source-sorted, is there a reason that the
duplicate values in the source column are not compressed, as in, e.g., a CSR
sparse matrix? That is, replace each source cluster with a single source
value plus a length. Furthermore, these source values would duplicate the
keys of the existing global2local index, so the source column could be
removed entirely.
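
A rough sketch of the packing proposed here, again in Python and again under assumptions: `pack_sources` and `out_edges_packed` are hypothetical names, and a dict stands in for the index. Each source maps to a (start, length) pair, after which the per-edge source column is no longer needed for neighborhood lookup.

```python
# Illustrative sketch only; all names here are hypothetical, not GraphX fields.

def pack_sources(src):
    """Collapse a source-sorted column into {source: (start, length)}."""
    clusters = {}
    for pos, s in enumerate(src):
        start, length = clusters.get(s, (pos, 0))
        clusters[s] = (start, length + 1)
    return clusters


def out_edges_packed(clusters, dst, data, vertex):
    """Look up a neighborhood using only the packed index, no source column."""
    start, length = clusters.get(vertex, (0, 0))
    end = start + length
    return list(zip(dst[start:end], data[start:end]))


# Source-sorted columns for five edges; src is only needed to build the index.
src = [3, 3, 5, 7, 7]
dst = [7, 1, 3, 2, 9]
data = ["b", "d", "e", "a", "c"]
clusters = pack_sources(src)
```

In this sketch the index holds one entry per distinct source rather than one source value per edge. One visible trade-off is that recovering the source of an arbitrary edge position now requires consulting the cluster boundaries.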

This is a common optimization in sparse matrix systems and I recall (perhaps
incorrectly) that GraphLab used this format -- is there a reason that GraphX
does not?
-dwm





