I was looking through the GraphX source and noticed that the topology of an EdgePartition is a triplet of source, destination, and data columns -- essentially a COO sparse matrix -- sorted by source, and equipped with an index from each (global) vertex ID to the start of its (local) source cluster. This index provides efficient local neighborhood lookup.
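To make the two layouts concrete, here is a toy sketch in plain Python. It is not GraphX code, and every name in it (`edges`, `index`, `clusters`, `out_edges_coo`, `out_edges_csr`) is made up for illustration. It shows the source-sorted columns plus index described above, next to the packed variant asked about below, under the assumption that the index maps each source vertex ID to the first position of its cluster.

```python
# Hypothetical toy edge list: (src, dst, attr), already sorted by source.
edges = [
    (10, 20, "a"), (10, 30, "b"),
    (20, 30, "c"),
    (40, 10, "d"), (40, 20, "e"), (40, 30, "f"),
]

# COO-style layout: three parallel columns.
src = [s for s, _, _ in edges]
dst = [d for _, d, _ in edges]
data = [a for _, _, a in edges]

# Index from vertex ID to the start of its source cluster (assumed shape).
index = {}
for pos, s in enumerate(src):
    index.setdefault(s, pos)


def out_edges_coo(v):
    """Start at index[v] and scan until the source column changes."""
    pos = index.get(v)
    out = []
    while pos is not None and pos < len(src) and src[pos] == v:
        out.append((dst[pos], data[pos]))
        pos += 1
    return out


# CSR-style packing: one (start, length) entry per source cluster.
# The src column is no longer consulted by the lookup below.
clusters = {}
for pos, s in enumerate(src):
    start, length = clusters.get(s, (pos, 0))
    clusters[s] = (start, length + 1)


def out_edges_csr(v):
    """Slice the cluster directly from its stored start and length."""
    start, length = clusters.get(v, (0, 0))
    return list(zip(dst[start:start + length], data[start:start + length]))
```

Both lookups return the same neighborhoods; the packed variant simply never reads the `src` column, which is the redundancy the question is about.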
Given that the columns are source-sorted, is there a reason that the duplicate values in the source column are not efficiently packed, as in e.g. a CSR sparse matrix? That is, replace every source cluster with a single source value plus a length. Furthermore, these source values would duplicate the existing global2local index, so they can be removed entirely. This is a common optimization in sparse matrix systems and I recall (perhaps incorrectly) that GraphLab used this format -- is there a reason that GraphX does not?

-dwm

View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/GraphX-EdgePartition-format-tp15020.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.