Hi,
I have been looking through the GraphX source code, dissecting the reason
for its high memory consumption compared to the on-disk size of the graph. I
have found that there may be room to reduce the memory footprint of the
graph structures. I think the biggest savings can come from the localSrcIds
and localDstIds in EdgePartitions. 

In particular, instead of storing both a source and a destination local ID for
each edge, we could store only the destination ID. For example, after sorting
edges by global source ID, we can map the source vertices to local IDs first,
followed by the remaining unmapped global destination IDs. This makes
localSrcIds a sorted sequence from 0 to n - 1, where n is the number of
distinct global source IDs. Then, instead of actually storing the local source
ID for each edge, we can store an array of size n, with each element holding an
index into localDstIds (a CSR-style offset array). From my understanding, this
would also eliminate the need to store a separate index for indexed scanning,
since each element of that array marks the start of a cluster. In extensive
testing, this change, combined with some delta-encoding strategies on
localDstIds and the mapping structures, reduced the memory consumption of the
graph by nearly half.
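To make the layout concrete, here is a minimal sketch of the encoding I have in mind. This is standalone illustrative code, not the actual GraphX EdgePartition API; the names (encode, neighbors, srcIds, offsets, dstIds) are my own, and the delta-encoding step is omitted for brevity:

```scala
// Sketch of a CSR-style edge layout: edges pre-sorted by source ID are
// encoded as (distinct source IDs, per-source offsets, destination IDs).
// The local ID of a source is simply its index in srcIds, so no per-edge
// localSrcIds array is needed.
object CsrSketch {
  // edges must be sorted by source ID (as described above).
  def encode(edges: Array[(Long, Long)]): (Array[Long], Array[Int], Array[Long]) = {
    val srcIds  = scala.collection.mutable.ArrayBuffer[Long]() // distinct sources; local ID = index
    val offsets = scala.collection.mutable.ArrayBuffer[Int]()  // start index into dstIds per source
    val dstIds  = new Array[Long](edges.length)
    var i = 0
    while (i < edges.length) {
      val (src, dst) = edges(i)
      if (srcIds.isEmpty || srcIds.last != src) {
        srcIds  += src // new cluster begins at edge i
        offsets += i
      }
      dstIds(i) = dst
      i += 1
    }
    (srcIds.toArray, offsets.toArray, dstIds)
  }

  // Indexed scan of one source's cluster: offsets(local) until offsets(local + 1)
  // (or the end of dstIds for the last source). No separate index is needed.
  def neighbors(local: Int, offsets: Array[Int], dstIds: Array[Long]): Array[Long] = {
    val end = if (local + 1 < offsets.length) offsets(local + 1) else dstIds.length
    dstIds.slice(offsets(local), end)
  }
}
```

For edges (1,2), (1,3), (5,2), (5,7), (5,9), this yields srcIds = [1, 5], offsets = [0, 2], and dstIds = [2, 3, 2, 7, 9]; per-edge source storage is replaced by the two small arrays.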

However, I am not entirely sure whether there is a reason to store both
localSrcIds and localDstIds for each edge with respect to integrating future
functionality, such as graph mutations. I noticed another post similar to this
one, but it had no replies.

The idea is quite similar to the Netflix graph library
<https://github.com/Netflix/netflix-graph>, and I would be happy to open a
JIRA on this issue with partial improvements. But I may not be completely
correct in my thinking!




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
