Actually, another improvement would be to use something like compressed sparse 
row (CSR) encoding, which can store A and A^T relatively efficiently (I think 
using 5 arrays instead of 6).  There is also the option of being more cache 
aware using something like a block compressed sparse row encoding.
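As a rough sketch of the CSR idea (plain Python for illustration, not GraphX's actual Scala internals): for a graph with n vertices and m edges, CSR keeps a row-pointer array of n+1 offsets plus a column-index array of m destination ids, and storing A^T as well just means building a second such pair from the reversed edges. The function names here are made up for the example.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (src, dst) pairs.

    row_ptr has num_vertices + 1 entries; the out-neighbors of vertex v
    occupy col_idx[row_ptr[v]:row_ptr[v + 1]].
    """
    # Count out-degree of each vertex, then prefix-sum into offsets.
    row_ptr = [0] * (num_vertices + 1)
    for src, _ in edges:
        row_ptr[src + 1] += 1
    for i in range(num_vertices):
        row_ptr[i + 1] += row_ptr[i]
    # Scatter destinations into place using per-row insertion cursors.
    col_idx = [0] * len(edges)
    next_slot = row_ptr[:-1].copy()
    for src, dst in edges:
        col_idx[next_slot[src]] = dst
        next_slot[src] += 1
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, v):
    """Out-neighbors of v: a contiguous slice, no hashing needed."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]
```

For A^T you would call `build_csr` again on `[(d, s) for s, d in edges]`; if edge attributes are kept in a separate array shared by both directions, that is roughly where the "5 arrays instead of 6" count comes from.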

Ankur also at some point looked at renumbering the vertex ids on each edge to 
be consecutive with respect to the edge partition.  This is a pretty common 
technique in most graph processing systems, but in our early experiments we 
didn't see much of a gain.  In theory, however, this should allow for 
lower-bit-precision encoding and a denser representation, and eliminate the 
need for a hash lookup when joining vertices with edges. 
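The renumbering idea, sketched in plain Python (hypothetical helper, not the GraphX implementation): map the global 64-bit ids appearing in a partition to consecutive local ids 0..k-1, so per-vertex state can live in dense arrays indexed by local id, and the local ids fit in fewer bits than the globals.

```python
def renumber(edges):
    """Map global vertex ids on a partition's edges to consecutive local
    ids 0..k-1, assigned in first-seen order.

    Returns the re-id'd edge list and the global -> local mapping (which
    the partition would keep around for joining back to vertex data).
    """
    local = {}
    out = []
    for src, dst in edges:
        for v in (src, dst):
            if v not in local:
                local[v] = len(local)  # next unused local id
        out.append((local[src], local[dst]))
    return out, local
```

After this step, even a partition whose global ids are scattered across a 64-bit space only needs ids up to k-1 locally, which is what opens the door to lower-precision encodings.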

Joey

> On Feb 23, 2016, at 1:13 PM, Adnan Haider <ahaid...@hawk.iit.edu> wrote:
> 
> Hi
> I have created a JIRA for this issue here. 
> <https://issues.apache.org/jira/browse/SPARK-13460?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC>
>  As for the pull request, my implementation is based on removing localSrcIds 
> and storing an array of offsets into localDstIds. I am running into issues 
> with this method when testing operations that create partitions from 
> existing edge partitions. The other way to implement this would be to use a 
> hashmap from local id to offset/length pairs. This seems to work fine, but 
> it carries more storage overhead, since each source vertex requires space 
> for three integers. 
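A sketch of the two layouts being compared above (plain Python with made-up field names, not GraphX's actual internals; edges are assumed sorted by local source id):

```python
edges = [(0, 3), (0, 5), (1, 2), (3, 4)]   # (local_src, local_dst), sorted by src
local_dst_ids = [dst for _, dst in edges]

# (a) Offset-array layout: drop localSrcIds entirely and keep one offset per
# local source id plus a sentinel. A source with no out-edges (id 2 here)
# reuses the next source's offset, so its slice comes out empty.
num_srcs = 1 + max(src for src, _ in edges)
offsets = [0] * (num_srcs + 1)
offsets[num_srcs] = len(edges)
pos = len(edges)
for src in range(num_srcs - 1, -1, -1):
    # Walk backwards to the first edge of this source.
    while pos > 0 and edges[pos - 1][0] == src:
        pos -= 1
    offsets[src] = pos
# Destinations of source s are local_dst_ids[offsets[s]:offsets[s + 1]].

# (b) Hashmap layout: map each source id that actually has edges to an
# (offset, length) pair -- roughly three integers of storage per source
# vertex (key, offset, length), which is the overhead mentioned above.
index = {}
for i, (src, _) in enumerate(edges):
    off, length = index.get(src, (i, 0))
    index[src] = (off, length + 1)
```

Layout (a) is denser but assumes the local source ids are consecutive; layout (b) tolerates gaps at the cost of the extra integers per source.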
> 
> Thanks, Adnan Haider
> B.S. Candidate, Computer Science
> Illinois Institute of Technology 
> www.adnanhaider.com
> 
> On Mon, Feb 22, 2016 at 8:16 AM, Adnan Haider <ahaid...@hawk.iit.edu> wrote:
> Yes, sounds good. I can submit the pull request.
> 
> On 22 Feb 2016 00:35, "Reynold Xin" <r...@databricks.com> wrote:
> + Joey
> 
> We think this is worth doing. Are you interested in submitting a pull request?
> 
> 
> On Sat, Feb 20, 2016 at 8:05 PM ahaider3 <ahaid...@hawk.iit.edu> wrote:
> Hi,
> I have been looking through the GraphX source code, dissecting the reason
> for its high memory consumption compared to the on-disk size of the graph. I
> have found that there may be room to reduce the memory footprint of the
> graph structures. I think the biggest savings can come from the localSrcIds
> and localDstIds in EdgePartitions.
> 
> In particular, instead of storing both a source and destination local id for
> each edge, we could store only the destination id. For example, after sorting
> edges by global source id, we can map the source vertices to local ids first,
> followed by the not-yet-mapped global destination ids. This makes localSrcIds
> sorted from 0 to n - 1, where n is the number of distinct global source ids.
> Then, instead of actually storing the local source id for each edge, we can
> store an array of size n, with each element storing an index into
> localDstIds. From my understanding, this would also eliminate the need to
> store an index for indexed scanning, since each element of that array marks
> the start of a cluster of edges sharing a source. From some extensive
> testing, this, along with some delta encoding strategies on localDstIds and
> the mapping structures, can reduce the memory consumption of the graph by
> nearly half.
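The delta encoding mentioned above can be sketched as follows (hypothetical; the post does not say which delta scheme was actually tested). Within a sorted run of ids, each value is replaced by its gap from the previous one, and small gaps can then be packed into fewer bytes with a variable-length encoding:

```python
def delta_encode(sorted_ids):
    """Replace each id in a sorted run with its gap from the previous id.

    Gaps are typically much smaller than the ids themselves, so they fit
    in fewer bits.
    """
    prev, out = 0, []
    for x in sorted_ids:
        out.append(x - prev)
        prev = x
    return out

def varint_bytes(deltas):
    """LEB128-style varint packing: any gap under 128 costs one byte.

    Each byte holds 7 payload bits; the high bit flags a continuation.
    """
    buf = bytearray()
    for d in deltas:
        while True:
            if d < 0x80:
                buf.append(d)       # final byte, high bit clear
                break
            buf.append((d & 0x7F) | 0x80)  # more bytes follow
            d >>= 7
    return bytes(buf)
```

For instance, the run [3, 5, 9, 300] becomes gaps [3, 2, 4, 291] and packs into 5 bytes instead of four fixed-width integers, which is the kind of saving that, combined with the offset-array layout, could plausibly account for the near-halving reported above.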
> 
> However, I am not entirely sure whether there is a reason to store both
> localSrcIds and localDstIds for each edge, in terms of integrating future
> functionality such as graph mutations. I noticed there was another post
> similar to this one as well, but it had no replies.
> 
> The idea is quite similar to the Netflix graph library
> <https://github.com/Netflix/netflix-graph>, and I would be happy to open a
> JIRA on this issue with partial improvements. But I may not be completely
> correct in my thinking!
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
