Actually, another improvement would be to use something like compressed sparse row (CSR) encoding, which can store A and A^T relatively efficiently (I think using 5 arrays instead of 6). There is also the option to be more cache-aware by using something like a block compressed sparse row encoding.
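For reference, here is a minimal sketch of plain CSR construction (not GraphX's actual code; `buildCsr`, the array names, and the tiny edge list are all illustrative): `rowPtr` holds one offset per vertex plus a sentinel, and `colIdx` holds every edge's destination, grouped by source. Storing A^T as well would just mean running the same build over the reversed edges.

```java
import java.util.Arrays;

public class CsrDemo {
    // Build CSR (rowPtr, colIdx) from an edge list via a counting sort.
    // rowPtr[v] .. rowPtr[v + 1] delimits v's out-neighbors inside colIdx.
    static int[][] buildCsr(int n, int[][] edges) {
        int[] rowPtr = new int[n + 1];
        for (int[] e : edges) rowPtr[e[0] + 1]++;               // out-degrees
        for (int v = 0; v < n; v++) rowPtr[v + 1] += rowPtr[v]; // prefix sum
        int[] colIdx = new int[edges.length];
        int[] next = Arrays.copyOf(rowPtr, n + 1);
        for (int[] e : edges) colIdx[next[e[0]]++] = e[1];      // scatter dsts
        return new int[][] {rowPtr, colIdx};
    }

    public static void main(String[] args) {
        // Tiny directed graph: 0->1, 0->2, 1->2, 2->0
        int[][] csr = buildCsr(3, new int[][] {{0, 1}, {0, 2}, {1, 2}, {2, 0}});
        System.out.println(Arrays.toString(csr[0])); // rowPtr: [0, 2, 3, 4]
        System.out.println(Arrays.toString(csr[1])); // colIdx: [1, 2, 2, 0]
    }
}
```

Note that per-edge storage drops from two endpoint ids to one, since the source is implied by position; that is the same savings the proposal below aims at.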
Ankur also at some point looked at renumbering the vertex ids on each edge to be consecutive with respect to the edge partition. This is a pretty common technique in most graph processing systems, but in our early experiments we didn't see much of a gain. In theory, however, this should allow for lower bit-precision encoding and a denser representation, and eliminate the need to use a hash lookup when joining vertices with edges.

Joey

> On Feb 23, 2016, at 1:13 PM, Adnan Haider <ahaid...@hawk.iit.edu> wrote:
>
> Hi,
> I have created a jira for this issue here:
> https://issues.apache.org/jira/browse/SPARK-13460?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC
>
> As for the pull request, my implementation is based on removing localSrcIds
> and storing an array of offsets into localDstIds. I am running into issues
> with this method when testing operations which create partitions from
> existing edge partitions. The other method for implementing this would be to
> use a hashmap from local id to offset/length pairs. This seems to work fine,
> but there is more storage overhead associated with this since each source
> vertex requires space for three integers.
>
> Thanks,
> Adnan Haider
> B.S. Candidate, Computer Science
> Illinois Institute of Technology
> www.adnanhaider.com
>
> On Mon, Feb 22, 2016 at 8:16 AM, Adnan Haider <ahaid...@hawk.iit.edu> wrote:
> Yes, sounds good. I can submit the pull request.
>
> On 22 Feb 2016 00:35, "Reynold Xin" <r...@databricks.com> wrote:
> + Joey
>
> We think this is worth doing. Are you interested in submitting a pull request?
>
> On Sat, Feb 20, 2016 at 8:05 PM, ahaider3 <ahaid...@hawk.iit.edu> wrote:
> Hi,
> I have been looking through the GraphX source code, dissecting the reason
> for its high memory consumption compared to the on-disk size of the graph.
> I have found that there may be room to reduce the memory footprint of the
> graph structures. I think the biggest savings can come from the localSrcIds
> and localDstIds in EdgePartitions.
>
> In particular, instead of storing both a source and a destination local id for
> each edge, we could store only the destination id. For example, after sorting
> edges by global source id, we can map each of the source vertices to local
> values first, followed by the unmapped global destination ids. This would make
> localSrcIds sorted from 0 to n, where n is the number of distinct global
> source ids. Then, instead of actually storing the local source id for each
> edge, we can store an array of size n, with each element storing an index
> into localDstIds. From my understanding, this would also eliminate the need
> for storing an index for indexed scanning, since each element in localSrcIds
> would be the start of a cluster. From some extensive testing, this, along
> with some delta encoding strategies on localDstIds and the mapping
> structures, can reduce memory consumption of the graph by nearly half.
>
> However, I am not entirely sure if there is any reason for storing both
> localSrcIds and localDstIds for each edge in terms of integration of future
> functionality, such as graph mutations. I noticed there was another post
> similar to this one as well, but it had no replies.
>
> The idea is quite similar to the Netflix graph library
> <https://github.com/Netflix/netflix-graph>, and I would be happy to open a
> jira on this issue with partial improvements. But I may not be completely
> correct with my thinking!
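The offset-array scheme quoted above can be sketched roughly as follows (an illustrative sketch, not the actual patch; `buildOffsets` and the sample ids are hypothetical). Because edges are sorted by source, each distinct source owns a contiguous run of destinations, so one offset per distinct source plus a sentinel replaces a per-edge localSrcIds entry:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetDemo {
    // Given per-edge source ids that are already sorted, record where each
    // distinct source's run of destinations begins, plus a final sentinel.
    // Local source v's destinations then live in
    // localDstIds[offsets[v] .. offsets[v + 1]).
    static int[] buildOffsets(long[] sortedSrcIds) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < sortedSrcIds.length; i++)
            if (i == 0 || sortedSrcIds[i] != sortedSrcIds[i - 1])
                offsets.add(i);                  // start of a new source's run
        offsets.add(sortedSrcIds.length);        // sentinel: end of last run
        return offsets.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // Six edges from three distinct sources (global ids 10, 42, 99).
        long[] src = {10, 10, 10, 42, 42, 99};
        System.out.println(Arrays.toString(buildOffsets(src))); // [0, 3, 5, 6]
    }
}
```

The run boundaries also double as the index for indexed scanning, which is why the proposal can drop the separate per-cluster index.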
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org