+ Joey

We think this is worth doing. Are you interested in submitting a pull request?
On Sat, Feb 20, 2016 at 8:05 PM ahaider3 <ahaid...@hawk.iit.edu> wrote:

> Hi,
>
> I have been looking through the GraphX source code to understand why its
> memory consumption is so high compared to the on-disk size of the graph.
> I have found that there may be room to reduce the memory footprint of the
> graph structures. I think the biggest savings can come from localSrcIds
> and localDstIds in EdgePartition.
>
> In particular, instead of storing both a source and a destination local ID
> for each edge, we could store only the destination ID. For example, after
> sorting edges by global source ID, we can map the source vertices to local
> values first, followed by the global destination IDs that are still
> unmapped. This would make localSrcIds sorted, running from 0 to n - 1,
> where n is the number of distinct global source IDs. Then, instead of
> storing the local source ID for each edge, we can store an array of size
> n, with each element holding an index into localDstIds. From my
> understanding, this would also eliminate the need for a separate index for
> indexed scanning, since each element in localSrcIds would mark the start
> of a cluster. In my testing, this, along with some delta-encoding
> strategies on localDstIds and the mapping structures, can reduce the
> memory consumption of the graph by nearly half.
>
> However, I am not sure whether there is a reason for storing both
> localSrcIds and localDstIds for each edge, for example to support future
> functionality such as graph mutations. I noticed there was another post
> similar to this one, but it had no replies.
>
> The idea is quite similar to the Netflix graph library
> <https://github.com/Netflix/netflix-graph>, and I would be happy to open a
> JIRA on this issue with partial improvements. But I may not be completely
> correct in my thinking!
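For readers following the thread, the layout described in the quoted message can be sketched roughly as below. This is only an illustration in Python, not GraphX code: the names `build_compact_partition`, `iter_edges`, `src_offsets`, `local_dst_ids`, and `local_to_global` are hypothetical, and the offsets array here has n + 1 entries (one trailing sentinel) rather than n, purely to keep the decoding loop simple. Delta encoding of the destination IDs is not shown.

```python
def build_compact_partition(edges):
    """Build an offsets-based layout from (global_src, global_dst) pairs.

    Hypothetical sketch: per-edge source IDs are replaced by one offset
    per distinct source vertex, pointing into the destination array.
    """
    sorted_edges = sorted(edges)  # sort by global source ID, then destination

    global_to_local = {}
    local_to_global = []

    # Distinct sources get local IDs 0..n-1, in sorted order.
    for src, _ in sorted_edges:
        if src not in global_to_local:
            global_to_local[src] = len(local_to_global)
            local_to_global.append(src)
    num_sources = len(local_to_global)

    # Destinations that are not also sources get the remaining local IDs.
    for _, dst in sorted_edges:
        if dst not in global_to_local:
            global_to_local[dst] = len(local_to_global)
            local_to_global.append(dst)

    # Count edges per source, then prefix-sum the counts into offsets.
    src_offsets = [0] * (num_sources + 1)
    local_dst_ids = []
    for src, dst in sorted_edges:
        src_offsets[global_to_local[src] + 1] += 1
        local_dst_ids.append(global_to_local[dst])
    for i in range(num_sources):
        src_offsets[i + 1] += src_offsets[i]

    return src_offsets, local_dst_ids, local_to_global


def iter_edges(src_offsets, local_dst_ids, local_to_global):
    """Yield (global_src, global_dst) pairs back out of the compact layout."""
    for local_src in range(len(src_offsets) - 1):
        start, end = src_offsets[local_src], src_offsets[local_src + 1]
        for pos in range(start, end):
            yield local_to_global[local_src], local_to_global[local_dst_ids[pos]]
```

In this sketch each cluster of edges sharing a source is the slice of `local_dst_ids` between two consecutive offsets, so the offsets array also serves as the index for scanning by source.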