Yes, sounds good. I can submit the pull request.

On 22 Feb 2016 00:35, "Reynold Xin" <r...@databricks.com> wrote:
> + Joey
>
> We think this is worth doing. Are you interested in submitting a pull
> request?
>
>
> On Sat, Feb 20, 2016 at 8:05 PM ahaider3 <ahaid...@hawk.iit.edu> wrote:
>
>> Hi,
>> I have been looking through the GraphX source code, dissecting the
>> reason for its high memory consumption compared to the on-disk size of
>> the graph. I have found that there may be room to reduce the memory
>> footprint of the graph structures. I think the biggest savings can come
>> from the localSrcIds and localDstIds in EdgePartitions.
>>
>> In particular, instead of storing both a source and a destination local
>> ID for each edge, we could store only the destination ID. For example,
>> after sorting edges by global source ID, we can map each of the source
>> vertices to local values first, followed by the still-unmapped global
>> destination IDs. This would make localSrcIds sorted, running from 0 to
>> n, where n is the number of distinct global source IDs. Then, instead of
>> actually storing the local source ID for each edge, we can store an
>> array of size n, with each element storing an index into localDstIds.
>> From my understanding, this would also eliminate the need for storing an
>> index for indexed scanning, since each element in localSrcIds would be
>> the start of a cluster. From some extensive testing, this, along with
>> some delta encoding strategies on localDstIds and the mapping
>> structures, can reduce the memory consumption of the graph by nearly
>> half.
>>
>> However, I am not entirely sure whether there is a reason for storing
>> both localSrcIds and localDstIds for each edge in terms of integrating
>> future functionality, such as graph mutations. I noticed there was
>> another post similar to this one as well, but it had no replies.
>>
>> The idea is quite similar to the Netflix graph library
>> <https://github.com/Netflix/netflix-graph>, and I would be happy to open
>> a JIRA on this issue with partial improvements.
>> But I may not be completely correct with my thinking!
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-Encoding-to-reduce-GraphX-s-static-graph-memory-consumption-tp16373.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
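
For readers following the thread, the layout proposed in the quoted message can be sketched roughly as below. This is only an illustration in Python, not GraphX code: the function names (encode_partition, out_edges, delta_encode, delta_decode) are hypothetical, the offsets array carries one extra trailing entry as a sentinel (an assumption, the message describes an array of size n), and the delta step only computes gaps, since the message does not say which delta encoding was tested.

```python
from itertools import accumulate


def encode_partition(edges):
    """Encode (global_src, global_dst) pairs without a per-edge source id.

    Returns (local_to_global, offsets, local_dst_ids), where the edges of
    local source s are local_dst_ids[offsets[s]:offsets[s + 1]].
    """
    edges = sorted(edges)  # sort by global source id, then destination id

    # Assign local ids to the distinct sources first (0 .. n-1) ...
    local_to_global = []
    global_to_local = {}
    for src, _ in edges:
        if src not in global_to_local:
            global_to_local[src] = len(local_to_global)
            local_to_global.append(src)
    # ... followed by the destinations that are not mapped yet.
    for _, dst in edges:
        if dst not in global_to_local:
            global_to_local[dst] = len(local_to_global)
            local_to_global.append(dst)

    offsets = []        # offsets[s] = start of source s's cluster
    local_dst_ids = []  # one entry per edge
    prev_src = None
    for src, dst in edges:
        if src != prev_src:
            offsets.append(len(local_dst_ids))
            prev_src = src
        local_dst_ids.append(global_to_local[dst])
    offsets.append(len(local_dst_ids))  # sentinel: end of the last cluster
    return local_to_global, offsets, local_dst_ids


def out_edges(local_src, local_to_global, offsets, local_dst_ids):
    """Return the global destination ids of one local source vertex."""
    cluster = local_dst_ids[offsets[local_src]:offsets[local_src + 1]]
    return [local_to_global[d] for d in cluster]


def delta_encode(values):
    """Replace each value by its signed difference from the previous one."""
    return [b - a for a, b in zip([0] + values[:-1], values)]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


edges = [(10, 20), (10, 30), (20, 10), (40, 30)]
local_to_global, offsets, local_dst_ids = encode_partition(edges)
```

Because the sources receive local ids 0 to n-1 in sorted order, the position in offsets is itself the local source id, which is why no separate per-edge source array or clustering index is needed in this sketch. Any actual memory saving from the delta step would additionally depend on a compact (for example variable-length) representation of the gaps, which is not shown here.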