Hi experts,

I'm reporting a problem with Spark GraphX. I submit Spark jobs through Zeppelin, where the Scala environment shares a single SparkContext and SQLContext instance. I call the connected components algorithm as part of some business logic, and I found that every time a job finishes, some of the graph's storage RDDs are not released. After several runs there are many leftover storage RDDs, even though all the jobs have finished.
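Here is a minimal sketch of how I observe the leak. The input path is only a placeholder, and sc is the SparkContext shared by Zeppelin:

    import org.apache.spark.graphx.GraphLoader

    // Placeholder input; any edge list reproduces the behavior.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///tmp/edges.txt").cache()
    val cc = graph.connectedComponents()
    cc.vertices.count()  // materialize the result

    // Release the RDDs this job cached itself.
    graph.unpersistVertices(blocking = false)
    graph.edges.unpersist(blocking = false)

    // Vertex/edge RDDs cached inside Pregel are still listed here,
    // even though the job has finished.
    sc.getPersistentRDDs.foreach { case (id, rdd) => println(s"$id -> $rdd") }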
So I checked the code of connectedComponents, and there may be a problem in Pregel.scala: when the graph parameter has already been cached, there is no way to unpersist it afterwards. I added the two lines marked "// added" below to solve the problem:

    def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
       (graph: Graph[VD, ED],
        initialMsg: A,
        maxIterations: Int = Int.MaxValue,
        activeDirection: EdgeDirection = EdgeDirection.Either)
       (vprog: (VertexId, VD, A) => VD,
        sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
        mergeMsg: (A, A) => A)
      : Graph[VD, ED] = {
      // ...
      var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
      graph.unpersistVertices(blocking = false)  // added
      graph.edges.unpersist(blocking = false)    // added
      // ...
    } // end of apply

I'm not sure if this is a bug. Thank you for your time,
juntao
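P.S. Until Pregel unpersists the input graph itself, a stopgap I can run from the Zeppelin session after a job completes is to clear the leftover cached RDDs by hand. This is only a sketch, and it assumes no concurrent notebook still needs any cached RDD in the shared SparkContext:

    // Unpersist every RDD still cached in the shared SparkContext.
    // Assumption: no other Zeppelin job is using these cached RDDs.
    sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))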