[ https://issues.apache.org/jira/browse/SPARK-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370595#comment-15370595 ]
Michał Wesołowski commented on SPARK-16478:
-------------------------------------------

If you run the code I provided on Databricks, you can see that, without materializing the returned graph, a simple count on its vertices takes about 20 minutes, whereas strongly connected components itself runs in about 2 minutes. I tried to use it on some real data and wasn't able to save the result because of this. After materializing the graph on every iteration, I can save the results with no problem. Materializing only within the outer loop caused less severe problems but wasn't sufficient.

> strongly connected components doesn't cache returned RDD
> --------------------------------------------------------
>
>                 Key: SPARK-16478
>                 URL: https://issues.apache.org/jira/browse/SPARK-16478
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.6.2
>            Reporter: Michał Wesołowski
>
> The Strongly Connected Components algorithm caches intermediate RDDs but doesn't
> cache the one that is going to be returned. When the graph is large compared
> to available memory and one runs an action on the returned RDD, the whole RDD
> has to be computed from scratch, which takes much more time than
> StronglyConnectedComponents alone.
> I managed to replicate the issue on the Databricks platform.
> [Here|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html]
> is the notebook.
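
For readers unfamiliar with the term, "materializing" the returned graph as described in the comment above can be sketched roughly as follows. This is only an illustration at the call site, not the change the reporter made inside the algorithm (the comment says materializing on every iteration was needed, which requires modifying StronglyConnectedComponents itself and is not shown here). The object name, the tiny example graph, the iteration count, and the local master setting are all assumptions made for the sketch; it needs Spark and GraphX on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.lib.StronglyConnectedComponents

// Hypothetical driver object, for illustration only.
object SccMaterializeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("scc-materialize-sketch")
      .setMaster("local[*]") // assumption: local run
    val sc = new SparkContext(conf)
    try {
      // Tiny made-up graph: 1 -> 2 -> 3 -> 1 forms a cycle, 4 hangs off it.
      val edges = sc.parallelize(Seq(
        Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(3L, 4L, 1)))
      val graph = Graph.fromEdges(edges, defaultValue = 0).cache()

      // Each vertex is labelled with an id identifying its component.
      val scc: Graph[VertexId, Int] =
        StronglyConnectedComponents.run(graph, numIter = 10)

      // Mark the returned graph for caching, then force evaluation with
      // actions so that the vertex and edge RDDs are actually computed.
      scc.cache()
      scc.vertices.count()
      scc.edges.count()

      // Later actions can reuse the cached partitions, provided they
      // fit in memory and have not been evicted.
      scc.vertices.collect().sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that the first action on the returned graph still pays the full recomputation cost reported in this issue; caching at the call site only helps the actions that follow it.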