Hi! We are running the PageRank example from the Spark Java examples package on a very large input graph. The code is available here (in Spark's GitHub repo).
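For context, here is a rough sketch of the iteration loop we are running. It is modeled on the JavaPageRank example, not copied from it: the variable names, the whitespace-separated input format, and the Spark 2.x flatMapToPair signature are assumptions on our side.

```java
import java.util.ArrayList;
import java.util.List;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PageRankSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("PageRankSketch"));

    // Each input line: "sourceUrl targetUrl" (assumption about the input format).
    JavaRDD<String> lines = sc.textFile(args[0]);

    // Adjacency lists, cached because they are reused in every iteration.
    JavaPairRDD<String, Iterable<String>> links = lines
        .mapToPair(line -> {
          String[] parts = line.split("\\s+");
          return new Tuple2<>(parts[0], parts[1]);
        })
        .distinct()
        .groupByKey()
        .cache();

    JavaPairRDD<String, Double> ranks = links.mapValues(v -> 1.0);

    int numIterations = Integer.parseInt(args[1]);
    for (int i = 0; i < numIterations; i++) {
      // The "contribs" RDD: one record per outgoing edge, recomputed in every
      // iteration. The join and the reduceByKey below both shuffle, and the
      // shuffle files are what we see piling up under the tmp directory.
      JavaPairRDD<String, Double> contribs = links.join(ranks).values()
          .flatMapToPair(pair -> {
            int urlCount = 0;
            for (String dest : pair._1()) {
              urlCount++;
            }
            List<Tuple2<String, Double>> results = new ArrayList<>();
            for (String dest : pair._1()) {
              results.add(new Tuple2<>(dest, pair._2() / urlCount));
            }
            return results.iterator();
          });

      // New ranks for the next iteration.
      ranks = contribs.reduceByKey(Double::sum).mapValues(sum -> 0.15 + 0.85 * sum);
    }

    ranks.saveAsTextFile(args[2]);
    sc.stop();
  }
}
```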
During execution, the framework generates a huge amount of intermediate data in each iteration (i.e. the contribs RDD). This data is temporary, but Spark does not clear the intermediate data of previous iterations: if we are in the middle of the 20th iteration, the temporary data of all previous iterations (iterations 0 to 19) is still kept in the tmp directory, so the tmp directory grows linearly with the number of iterations.

It would seem reasonable to keep only the data of the previous iteration, because if the current iteration fails, the job can be resumed from that data. So why does Spark keep the intermediate data of ALL previous iterations? And how can we force Spark to clear this intermediate data while the job is running?

Kind regards,
Ali hadian
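P.S. To clarify what we expected, below is the kind of per-iteration cleanup we had in mind: keep only the previous iteration's RDD and drop it once the new one is materialized. This is just a toy sketch (the data, names, and storage level are made up); persist(), count(), and unpersist() are real RDD methods, but we do not know whether this actually removes the shuffle files from the tmp directory, which is exactly what we are asking.

```java
import java.util.Arrays;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class UnpersistSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("UnpersistSketch"));

    JavaPairRDD<String, Double> ranks = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>("a", 1.0), new Tuple2<>("b", 1.0)));
    ranks.persist(StorageLevel.MEMORY_AND_DISK());

    for (int i = 0; i < 20; i++) {
      JavaPairRDD<String, Double> prev = ranks;

      // Stand-in for the join/flatMap/reduceByKey of the real example;
      // the reduceByKey still shuffles, which is what writes to the tmp dir.
      ranks = prev.mapValues(r -> r * 0.85).reduceByKey(Double::sum);
      ranks.persist(StorageLevel.MEMORY_AND_DISK());
      ranks.count();          // materialize the new iteration before dropping the old one

      prev.unpersist(true);   // what we expected would free the previous iteration's data
    }

    sc.stop();
  }
}
```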