Thanks. I tried persist() on the RDD. The runtimes appear to have doubled now (without persist() it was ~7s per iteration and now it's ~15s). I am running standalone Spark on an 8-core machine. Any thoughts on why the increase in runtime?
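For context, here is a minimal standalone version of the pattern I'm running. The edge data, F (a per-source degree count), func (a println), and the stopping criteria are toy stand-ins for my real ones; the comments on the storage levels reflect my understanding of the MEMORY_ONLY vs MEMORY_ONLY_SER tradeoff.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object GroupedIteration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("grouped-iteration").setMaster("local[8]"))

    // Edge list (sourceVertex, destVertex); toy stand-in for my RDD R.
    val edges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 1L)))

    // Group once, outside the loop, and persist so the groupBy shuffle is
    // not redone every iteration. MEMORY_ONLY_SER keeps partitions
    // serialized (less memory, but CPU to deserialize on every access);
    // MEMORY_ONLY keeps plain deserialized objects (more memory, no
    // per-iteration deserialization cost).
    val grouped = edges.groupBy(_._1).persist(StorageLevel.MEMORY_ONLY_SER)

    var iter = 0
    var done = false
    while (!done) {
      // Stand-in for F: per-source degree; stand-in for func: println.
      val res = grouped.mapValues(_.size)
      res.collect().foreach(println)
      iter += 1
      done = iter >= 3 // stand-in stopping criteria
    }
    sc.stop()
  }
}
```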
On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid <iras...@cloudera.com> wrote:
>
> val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER)
> // or whatever persistence makes more sense for you ...
> while(true) {
>   val res = grouped.flatMap(F)
>   res.collect.foreach(func)
>   if(criteria)
>     break
> }
>
> On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan <kvi...@vt.edu>
> wrote:
>
>> Hi,
>>
>> I have the following use case.
>>
>> (1) I have an RDD of edges of a graph (say R).
>> (2) do a groupBy on R (by say source vertex) and call a function F on
>> each group.
>> (3) collect the results from Fs and do some computation
>> (4) repeat the above steps until some criteria is met
>>
>> In (2), the groups are always going to be the same (since R is grouped by
>> source vertex).
>>
>> Question:
>> Is R distributed every iteration (when in (2)) or is it distributed only
>> once when it is created?
>>
>> A sample code snippet is below.
>>
>> while(true) {
>>   val res = R.groupBy[VertexId](G).flatMap(F)
>>   res.collect.foreach(func)
>>   if(criteria)
>>     break
>> }
>>
>> Since the groups remain the same, what is the best way to go about
>> implementing the above logic?
>