Yeah, they're all skipped; here's a gif <http://f.cl.ly/items/413l3k363u290U173W00/Screen%20Recording%202016-01-23%20at%2005.08%20PM.gif> of scrolling through the DAG viz.
Thanks for the JIRA pointer; I'll keep an eye on that one!

On Sat, Jan 23, 2016 at 4:53 PM Mark Hamstra <m...@clearstorydata.com> wrote:

> Do all of those thousands of Stages end up being actual Stages that need to be computed, or are the vast majority of them eventually "skipped" Stages? If the latter, then there is the potential to modify the DAGScheduler to avoid much of this behavior:
> https://issues.apache.org/jira/browse/SPARK-10193
> https://github.com/apache/spark/pull/8427
>
> On Sat, Jan 23, 2016 at 1:40 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>
>> I have a recursive algorithm that performs a few jobs on successively smaller RDDs, and then a few more jobs on successively larger RDDs as the recursion unwinds, resulting in a somewhat deeply nested (a few dozen levels) RDD lineage.
>>
>> I am observing significant delays starting jobs while the MapOutputTrackerMaster calculates the sizes of the output statuses for all previous shuffles. By the end of my algorithm's execution, the driver spends about a minute doing this before each job, during which time my entire cluster is sitting idle. This output-status info is the same every time it is computed; no executors have joined or left the cluster.
>>
>> In this gist <https://gist.github.com/ryan-williams/445ef8736a688bd78edb#file-job-108> you can see two jobs stalling for almost a minute each between "Starting job:" and "Got job"; with larger input datasets, my RDD lineages, and this latency, would presumably only grow.
>>
>> Additionally, the "DAG Visualization" on the job page of the web UI shows a huge, horizontally scrolling lineage of thousands of stages, indicating that the driver is tracking far more information than would seem necessary.
>>
>> I'm assuming the short answer is that I need to truncate my RDDs' lineage, and the only way to do that is by checkpointing them to disk. I've done that and it avoids this issue, but it means that I am now serializing my entire dataset to disk dozens of times during the course of execution, which feels unnecessary/wasteful.
>>
>> Is there a better way to deal with this scenario?
>>
>> Thanks,
>>
>> -Ryan
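
For concreteness, here is a minimal Scala sketch of the checkpoint-based lineage truncation described in the quoted message above. It is not from this thread; the recursion shape, the checkpoint interval, and the checkpoint path are all hypothetical. The idea is to persist before checkpointing so the checkpoint write does not recompute the RDD, run an action to materialize the checkpoint, and thereby keep the lineage the driver has to walk per job short. localCheckpoint() (Spark 1.5+) is noted as a way to avoid the repeated writes to reliable storage, at the cost of fault tolerance.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object LineageTruncationSketch {

  // Stand-in for the recursive algorithm: each level runs jobs on a smaller
  // RDD, then more work happens as the recursion unwinds.
  def recurse(rdd: RDD[Long], depth: Int): RDD[Long] = {
    if (depth >= 20 || rdd.isEmpty()) {
      rdd
    } else {
      val next = rdd.filter(_ % 2 == 0).map(_ / 2)

      // Every few levels, cut the lineage so the driver doesn't have to walk
      // every prior shuffle when planning each new job.
      if (depth % 5 == 4) {
        next.persist(StorageLevel.MEMORY_AND_DISK) // avoid recomputing during the checkpoint write
        next.checkpoint()                          // or next.localCheckpoint() (Spark 1.5+) to skip
        next.count()                               // the reliable-storage write; an action is needed
      }                                            // to actually materialize the checkpoint

      val deeper = recurse(next, depth + 1)

      // Work done while unwinding; lineage above the last checkpoint stays short.
      deeper.union(next)
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-truncation-sketch"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // placeholder path on a fault-tolerant FS

    val result = recurse(sc.parallelize(1L to 1000000L), depth = 0)
    println(result.count())

    sc.stop()
  }
}

How often to checkpoint is a trade-off between the recomputation/IO cost of each checkpoint and the per-job planning delay described above.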