I have a recursive algorithm that performs a few jobs on successively smaller RDDs, then a few more jobs on successively larger RDDs as the recursion unwinds, resulting in a somewhat deeply nested RDD lineage (a few dozen levels).
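To make the shape concrete, here's a minimal sketch of the pattern; the names, key/value types, and specific transformations are all made up for illustration and aren't my actual code:

    import org.apache.spark.rdd.RDD

    // Hypothetical sketch of the recursion's shape: each level shuffles
    // "down" to a smaller RDD, recurses, then shuffles again on the way
    // back "up", so the final RDD's lineage nests one level deeper per
    // recursive call.
    def recurse(rdd: RDD[(Long, Double)], depth: Int): RDD[(Long, Double)] = {
      if (depth == 0) {
        rdd
      } else {
        // a job or two on a successively smaller RDD...
        val smaller = rdd.filter(_._1 % 2 == 0).reduceByKey(_ + _)
        val below = recurse(smaller, depth - 1)
        // ...and more work on a successively larger RDD as we unwind
        below.union(rdd).reduceByKey(_ + _)
      }
    }

The point is just that every level contributes shuffle dependencies to the final RDD's lineage.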
I am observing significant delays starting jobs while the MapOutputTrackerMaster computes the sizes of the output statuses for all previous shuffles. By the end of my algorithm's execution, the driver spends about a minute doing this before each job, during which time my entire cluster sits idle. This output-status info is the same every time it's computed; no executors have joined or left the cluster.

In this gist <https://gist.github.com/ryan-williams/445ef8736a688bd78edb#file-job-108> you can see two jobs stalling for almost a minute each between "Starting job:" and "Got job"; with larger input datasets, my RDD lineages (and this latency) would presumably only grow. Additionally, the "DAG Visualization" on the job page of the web UI shows a huge, horizontally-scrolling lineage of thousands of stages, suggesting that the driver is tracking far more information than would seem necessary.

I'm assuming the short answer is that I need to truncate my RDDs' lineage, and that the only way to do that is by checkpointing them to disk. I've done that (sketch in the P.S. below) and it avoids this issue, but it means I am now serializing my entire dataset to disk dozens of times over the course of execution, which feels unnecessary and wasteful.

Is there a better way to deal with this scenario?

Thanks,
-Ryan
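P.S. For reference, the checkpoint-based workaround looks roughly like this. It's a minimal sketch: the checkpoint directory, the storage level, and the count() used to force materialization are illustrative choices, not my exact code.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // sc is the SparkContext; checkpoint files go to reliable storage
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Called on the RDD produced at each level of the recursion to cut
    // its lineage before the next level's jobs run.
    def truncateLineage[T](rdd: RDD[T]): RDD[T] = {
      rdd.persist(StorageLevel.MEMORY_AND_DISK)  // avoid recomputing the RDD when the checkpoint is written
      rdd.checkpoint()                            // mark for checkpointing; files are written by the next job
      rdd.count()                                 // force a job so the checkpoint (and lineage truncation) actually happens
      rdd
    }

This is what leads to the dataset being serialized to disk dozens of times over a run.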