I would like to use Spark for some algorithms where I make no attempt to keep data in memory, so I read from HDFS and write to HDFS at every step. Of course I would like every step to be evaluated only once, and I have no need for Spark's RDD lineage information, since I persist to reliable storage.
The trouble is that I am not sure how to proceed.

rdd.checkpoint() seems like the obvious candidate to force my intermediate data to HDFS and cut the lineage, but rdd.checkpoint() does not actually trigger a job. It only runs after some other action has already triggered one, which leads to recomputation.

The suggestion in the docs is to do rdd.cache(); rdd.checkpoint(), but that won't work for me since the data does not fit in memory.

Instead I could do rdd.persist(StorageLevel.DISK_ONLY_2); rdd.checkpoint(), but that writes the data to disk twice in a row, which seems wasteful.

Finally I can resort to rdd.saveAsObjectFile(...) followed by sc.objectFile(...), but that seems like a rather broken abstraction.

Any ideas? I feel like I am missing something obvious, or am I running yet again into Spark's historical in-memory bias?
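For concreteness, here is a minimal sketch of the last two variants I mean (persist-to-disk plus checkpoint, versus a saveAsObjectFile/objectFile round trip). The HDFS paths, the map/filter steps, and the CheckpointSketch object are made-up placeholders, not my actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("disk-only-steps")
      .setIfMissing("spark.master", "local[*]")   // placeholder master for a standalone run
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical checkpoint dir

    val input = sc.textFile("hdfs:///data/step0")   // hypothetical input path
    val step1 = input.map(_.toUpperCase)            // stand-in for one algorithm step

    // Variant A: persist to disk, then checkpoint. The checkpoint only happens
    // once an action runs, so the data ends up written to disk twice
    // (once by persist, once by the checkpoint).
    step1.persist(StorageLevel.DISK_ONLY_2)
    step1.checkpoint()
    step1.count()                                   // action that materializes both writes

    // Variant B: explicit save/read round trip through HDFS, which cuts the
    // lineage entirely but feels like a broken abstraction.
    step1.saveAsObjectFile("hdfs:///data/step1")    // hypothetical output path
    val step1Reloaded = sc.objectFile[String]("hdfs:///data/step1")
    val step2 = step1Reloaded.filter(_.nonEmpty)    // next step reads from HDFS, not lineage
    step2.saveAsObjectFile("hdfs:///data/step2")

    sc.stop()
  }
}
```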