I would like to use Spark for some algorithms where I make no attempt to keep data in memory, so I read from HDFS and write to HDFS at every step. Of course I would like every step to be evaluated only once, and I have no need for Spark's RDD lineage information, since I persist to reliable storage.
The trouble is that I am not sure how to proceed.

rdd.checkpoint() seems like the obvious candidate to force my intermediate data to HDFS and cut the lineage, but rdd.checkpoint() does not actually trigger a job. It only runs after some other action has already triggered one, which leads to recomputation.

The suggestion in the docs is to do rdd.cache(); rdd.checkpoint(), but that won't work for me since the data does not fit in memory.

Instead I could do rdd.persist(StorageLevel.DISK_ONLY_2); rdd.checkpoint(), but that writes the data to disk twice in a row, which seems wasteful.

Finally I can resort to rdd.saveAsObjectFile(...) followed by sc.objectFile(...), but that seems like a rather broken abstraction.

Any ideas? I feel like I am missing something obvious, or am I running yet again into Spark's historical in-memory bias?
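For concreteness, here is a minimal sketch of the last two variants I mean (persist-to-disk plus checkpoint, versus a saveAsObjectFile/objectFile round trip). The HDFS paths, the map/filter steps, and the CheckpointSketch object are made-up placeholders, not my actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("disk-only-steps")
      .setIfMissing("spark.master", "local[*]")   // placeholder master for a standalone run
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical checkpoint dir

    val input = sc.textFile("hdfs:///data/step0")   // hypothetical input path
    val step1 = input.map(_.toUpperCase)            // stand-in for one algorithm step

    // Variant A: persist to disk, then checkpoint. The checkpoint only happens
    // once an action runs, so the data ends up written to disk twice
    // (once by persist, once by the checkpoint).
    step1.persist(StorageLevel.DISK_ONLY_2)
    step1.checkpoint()
    step1.count()                                   // action that materializes both writes

    // Variant B: explicit save/read round trip through HDFS, which cuts the
    // lineage entirely but feels like a broken abstraction.
    step1.saveAsObjectFile("hdfs:///data/step1")    // hypothetical output path
    val step1Reloaded = sc.objectFile[String]("hdfs:///data/step1")
    val step2 = step1Reloaded.filter(_.nonEmpty)    // next step reads from HDFS, not lineage
    step2.saveAsObjectFile("hdfs:///data/step2")

    sc.stop()
  }
}
```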