FYI, a colleague in the Spark team sat down to address the long-standing
and neglected JIRA "Add some Abortable.abort() interface for streams etc
which can be terminated":

https://issues.apache.org/jira/browse/HADOOP-16906

PR: https://github.com/apache/hadoop/pull/2667
A second PR adds markdown documentation to go with it, plus some tuning of
the API/S3A implementation:
https://github.com/apache/hadoop/pull/2684

We're happy with this. As well as working in the S3A stream, it should work
with any object store whose output only becomes visible after close().
Obviously this excludes HDFS, file:// and the Azure stores: anything where
create() creates the file, hflush() flushes to it, etc.

For Spark and similar, this will enable checkpointing direct to S3 or any
other store whose stream implements the same interface. You don't need to
write to a temp location: you can write to the final destination, over the
existing data, and choose at the end whether to complete or abort the write.
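
To make that concrete, here is a minimal sketch in Java of the
write-then-decide pattern. It does not use the Hadoop classes: the Abortable
interface, ToyObjectStore and writeCheckpoint() below are hypothetical
stand-ins invented for illustration, and the real signatures are whatever the
PRs above define. The toy store only publishes data in close(), which is the
property abort() relies on. Treating close() after abort() as a no-op is an
assumption of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the proposed interface, not the Hadoop one. */
interface Abortable {
    void abort();
}

/** Toy in-memory object store: a write is visible only after close(). */
class ToyObjectStore {
    private final Map<String, byte[]> objects = new HashMap<>();

    void put(String key, String text) {
        objects.put(key, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns the stored text, or null if the key does not exist. */
    String getText(String key) {
        byte[] data = objects.get(key);
        return data == null ? null : new String(data, StandardCharsets.UTF_8);
    }

    OutputStream create(String key) {
        return new ToyStream(key);
    }

    /** Buffers locally; publishes in close(), discards in abort(). */
    private class ToyStream extends OutputStream implements Abortable {
        private final String key;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private boolean finished = false;

        ToyStream(String key) {
            this.key = key;
        }

        @Override
        public void write(int b) throws IOException {
            if (finished) {
                throw new IOException("stream is closed or aborted");
            }
            buffer.write(b);
        }

        @Override
        public void close() {
            if (!finished) {
                finished = true;
                objects.put(key, buffer.toByteArray()); // data becomes visible here
            }
        }

        @Override
        public void abort() {
            finished = true; // a later close() publishes nothing
            buffer.reset();
        }
    }
}

class CheckpointDemo {
    /** Writes straight to the final key, then commits or aborts. */
    static void writeCheckpoint(ToyObjectStore store, String key,
                                String text, boolean commit) {
        try {
            OutputStream out = store.create(key);
            out.write(text.getBytes(StandardCharsets.UTF_8));
            if (commit) {
                out.close();
            } else if (out instanceof Abortable) {
                ((Abortable) out).abort();
                out.close();
            } else {
                throw new IllegalStateException("stream cannot be aborted");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ToyObjectStore store = new ToyObjectStore();
        store.put("checkpoint", "v1");
        writeCheckpoint(store, "checkpoint", "v2-incomplete", false);
        System.out.println("after abort:  " + store.getText("checkpoint"));
        writeCheckpoint(store, "checkpoint", "v2", true);
        System.out.println("after commit: " + store.getText("checkpoint"));
    }
}
```

In this toy, the aborted write leaves the existing object untouched, and only
the committed write replaces it. A store where create() makes the file
visible immediately could not offer the same guarantee.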

Comments welcome

-Steve
