I am trying to use the pluggable WAL, but it can be used only with checkpointing turned on. Thus I still need to have a Hadoop-compatible file system.
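For context, a minimal sketch of what I am configuring, in Scala (com.example.MyS3WriteAheadLog is a placeholder for my own implementation; the config keys are the ones added by SPARK-7056):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-test")
  // Plug a custom WAL into both the driver and the receivers.
  .set("spark.streaming.driver.writeAheadLog.class",
       "com.example.MyS3WriteAheadLog")
  .set("spark.streaming.receiver.writeAheadLog.class",
       "com.example.MyS3WriteAheadLog")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// This is the part I would like to avoid: checkpointing still requires
// a Hadoop-compatible file system even though the WAL is pluggable.
ssc.checkpoint("hdfs:///checkpoints/wal-test")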
Is there something like pluggable checkpointing? Or can the WAL be used
without checkpointing? What happens when the WAL is available but the
checkpoint directory is lost? Thanks!

On 18 September 2015 at 05:47, Tathagata Das <t...@databricks.com> wrote:

> I don't think it would work with multipart upload either. The file is
> not visible until the multipart upload is explicitly closed. So even if
> each write were a part upload, none of the parts would be visible until
> the multipart upload is closed.
>
> TD
>
> On Fri, Sep 18, 2015 at 1:55 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> > On 17 Sep 2015, at 21:40, Tathagata Das <t...@databricks.com> wrote:
>> >
>> > Actually, the current WAL implementation (as of Spark 1.5) does not
>> > work with S3 because S3 does not support flushing. Basically, the
>> > current implementation assumes that after write + flush, the data is
>> > immediately durable, and readable if the system crashes without
>> > closing the WAL file. This does not work with S3, as data is durable
>> > if and only if the S3 file output stream is cleanly closed.
>>
>> More precisely, unless you turn multipart uploads on, the s3n/s3a
>> clients Spark uses *don't even upload anything to S3*.
>>
>> It's not a filesystem, and you have to bear that in mind.
>>
>> Amazon's own S3 client used in EMR behaves differently; it may be
>> usable as a destination (I haven't tested).
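PS: for reference, the interface I am implementing is the WriteAheadLog
developer API from Spark 1.5. A skeleton only; MyRecordHandle and the
??? bodies are placeholders:

import java.nio.ByteBuffer
import java.util.{Iterator => JIterator}

import org.apache.spark.streaming.util.{WriteAheadLog, WriteAheadLogRecordHandle}

// A real handle must carry whatever is needed to locate the record
// again, e.g. an S3 object key and offset.
class MyRecordHandle(val key: String) extends WriteAheadLogRecordHandle

// Spark instantiates this class reflectively; see WriteAheadLogUtils
// for the constructor signature it expects.
class MyS3WriteAheadLog extends WriteAheadLog {

  // The WAL contract assumes a record is durable as soon as write()
  // returns; this is exactly the assumption TD describes above, and the
  // one S3 cannot satisfy, since nothing is durable until the stream
  // is closed.
  override def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle = ???

  override def read(handle: WriteAheadLogRecordHandle): ByteBuffer = ???

  override def readAll(): JIterator[ByteBuffer] = ???

  // Delete log data older than threshTime.
  override def clean(threshTime: Long, waitForCompletion: Boolean): Unit = ???

  override def close(): Unit = ???
}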