My understanding of the pluggable WAL [1] was that it eliminates the need for a Hadoop-compatible file system.
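To make "pluggable" concrete, this is the interface from SPARK-7056 that I am implementing (a minimal sketch in Scala; the in-memory store and class names are placeholders for illustration, not a real durable backend):

    import java.nio.ByteBuffer
    import java.util.{Iterator => JIterator}
    import scala.collection.JavaConverters._
    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.util.{WriteAheadLog, WriteAheadLogRecordHandle}

    // Handle returned by write(); Spark later passes it back to read().
    class InMemoryRecordHandle(val index: Long) extends WriteAheadLogRecordHandle

    // Spark instantiates the class named by
    // spark.streaming.receiver.writeAheadLog.class (or the driver variant)
    // reflectively, so a (SparkConf) constructor is enough here.
    class InMemoryWriteAheadLog(conf: SparkConf) extends WriteAheadLog {
      private val records = mutable.LinkedHashMap.empty[Long, (Long, ByteBuffer)]
      private var nextIndex = 0L

      override def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle =
        synchronized {
          val handle = new InMemoryRecordHandle(nextIndex)
          records(nextIndex) = (time, record)
          nextIndex += 1
          handle
        }

      override def read(handle: WriteAheadLogRecordHandle): ByteBuffer =
        synchronized { records(handle.asInstanceOf[InMemoryRecordHandle].index)._2 }

      override def readAll(): JIterator[ByteBuffer] =
        synchronized { records.values.map(_._2).toList.asJava.iterator() }

      // Drop everything written before threshTime, as Spark requests on cleanup.
      override def clean(threshTime: Long, waitForCompletion: Boolean): Unit =
        synchronized { records.retain { case (_, (time, _)) => time >= threshTime } }

      override def close(): Unit = synchronized { records.clear() }
    }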
What is the use of the pluggable WAL when it can only be used together with checkpointing, which still requires a Hadoop-compatible file system?

[1]: https://issues.apache.org/jira/browse/SPARK-7056

On 22 September 2015 at 19:57, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> 1. Currently, the WAL can be used only with checkpointing turned on,
> because it does not make sense to recover from the WAL if there is no
> checkpoint information to recover from.
>
> 2. Since the current implementation saves the WAL in the checkpoint
> directory, they share the same fate -- if the checkpoint directory is
> deleted, then both checkpoint info and WAL info are deleted.
>
> 3. Checkpointing is currently not pluggable. Why do you want that?
>
> On Tue, Sep 22, 2015 at 4:53 PM, Michal Čizmazia <mici...@gmail.com> wrote:
>
>> I am trying to use the pluggable WAL, but it can be used only with
>> checkpointing turned on. Thus I still need to have a Hadoop-compatible
>> file system.
>>
>> Is there something like pluggable checkpointing?
>>
>> Or can the WAL be used without checkpointing? What happens when the WAL
>> is available but the checkpoint directory is lost?
>>
>> Thanks!
>>
>> On 18 September 2015 at 05:47, Tathagata Das <t...@databricks.com> wrote:
>>
>>> I don't think it would work with multipart upload either. The file is
>>> not visible until the multipart upload is explicitly closed. So even if
>>> each write is uploaded as a part, none of the parts are visible until
>>> the multipart upload is closed.
>>>
>>> TD
>>>
>>> On Fri, Sep 18, 2015 at 1:55 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>>>
>>>> > On 17 Sep 2015, at 21:40, Tathagata Das <t...@databricks.com> wrote:
>>>> >
>>>> > Actually, the current WAL implementation (as of Spark 1.5) does not
>>>> > work with S3 because S3 does not support flushing. Basically, the
>>>> > current implementation assumes that after write + flush, the data is
>>>> > immediately durable, and readable if the system crashes without
>>>> > closing the WAL file. This does not work with S3, as data is durable
>>>> > if and only if the S3 file output stream is cleanly closed.
>>>>
>>>> More precisely, unless you turn multipart uploads on, the s3n/s3a
>>>> clients Spark uses *don't even upload anything to S3*.
>>>>
>>>> It's not a filesystem, and you have to bear that in mind.
>>>>
>>>> Amazon's own S3 client used in EMR behaves differently; it may be
>>>> usable as a destination (I haven't tested).
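As an aside on the flush discussion quoted above: the durability assumption TD describes boils down to the following pattern (a sketch against the Hadoop FileSystem API, not Spark's actual WAL writer; the path and payload here are placeholders):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val recordBytes: Array[Byte] = "some record".getBytes("UTF-8") // placeholder payload

    val fs = FileSystem.get(new URI("hdfs:///"), new Configuration())
    val out = fs.create(new Path("/wal/segment-0"))
    out.write(recordBytes)
    // On HDFS, hflush() makes the bytes visible to readers even if the
    // process crashes before close(). On s3n/s3a it buys nothing: the
    // object does not exist in S3 until close() completes the upload.
    out.hflush()
    out.close()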