Re: WAL on S3

2015-09-23 Thread Michal Čizmazia
Thanks Steve! FYI: S3 now supports GET-after-PUT consistency for new objects in all regions, including US Standard: https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 14:56, Michal Čizmazia wrote: To get around the fact that flush does not work in S3, my custom WAL implementation stores a separate S3 object per WriteAheadLog.write call. Do you see any gotchas with this approach? Nothing obvious. The blo…

Re: WAL on S3

2015-09-23 Thread Michal Čizmazia
To get around the fact that flush does not work in S3, my custom WAL implementation stores a separate S3 object per WriteAheadLog.write call. Do you see any gotchas with this approach? On 23 September 2015 at 02:10, Tathagata Das wrote: > Responses inline. > On Tue, Sep 22, 2015 at 8…
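For what it's worth, a minimal sketch of this one-object-per-write idea (not Michal's actual implementation) might look roughly like the following Scala, assuming the AWS SDK for Java. S3RecordHandle, the bucket/prefix arguments, and the key layout are invented for illustration; pagination, retries, and the constructor shape Spark expects when loading the class reflectively are omitted.

    import java.io.ByteArrayInputStream
    import java.nio.ByteBuffer

    import scala.collection.JavaConverters._

    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.model.ObjectMetadata
    import com.amazonaws.util.IOUtils
    import org.apache.spark.streaming.util.{WriteAheadLog, WriteAheadLogRecordHandle}

    // Hypothetical handle type: it just remembers which S3 key holds one record.
    case class S3RecordHandle(key: String, time: Long) extends WriteAheadLogRecordHandle

    class S3ObjectPerWriteAheadLog(bucket: String, prefix: String) extends WriteAheadLog {

      private val s3 = new AmazonS3Client() // credentials from the default provider chain

      // One PUT per record; the key embeds the batch time so clean() can find old records.
      override def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle = {
        val bytes = new Array[Byte](record.remaining())
        record.get(bytes)
        val key = s"$prefix/$time-${java.util.UUID.randomUUID()}"
        val meta = new ObjectMetadata()
        meta.setContentLength(bytes.length)
        s3.putObject(bucket, key, new ByteArrayInputStream(bytes), meta)
        S3RecordHandle(key, time)
      }

      override def read(handle: WriteAheadLogRecordHandle): ByteBuffer = handle match {
        case S3RecordHandle(key, _) =>
          val obj = s3.getObject(bucket, key)
          try ByteBuffer.wrap(IOUtils.toByteArray(obj.getObjectContent)) finally obj.close()
      }

      // LIST + GET everything under the prefix (pagination omitted for brevity).
      override def readAll(): java.util.Iterator[ByteBuffer] = {
        val keys = s3.listObjects(bucket, prefix).getObjectSummaries.asScala.map(_.getKey)
        keys.map(k => read(S3RecordHandle(k, 0L))).iterator.asJava
      }

      // Delete records older than threshTime, relying on the time prefix in the key.
      override def clean(threshTime: Long, waitForCompletion: Boolean): Unit = {
        s3.listObjects(bucket, prefix).getObjectSummaries.asScala
          .filter(_.getKey.stripPrefix(prefix + "/").takeWhile(_ != '-').toLong < threshTime)
          .foreach(summary => s3.deleteObject(bucket, summary.getKey))
      }

      override def close(): Unit = s3.shutdown()
    }

Whether listing under the prefix in readAll() is reliable enough depends on the S3 LIST consistency caveats raised later in the thread.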

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 07:10, Tathagata Das wrote: Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Yes. Because checkpoints are single files by themselves and do not require flush semantics to work…

Re: WAL on S3

2015-09-22 Thread Tathagata Das
Responses inline. On Tue, Sep 22, 2015 at 8:35 PM, Michal Čizmazia wrote: > Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Yes. Because checkpoints are single files by themselves and do not require flush semantics to work. So S3 is fine. > Trying to answer this question, I looke…
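As a small illustration of the point above, pointing the checkpoint directory at S3 is just a matter of passing an S3A URL; the bucket and app name below are hypothetical, and S3A credentials are assumed to be configured in the Hadoop configuration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("checkpoints-on-s3") // hypothetical app name
    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpoints are written as whole files, so they do not depend on flush semantics.
    ssc.checkpoint("s3a://my-bucket/spark/checkpoints") // hypothetical bucket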

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
Can checkpoints be stored to S3 (via S3/S3A Hadoop URL)? Trying to answer this question, I looked into Checkpoint.getCheckpointFiles [1]. It uses findFirstIn, which would probably translate into an S3 LIST operation. S3 LIST is prone to eventual consistency [2]. What would happen when getCheckpoin…

Re: WAL on S3

2015-09-22 Thread Tathagata Das
You can keep the checkpoints in the Hadoop-compatible file system and the WAL somewhere else using your custom WAL implementation. Yes, cleaning up the stuff gets complicated, as it is not as easy as just deleting the checkpoint directory - you will have to clean up the checkpoint directory as well as th…

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
My understanding of pluggable WAL was that it eliminates the need for having a Hadoop-compatible file system [1]. What is the use of pluggable WAL when it can only be used together with checkpointing, which still requires a Hadoop-compatible file system? [1]: https://issues.apache.org/jira/browse/…

Re: WAL on S3

2015-09-22 Thread Tathagata Das
1. Currently, the WAL can be used only with checkpointing turned on, because it does not make sense to recover from the WAL if there is no checkpoint information to recover from. 2. Since the current implementation saves the WAL in the checkpoint directory, they share the same fate -- if the checkpoint direct…
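To make points 1 and 2 concrete, here is a sketch of how the built-in receiver WAL and checkpointing are switched on together; the app name and directory are hypothetical.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("wal-with-checkpointing") // hypothetical
      // Turns on the receiver-side write ahead log.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The built-in WAL lives under this directory, which is why the two share fate.
    ssc.checkpoint("hdfs:///spark/checkpoints") // hypothetical path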

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
I am trying to use pluggable WAL, but it can be used only with checkpointing turned on. Thus I still need to have a Hadoop-compatible file system. Is there something like pluggable checkpointing? Or can the WAL be used without checkpointing? What happens when the WAL is available but the checkpoint director…

Re: WAL on S3

2015-09-18 Thread Tathagata Das
I don't think it would work with multipart upload either. The file is not visible until the multipart upload is explicitly closed. So even if each write is a part upload, all the parts are not visible until the multipart upload is closed. TD On Fri, Sep 18, 2015 at 1:55 AM, Steve Loughran wrote:…
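To spell out the visibility issue, a sketch against the AWS SDK for Java (bucket and key are hypothetical): each uploadPart call transfers data, but no object exists for readers until completeMultipartUpload is called, so a crash before that point loses everything buffered in the upload.

    import java.io.ByteArrayInputStream

    import com.amazonaws.services.s3.AmazonS3Client
    import com.amazonaws.services.s3.model._

    val s3 = new AmazonS3Client() // credentials from the default provider chain
    val bucket = "my-bucket"      // hypothetical
    val key = "wal/segment-0001"  // hypothetical

    val init = s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key))

    val bytes = "one WAL record".getBytes("UTF-8")
    val part = s3.uploadPart(new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withUploadId(init.getUploadId)
      .withPartNumber(1)
      .withInputStream(new ByteArrayInputStream(bytes))
      .withPartSize(bytes.length.toLong))

    // Until this call, no reader can GET the object at all.
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
      bucket, key, init.getUploadId, java.util.Collections.singletonList(part.getPartETag)))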

Re: WAL on S3

2015-09-18 Thread Steve Loughran
> On 17 Sep 2015, at 21:40, Tathagata Das wrote: > Actually, the current WAL implementation (as of Spark 1.5) does not work with S3 because S3 does not support flushing. Basically, the current implementation assumes that after write + flush, the data is immediately durable, and readab…

Re: WAL on S3

2015-09-17 Thread Tathagata Das
You could override the Spark conf "spark.streaming.receiver.writeAheadLog.class" with the class name. https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogUtils.scala#L30 On Thu, Sep 17, 2015 at 2:04 PM, Michal Čizmazia wrote:…
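Concretely, something like the following, where com.example.S3WriteAheadLog stands in for whatever class implements org.apache.spark.streaming.util.WriteAheadLog (the class name is hypothetical, and the class is instantiated reflectively, so it needs a suitable public constructor). The receiver key is the one mentioned above; the driver-side key shown alongside comes from the same WriteAheadLogUtils file.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("custom-wal") // hypothetical
      // Overrides the WAL used on the receiver side.
      .set("spark.streaming.receiver.writeAheadLog.class", "com.example.S3WriteAheadLog")
      // The driver-side WAL class can be overridden separately.
      .set("spark.streaming.driver.writeAheadLog.class", "com.example.S3WriteAheadLog")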

Re: WAL on S3

2015-09-17 Thread Michal Čizmazia
Please could you explain how to use pluggable WAL? After I implement the WriteAheadLog abstract class, how can I use it? I want to use it with a custom reliable receiver. I am using Spark 1.4.1. Thanks! On 17 September 2015 at 16:40, Tathagata Das wrote: > Actually, the current WAL implementa…

Re: WAL on S3

2015-09-17 Thread Tathagata Das
Actually, the current WAL implementation (as of Spark 1.5) does not work with S3 because S3 does not support flushing. Basically, the current implementation assumes that after write + flush, the data is immediately durable, and readable if the system crashes without closing the WAL file. This does…
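For context, the durability assumption being described is roughly the hflush contract of Hadoop's FSDataOutputStream, which HDFS honours but the S3 file systems do not. A sketch of that pattern (path hypothetical):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val path = new Path("hdfs:///spark/wal/segment-0001") // hypothetical
    val fs = FileSystem.get(path.toUri, new Configuration())

    val out = fs.create(path)
    out.write("one WAL record".getBytes("UTF-8"))
    // On HDFS, hflush makes these bytes readable even if the writer crashes before close();
    // on S3-backed file systems the data only becomes an object once the stream is closed.
    out.hflush()
    out.close()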

Re: WAL on S3

2015-09-17 Thread Ted Yu
I assume you don't use Kinesis. Are you running Spark 1.5.0? If you must use S3, is switching to Kinesis possible? Cheers On Thu, Sep 17, 2015 at 1:09 PM, Michal Čizmazia wrote: > How to make Write Ahead Logs work with S3? Any pointers welcome! > It seems to be a known issue: > https://iss…

WAL on S3

2015-09-17 Thread Michal Čizmazia
How to make Write Ahead Logs work with S3? Any pointers welcome! It seems to be a known issue: https://issues.apache.org/jira/browse/SPARK-9215 I am getting this exception when reading the write ahead log: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:…