We are using "spark-1.4.1-bin-hadoop2.4" on Mesos (not EMR) with S3 to read and write data and haven't noticed any inconsistencies with it, so 1 (mostly) and 2 (definitely) should not be a problem. Regarding 3, are you setting the file system impl in the Spark config?

sparkContext.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");

And I have these dependencies if that helps.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.4.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.4.1</version>
</dependency>

-Utkarsh

On Mon, Sep 21, 2015 at 7:13 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Amit,
>
> Have you looked at Amazon EMR? Most people using EMR use s3 for
> persistency (both as input and output of spark jobs).
>
> Best Regards,
>
> Jerry
>
> Sent from my iPhone
>
> On 21 Sep, 2015, at 9:24 pm, Amit Ramesh <a...@yelp.com> wrote:
>
>
> A lot of places in the documentation mention using s3 for checkpointing,
> however I haven't found any examples or concrete evidence of anyone having
> done this.
>
>    1. Is this a safe/reliable option given the read-after-write
>    consistency for PUTS in s3?
>    2. Is s3 access broken for hadoop 2.6 (SPARK-7442
>    <https://issues.apache.org/jira/browse/SPARK-7442>)? If so, is it
>    viable in 2.4?
>    3. Related to #2. I did try providing hadoop-aws-2.6.0.jar while
>    submitting the job and got the following stack trace. Is there a fix?
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
>     at java.util.ServiceLoader.fail(ServiceLoader.java:224)
>     at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
>     at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
>     at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>     at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2563)
>     at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2574)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>     at org.apache.spark.SparkContext.addFile(SparkContext.scala:1354)
>     at org.apache.spark.SparkContext.addFile(SparkContext.scala:1332)
>     at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:475)
>     at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:475)
>     at scala.collection.immutable.List.foreach(List.scala:318)
>     at org.apache.spark.SparkContext.<init>(SparkContext.scala:475)
>     at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:214)
>     at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
>     at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
>     at java.lang.Class.getDeclaredConstructors0(Native Method)
>     at java.lang.Class.privateGetDeclaredConstructors(Class.java:2585)
>     at java.lang.Class.getConstructor0(Class.java:2885)
>     at java.lang.Class.newInstance(Class.java:350)
>     at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>     ... 27 more
> Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>     ... 32 more
>
> Thanks!
> Amit
>
>

--
Thanks,
-Utkarsh
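
A small diagnostic sketch for the quoted stack trace, not part of the original thread. The innermost cause is a ClassNotFoundException for com.amazonaws.AmazonServiceException, which suggests that hadoop-aws-2.6.0.jar was on the classpath without the AWS SDK jar it needs. Assuming that reading is right, a check like the one below, run with the same classpath as the job, would show which of the two classes is missing. The class name ClasspathCheck and the helper isOnClasspath are hypothetical names chosen for illustration; only the two class names in main come from the trace.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised by a missing dependency.
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in the quoted message.
        String[] required = {
            "org.apache.hadoop.fs.s3a.S3AFileSystem",
            "com.amazonaws.AmazonServiceException"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the second class is reported missing, adding the AWS SDK jar alongside hadoop-aws when submitting the job is the likely next step; which SDK version matches a given hadoop-aws release should be confirmed from that release's own dependency list.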