You can use the Spark 1.5.1 "without Hadoop" build together with Hadoop 2.7.1. Hadoop 2.7.1's s3a support is more mature. You also need to add the Hadoop tools directory to the Hadoop classpath.
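Concretely, a minimal conf/spark-env.sh sketch of that setup, assuming the "without Hadoop" Spark 1.5.1 tarball and a Hadoop 2.7.1 install at a placeholder path /opt/hadoop-2.7.1 (SPARK_DIST_CLASSPATH is how the Hadoop-free Spark builds pick up an external Hadoop, and share/hadoop/tools/lib is where the hadoop-aws and aws-java-sdk jars live in 2.7.1):

  # conf/spark-env.sh (sketch only; /opt/hadoop-2.7.1 is a placeholder path)
  export HADOOP_HOME=/opt/hadoop-2.7.1

  # Base Hadoop jars for the "without Hadoop" Spark build
  export SPARK_DIST_CLASSPATH="$("${HADOOP_HOME}/bin/hadoop" classpath)"

  # Append the Hadoop tools dir, which carries hadoop-aws and the AWS SDK
  export SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}:${HADOOP_HOME}/share/hadoop/tools/lib/*"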
Raghav

On Oct 16, 2015 1:09 AM, "Scott Reynolds" <sreyno...@twilio.com> wrote:

> We do not use EMR. This is deployed on Amazon VMs.
>
> We build Spark with Hadoop-2.6.0, but that does not include the s3a
> filesystem nor the Amazon AWS SDK.
>
> On Thu, Oct 15, 2015 at 12:26 PM, Spark Newbie <sparknewbie1...@gmail.com>
> wrote:
>
>> Are you using EMR?
>> You can install Hadoop-2.6.0 along with Spark-1.5.1 in your EMR cluster.
>> That brings the s3a jars to the worker nodes, and they become available
>> to your application.
>>
>> On Thu, Oct 15, 2015 at 11:04 AM, Scott Reynolds <sreyno...@twilio.com>
>> wrote:
>>
>>> List,
>>>
>>> Right now we build our Spark jobs with the s3a Hadoop client. We do
>>> this because our machines are only allowed to use IAM access to the s3
>>> store. We can build our jars with the s3a filesystem and the AWS SDK
>>> just fine, and these jars run great in *client mode*.
>>>
>>> We would like to move from client mode to cluster mode, as that will
>>> allow us to be more resilient to driver failure. In order to do this,
>>> either:
>>> 1. the jar file has to be on the worker's local disk, or
>>> 2. the jar file is in shared storage (s3a).
>>>
>>> We would like to put the jar file in s3 storage, but when we give the
>>> jar path as s3a://......, the worker node doesn't have the Hadoop s3a
>>> and AWS SDK in its classpath / uber jar.
>>>
>>> Other than building Spark with those two dependencies, what other
>>> options do I have? We are using 1.5.1, so SPARK_CLASSPATH is no longer
>>> a thing.
>>>
>>> We need s3a access both on the master (so that we can log the Spark
>>> event log to s3) and in the worker processes (driver, executor).
>>>
>>> Looking for ideas before just adding the dependencies to our Spark
>>> build and calling it a day.
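For the original question in the quoted thread, one sketch that avoids rebuilding Spark is to stage the hadoop-aws and aws-java-sdk jars on every node and point the driver and executor extraClassPath settings at them. The /opt/spark/extra-jars directory and the bucket name below are placeholders, and the IAM piece relies on the Hadoop 2.7.x s3a credential chain falling back to the EC2 instance profile when no keys are configured:

  # conf/spark-defaults.conf (sketch; paths and bucket names are placeholders)
  # hadoop-aws + aws-java-sdk jars copied into this dir on every node
  spark.driver.extraClassPath      /opt/spark/extra-jars/*
  spark.executor.extraClassPath    /opt/spark/extra-jars/*

  # Make s3a:// URIs resolve to the s3a filesystem
  spark.hadoop.fs.s3a.impl         org.apache.hadoop.fs.s3a.S3AFileSystem

  # Event log written to s3a, so the master / history server side needs s3a too
  spark.eventLog.enabled           true
  spark.eventLog.dir               s3a://my-spark-logs/events

Staging the jars locally sidesteps fetching the application jar itself over s3a; whether a standalone worker can fetch an s3a:// jar path also depends on the worker JVM's own classpath, which is where the SPARK_DIST_CLASSPATH route above helps.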