On 29 Jul 2016, at 00:07, Everett Anderson <ever...@nuna.com.INVALID> wrote:
Hey, just wrapping this up -- I ended up following the instructions <https://spark.apache.org/docs/1.6.2/building-spark.html> to build a custom Spark release with Hadoop 2.7.2, stealing from Steve's SPARK-7481 PR a bit, in order to get Spark 1.6.2 + Hadoop 2.7.2 + the hadoop-aws library (which pulls in the proper AWS Java SDK dependency). Now that there's an official Spark 2.0 + Hadoop 2.7.x release, this is probably no longer necessary, but I haven't tried it yet.

You still need to get the hadoop-aws and compatible JARs into your lib dir; the SPARK-7481 patch does that, and also gets the hadoop-aws JAR into the spark-assembly JAR, something which isn't directly relevant for Spark 2. The PR is still tagged as WiP pending the release of Hadoop 2.7.3, which will swallow classload exceptions when enumerating the filesystem clients declared in JARs. Without that, the presence of hadoop-aws or hadoop-azure on the classpath *without the matching Amazon or Azure JARs* will cause startup to fail.

With the custom release, s3a paths work fine with EC2 role credentials without doing anything special. The only thing I had to do was add this extra --conf flag to spark-submit in order to write to encrypted S3 buckets:

    --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256

I'd really like to know what performance difference you are seeing working with server-side encryption and different file formats; can you do any tests using encrypted and unencrypted copies of the same datasets and see how the times come out?
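For illustration, a full spark-submit invocation with that flag might look like the following; the class, jar, and bucket names here are made-up placeholders, not from the original setup:

    spark-submit \
      --master yarn \
      --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 \
      --class com.example.MyJob \
      my-job.jar \
      s3a://my-encrypted-bucket/output/

Note that with EC2 role credentials, as mentioned above, no fs.s3a.access.key / fs.s3a.secret.key settings are needed; the AWS SDK picks up the instance profile credentials on its own.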
Full instructions for building on Mac are here:

1) Download the Spark 1.6.2 source from https://spark.apache.org/downloads.html

2) Install R:

    brew tap homebrew/science
    brew install r

3) Set JAVA_HOME and MAVEN_OPTS as in the instructions.

4) Modify the root pom.xml to add a hadoop-2.7 profile (mostly stolen from Spark 2.0):

    <profile>
      <id>hadoop-2.7</id>
      <properties>
        <hadoop.version>2.7.2</hadoop.version>
        <jets3t.version>0.9.3</jets3t.version>
        <zookeeper.version>3.4.6</zookeeper.version>
        <curator.version>2.6.0</curator.version>
      </properties>
      <dependencyManagement>
        <dependencies>
          <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>${hadoop.version}</version>
            <scope>${hadoop.deps.scope}</scope>
            <exclusions>
              <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
              </exclusion>
              <exclusion>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
              </exclusion>
            </exclusions>
          </dependency>
        </dependencies>
      </dependencyManagement>
    </profile>

5) Modify core/pom.xml to include the corresponding hadoop-aws and AWS SDK libs:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

6) Build:

    ./make-distribution.sh --name custom-hadoop-2.7-2-aws-s3a --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
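As a sanity check on the result before deploying, you can confirm the AWS JARs actually made it into the distribution's lib dir. The tarball name below assumes make-distribution.sh's usual spark-<version>-bin-<name>.tgz output, and the exact jar versions will vary:

    tar xzf spark-1.6.2-bin-custom-hadoop-2.7-2-aws-s3a.tgz
    ls spark-1.6.2-bin-custom-hadoop-2.7-2-aws-s3a/lib | grep -iE 'hadoop-aws|aws-java-sdk'

Then a quick smoke test from spark-shell against some bucket you can read, e.g. sc.textFile("s3a://some-bucket/some-file.txt").count(), will surface any missing-class or credentials problems immediately rather than mid-job.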