On 29 Jul 2016, at 00:07, Everett Anderson <ever...@nuna.com.INVALID> wrote:
Hey, just wrapping this up -- I ended up following the instructions <https://spark.apache.org/docs/1.6.2/building-spark.html> to build a custom Spark release with Hadoop 2.7.2, stealing from Steve's SPARK-7481 PR a bit, in order to get Spark 1.6.2 + Hadoop 2.7.2 + the hadoop-aws library (which pulls in the proper AWS Java SDK dependency). Now that there's an official Spark 2.0 + Hadoop 2.7.x release, this is probably no longer necessary, but I haven't tried it yet.

You still need to get the hadoop-aws and compatible JARs into your lib dir; the SPARK-7481 patch does that, and also gets the hadoop-aws JAR into the spark-assembly JAR, something which isn't directly relevant for Spark 2. The PR is still tagged as WiP pending the release of Hadoop 2.7.3, which will swallow classload exceptions when enumerating the filesystem clients declared in JARs. Without that, the presence of hadoop-aws or hadoop-azure on the classpath *without the matching Amazon or Azure JARs* will cause startup to fail.

With the custom release, s3a paths work fine with EC2 role credentials without doing anything special. The only thing I had to do was add this extra --conf flag to spark-submit in order to write to encrypted S3 buckets:

    --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256

I'd really like to know what performance difference you are seeing working with server-side encryption and different file formats; can you do any tests using encrypted and unencrypted copies of the same datasets and see how the times come out?
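For illustration, a full spark-submit invocation with that flag might look like the following; the class, jar, and bucket names here are made-up placeholders, not from the original setup:

    spark-submit \
      --master yarn \
      --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 \
      --class com.example.MyJob \
      my-job.jar \
      s3a://my-encrypted-bucket/output/

Note that with EC2 role credentials, as mentioned above, no fs.s3a.access.key / fs.s3a.secret.key settings are needed; the AWS SDK picks up the instance profile credentials on its own.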
Full instructions for building on Mac are here:

1) Download the Spark 1.6.2 source from https://spark.apache.org/downloads.html

2) Install R:

    brew tap homebrew/science
    brew install r

3) Set JAVA_HOME and MAVEN_OPTS as in the instructions.

4) Modify the root pom.xml to add a hadoop-2.7 profile (mostly stolen from Spark 2.0):

    <profile>
      <id>hadoop-2.7</id>
      <properties>
        <hadoop.version>2.7.2</hadoop.version>
        <jets3t.version>0.9.3</jets3t.version>
        <zookeeper.version>3.4.6</zookeeper.version>
        <curator.version>2.6.0</curator.version>
      </properties>
      <dependencyManagement>
        <dependencies>
          <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>${hadoop.version}</version>
            <scope>${hadoop.deps.scope}</scope>
            <exclusions>
              <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
              </exclusion>
              <exclusion>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
              </exclusion>
            </exclusions>
          </dependency>
        </dependencies>
      </dependencyManagement>
    </profile>

5) Modify core/pom.xml to include the corresponding hadoop-aws and AWS SDK libs:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

6) Build:

    ./make-distribution.sh --name custom-hadoop-2.7-2-aws-s3a --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
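As a sanity check on the result before deploying, you can confirm the AWS JARs actually made it into the distribution's lib dir. The tarball name below assumes make-distribution.sh's usual spark-<version>-bin-<name>.tgz output, and the exact jar versions will vary:

    tar xzf spark-1.6.2-bin-custom-hadoop-2.7-2-aws-s3a.tgz
    ls spark-1.6.2-bin-custom-hadoop-2.7-2-aws-s3a/lib | grep -iE 'hadoop-aws|aws-java-sdk'

Then a quick smoke test from spark-shell against some bucket you can read, e.g. sc.textFile("s3a://some-bucket/some-file.txt").count(), will surface any missing-class or credentials problems immediately rather than mid-job.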