S3a isn't ready for production use on anything below Hadoop 2.7.0. I say that as the person who mentored in all the patches for it between Hadoop 2.6 & 2.7
You need everything in https://issues.apache.org/jira/browse/HADOOP-11571 in your code:

- Hadoop 2.6.0 doesn't have any of the HADOOP-11571 patches in it.
- Hadoop 2.7.0 does, but I'd hold off until 2.7.1 comes out; we may fix a couple of late-breaking issues that are surfacing, related to reads of many-GB files and error recovery.
- HDP2.2 doesn't mention S3a, and if anyone asks we'll say "don't".
- CDH5.3 does mention S3a, but if anyone asks me I'll say "don't".
- I don't know about CDH5.4.

See also https://issues.apache.org/jira/browse/HADOOP-11694 for things to be aware of.

On 23 Apr 2015, at 00:04, Daniel Mahler <dmah...@gmail.com> wrote:

> I would like to easily launch a cluster that supports s3a file systems.
> If I launch a cluster with `spark-ec2 --hadoop-major-version=2`, what
> determines the minor version of Hadoop? Does it depend on the Spark
> version being launched? Are there other allowed values for
> --hadoop-major-version besides 1 and 2?

You can build your own Spark binaries:

  mvn install -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 -Dhbase.profile=hadoop2 -DskipTests

Make sure that hadoop-aws and its dependencies are in there.

If you do that, and find s3a problems, try to isolate them between Spark and Hadoop and file the JIRAs against the relevant project (a minimal smoke test is sketched below).

-Steve
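PS: a minimal smoke test, in Scala against the Spark 1.x API, for checking whether an s3a-enabled build actually works. The bucket name, the path, and the use of environment variables for credentials are placeholders; fs.s3a.access.key and fs.s3a.secret.key are the standard S3A configuration properties, but adapt the credential handling to whatever you use.

  // Minimal S3A smoke test for a Spark 1.x build.
  // Assumes hadoop-aws and the AWS SDK jars are on the classpath;
  // the bucket and path below are placeholders.
  import org.apache.spark.{SparkConf, SparkContext}

  object S3aSmokeTest {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("S3aSmokeTest"))

      // S3A reads credentials from fs.s3a.* properties; here they are
      // pulled from the environment, but other mechanisms work too.
      sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

      // If this throws "No FileSystem for scheme: s3a" or a
      // ClassNotFoundException, hadoop-aws (or one of its dependencies)
      // didn't make it into the build.
      val lines = sc.textFile("s3a://your-bucket/path/to/data/*.txt")
      println(s"line count: ${lines.count()}")

      sc.stop()
    }
  }

If the count comes back, basic reads work. Note that the class of problems mentioned above tends to show up only on many-GB files and under error conditions, so a passing smoke test is not a production sign-off.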