S3a isn't ready for production use on anything below Hadoop 2.7.0. I say that 
as the person who mentored in all the patches for it between Hadoop 2.6 & 2.7

You need everything in https://issues.apache.org/jira/browse/HADOOP-11571 in your code.

- Hadoop 2.6.0 doesn't have any of the HADOOP-11571 patches.
- Hadoop 2.7.0 does, but I'd hold off until 2.7.1 comes out; we may fix a couple of late-breaking issues that are surfacing around reads of many-GB files and error recovery.
- HDP 2.2 doesn't mention S3a, and if anyone asks we'll say "don't".
- CDH 5.3 does mention S3a, but if anyone asks me I'll say "don't".
- I don't know about CDH 5.4.
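
If you want to check what a cluster actually ships, here's a quick look from the shell; the bucket name below is just a placeholder, and the credential properties are the standard fs.s3a.access.key / fs.s3a.secret.key pair:

  # print the Hadoop version in use
  hadoop version

  # see whether hadoop-aws and the AWS SDK are on the classpath
  # (may print wildcard directories rather than individual jars)
  hadoop classpath | tr ':' '\n' | grep -i -e hadoop-aws -e aws-java-sdk

  # smoke test against a bucket of yours, once credentials are set in core-site.xml
  hadoop fs -ls s3a://your-bucket/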

See also https://issues.apache.org/jira/browse/HADOOP-11694 for other things to be aware of.


On 23 Apr 2015, at 00:04, Daniel Mahler <dmah...@gmail.com> wrote:

I would like to easily launch a cluster that supports s3a file systems.

If I launch a cluster with `spark-ec2 --hadoop-major-version=2`, what determines the minor version of Hadoop?

Does it depend on the Spark version being launched?

Are there other allowed values for --hadoop-major-version besides 1 and 2?


You can build your own Spark binaries:

mvn install -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 \
    -Dhadoop.version=2.7.0 -Dhbase.profile=hadoop2 -DskipTests

Make sure that hadoop-aws and its dependencies (the AWS SDK among them) end up in the assembly.
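
One way to verify; the assembly jar path below assumes the default Spark 1.x layout, so adjust it for your build and Scala version:

  # look for the s3a filesystem class and the AWS SDK inside the assembly jar
  jar tf assembly/target/scala-2.10/spark-assembly-*.jar | grep -i -e 'fs/s3a' -e 'amazonaws'

If neither shows up, hadoop-aws didn't make it into the assembly.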

If you do that and find S3a problems, try to isolate them between Spark and Hadoop, and file the JIRAs against the relevant project.
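
A rough way to do that isolation, with placeholder bucket/path names: if the plain Hadoop CLI fails the same way, it's a Hadoop issue; if only the Spark read fails, it's a Spark one.

  # step 1: hit the object store with the Hadoop tools alone
  hadoop fs -ls s3a://your-bucket/path
  hadoop fs -copyToLocal s3a://your-bucket/path/big-file /tmp/

  # step 2: the same read through Spark; pipe a one-liner into spark-shell
  echo 'sc.textFile("s3a://your-bucket/path/big-file").count()' | spark-shell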

-Steve
