On Wed, Apr 20, 2016 at 1:35 AM, Michael-Keith Bernard <mkbern...@opentable.com> wrote:
> We're running on self-managed EC2 instances (and we'll eventually have a
> mirror cluster in our colo). The provided documentation notes that for Hadoop
> 2.6, we'd need such-and-such version of hadoop-aws and guice on the CP. If I
> wanted to instead use Hadoop 2.7, which versions of those dependencies should
> I get? And how can I look that up myself? The pom file for hadoop-aws[1]
> doesn't mention a specific dependency on Guice, so I'm curious how the author
> of that documentation knew exactly the dependencies and versions required.

Hey Michael-Keith,

I think you meant Guava, not Guice.

Determining which dependencies you need is quite a mess at the moment. It
depends on a combination of three things:

1) the dependencies of hadoop-aws [1],
2) which S3 file system you use (in the docs,
   org.apache.hadoop.fs.s3native.NativeS3FileSystem) [2],
3) what Flink shades away in its Hadoop dependencies [3].

In detail:

1) hadoop-aws depends on hadoop-common (and other packages). hadoop-common is
   already part of Flink (including the fs.FileSystem classes etc.).

2) NativeS3FileSystem uses classes from hadoop-common, like FileSystem, and
   from hadoop-aws, like Jets3tNativeFileSystemStore. The hadoop-common classes
   ship with Flink, so only hadoop-aws has to be added for this file system.
   The big issue here is that other S3 FS implementations might additionally
   rely on the aws-java-sdk packages that hadoop-aws depends on.

3) Flink shades Hadoop's Guava dependency away, which is why you need to add
   Guava manually to the classpath.

So, if you go for the suggested NativeS3FileSystem, you end up needing
hadoop-aws and Guava. Of course, this might change in future versions of Flink
and/or Hadoop.

I will update the docs for the different versions of Flink and Hadoop for now
and hope that this will help. :-(

The easiest solution in the future would be for Flink to ship with hadoop-aws,
but I don't think that this is going to happen.

– Ufuk

[1] http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.6.0
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html#provide-s3-filesystem-dependency
[3] https://github.com/apache/flink/blob/master/flink-shaded-hadoop/pom.xml
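
As a rough way to check this yourself: the sketch below scans the jars in a
directory and reports which of them provide the classes discussed above. It is
only an illustration under a few assumptions. The helper names
(jars_providing, report) are made up for this example, the probe class
com.google.common.base.Preconditions is just one convenient unshaded Guava
class, and the directory you pass in is whatever folder you put on Flink's
classpath. It does not resolve versions or transitive dependencies.

```python
import zipfile
from pathlib import Path

# Classes named in this thread, plus one Guava class used as a probe
# (assumption: any unshaded Guava class would serve equally well).
REQUIRED_CLASSES = [
    "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
    "org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore",
    "com.google.common.base.Preconditions",
]


def jars_providing(class_name, jar_paths):
    """Return the jars that contain the given fully qualified class."""
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in jar_paths:
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(str(jar))
        except (zipfile.BadZipFile, OSError):
            # Skip unreadable or corrupt archives instead of failing.
            continue
    return hits


def report(lib_dir):
    """Map each required class to the jars in lib_dir that provide it."""
    jars = sorted(Path(lib_dir).glob("*.jar"))
    return {cls: jars_providing(cls, jars) for cls in REQUIRED_CLASSES}
```

Calling report() on your jar directory returns a dict; a class that maps to an
empty list is not provided by any jar there. Since Flink relocates the Guava
classes it shades, the unshaded com.google.common name should only show up once
you have added a Guava jar yourself.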