Hey Michael-Keith, are you running self-managed EC2 instances or EMR?
In addition to what Till said: we tried to document this here as well:
https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#provide-s3-filesystem-dependency

Does this help? You don't really need to install Hadoop, but only provide the configuration and the S3 FileSystem code on your classpath. If you use EMR + Flink on YARN, it should work out of the box.

– Ufuk

On Tue, Apr 19, 2016 at 10:23 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> Hi Michael-Keith,
>
> you can use S3 as the checkpoint directory for the filesystem state backend.
> This means that whenever a checkpoint is performed, the state data will be
> written to this directory.
>
> The same holds true for the ZooKeeper recovery storage directory. This
> directory will contain the submitted but not yet finished jobs, as well as
> some metadata for the checkpoints. With this information it is possible to
> restore running jobs if the JobManager dies.
>
> As far as I know, Flink relies on Hadoop's file system wrapper classes to
> support S3. Flink has built-in support for HDFS, MapRFS, and the local file
> system. For everything else, Flink tries to find a Hadoop class. Therefore,
> I fear that you need at least Hadoop's S3 filesystem class on your classpath,
> plus a file called core-site.xml or hdfs-site.xml stored at the location
> specified by fs.hdfs.hdfsdefault in Flink's configuration. In one of these
> files you have to create the XML tag that specifies the filesystem class.
> The easiest way, though, would be to simply install Hadoop.
>
> I'm not aware of any Puppet scripts, but I might be missing something here.
> If you complete a Puppet script, it would definitely be a valuable
> addition to Flink :-)
>
> Cheers,
> Till
>
> On Tue, Apr 19, 2016 at 3:54 AM, Michael-Keith Bernard
> <mkbern...@opentable.com> wrote:
>>
>> Hello Flink Users!
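To make Till's steps concrete, a minimal sketch of the two files he describes follows. This assumes the Hadoop-bundled S3N filesystem (`org.apache.hadoop.fs.s3native.NativeS3FileSystem`); the file path, bucket, and credential values are placeholders, and the exact key names should be checked against the linked AWS setup page for your Flink version. First, point Flink at the Hadoop configuration file in `flink-conf.yaml`:

```yaml
# flink-conf.yaml: location of the Hadoop configuration file
# (key name as referenced in Till's mail; the path is a placeholder)
fs.hdfs.hdfsdefault: /etc/flink/core-site.xml
```

Then register the S3 filesystem class and credentials in that `core-site.xml`:

```xml
<!-- core-site.xml: register the S3N filesystem wrapper and credentials -->
<configuration>
  <property>
    <name>fs.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With this in place, `s3n://bucket/path` URIs become usable anywhere Flink accepts a filesystem path, without a full Hadoop installation.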
>>
>> I'm a Flink newbie at the early stages of deploying our first Flink
>> cluster into production, and I have a few questions about wiring up Flink
>> with S3:
>>
>> * We are going to use the HA configuration[1] from day one (we have
>> existing ZooKeeper infrastructure already). Can S3 be used as a state
>> backend for the JobManager? The documentation talks about using S3 as a
>> state backend for TaskManagers[2] (in particular for streaming), but I'm
>> wondering whether it's a suitable backend for the JobManager as well.
>>
>> * How do I configure S3 for Flink when I don't already have an existing
>> Hadoop cluster? The documentation references the Hadoop configuration
>> manifest[3], which implies to me that I must already be running Hadoop
>> (or at least have a properly configured Hadoop cluster). Is there an
>> example somewhere of using S3 as a storage backend for a standalone
>> cluster?
>>
>> * Bonus: I'm writing a Puppet module for installing/configuring/managing
>> Flink in standalone mode with an existing ZooKeeper cluster. Are there
>> any existing modules for this (I didn't find anything in the Forge)?
>> Would others in the community be interested if we added our module to
>> the Forge once complete?
>>
>> Thanks so much for your time and consideration. We look forward to using
>> Flink in production!
>>
>> Cheers,
>> Michael-Keith
>>
>> [1]: https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#standalone-cluster-high-availability
>> [2]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#s3-simple-storage-service
>> [3]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#set-s3-filesystem
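Tying the answers above back to these questions: once the S3 filesystem is on the classpath, both the checkpoint directory and the ZooKeeper recovery storage directory can point at S3 in a standalone HA cluster, so S3 serves JobManager recovery metadata as well as TaskManager state. A `flink-conf.yaml` sketch, assuming Flink 1.0-era `recovery.*` key names (renamed to `high-availability.*` in later releases) and placeholder bucket and ZooKeeper host names:

```yaml
# flink-conf.yaml sketch (Flink 1.0-era keys; bucket/host names are placeholders)

# Standalone HA via the existing ZooKeeper quorum
recovery.mode: zookeeper
recovery.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
# JobManager recovery metadata (submitted jobs, checkpoint pointers) goes to S3
recovery.zookeeper.storageDir: s3n://my-flink-bucket/recovery

# Filesystem state backend checkpointing to S3
state.backend: filesystem
state.backend.fs.checkpointdir: s3n://my-flink-bucket/checkpoints
```

Note that ZooKeeper itself only stores pointers; the bulk of the recovery data lives in the S3 `storageDir`, which is why an existing ZooKeeper ensemble plus an S3 bucket is enough for HA without HDFS.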