Hey Till & Ufuk,

We're running on self-managed EC2 instances (and we'll eventually have a mirror cluster in our colo). The provided documentation notes that for Hadoop 2.6, we'd need a particular version of hadoop-aws and guice on the classpath. If I wanted to use Hadoop 2.7 instead, which versions of those dependencies should I get? And how can I look that up myself? The POM file for hadoop-aws[1] doesn't mention a specific dependency on Guice, so I'm curious how the author of that documentation knew exactly which dependencies and versions were required.
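For anyone wanting to check what a POM declares directly, here is a minimal sketch in Python. It assumes the POM text is already available as a string; `SAMPLE_POM` and all coordinates in it are hypothetical and are not the real hadoop-aws POM, and `describe_pom` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical, trimmed-down POM for illustration only (not the hadoop-aws POM).
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.example</groupId>
    <artifactId>example-parent</artifactId>
    <version>1.0.0</version>
  </parent>
  <artifactId>example-module</artifactId>
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>example-core</artifactId>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>example-sdk</artifactId>
      <version>2.3.4</version>
    </dependency>
  </dependencies>
</project>"""


def describe_pom(pom_xml):
    """Return (parent coordinates, directly declared dependencies) of a POM."""
    root = ET.fromstring(pom_xml)

    # The parent POM is where versions are often managed, so report it too.
    parent = root.find("m:parent", POM_NS)
    parent_coords = None
    if parent is not None:
        parent_coords = ":".join(
            parent.findtext("m:" + tag, default="?", namespaces=POM_NS)
            for tag in ("groupId", "artifactId", "version")
        )

    # Only <dependencies> directly under <project>; transitive ones are not listed here.
    deps = []
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        deps.append({
            tag: dep.findtext("m:" + tag, default=None, namespaces=POM_NS)
            for tag in ("groupId", "artifactId", "version", "scope")
        })
    return parent_coords, deps


if __name__ == "__main__":
    parent, deps = describe_pom(SAMPLE_POM)
    print("parent:", parent)
    for d in deps:
        # A missing <version> usually means the version is managed elsewhere.
        print(d["groupId"], d["artifactId"], d["version"] or "(no version here)", d["scope"] or "")
```

This only shows direct declarations. A dependency with no `<version>` typically has its version managed in a parent POM, and a library that is pulled in transitively would not appear at all; `mvn dependency:tree` is the usual tool for seeing the fully resolved set.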
Let me switch my questioning slightly: What is the best (most widely supported, most common, easiest to use, easiest to scale, etc.) way to deploy Flink today? I've been operating under the assumption that, since we have no existing Hadoop infrastructure, the path of least resistance is a standalone cluster. However, it seems like Flink is still relatively tightly coupled to the Hadoop platform, so maybe I would be better off switching to Hadoop + YARN?

Our requirements are simple (for now): Kafka (consumer & producer), S3 (read & write), and streaming- and batch-mode computation.

If the answer turns out to be that YARN is the best path forward for us, do you have any recommendations on how to get started building a minimal but production-ready Hadoop cluster suitable for Flink? Ambari looks amazing, so barring feedback to the contrary I'll probably be investing time looking at that first.

Finally, any relevant book recommendations? :)

I'm extremely excited about this project, so all the feedback I can get is highly welcome and highly appreciated!

Cheers,
Michael-Keith

P.S. Is there planned support for Mesos as an alternative scheduler to YARN?

[1]: http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.2/hadoop-aws-2.7.2.pom

________________________________________
From: Ufuk Celebi <u...@apache.org>
Sent: Tuesday, April 19, 2016 2:30 AM
To: user@flink.apache.org
Subject: Re: Flink + S3

Hey Michael-Keith,

are you running self-managed EC2 instances or EMR?

In addition to what Till said: We tried to document this here as well: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#provide-s3-filesystem-dependency

Does this help? You don't really need to install Hadoop; you only need to provide the configuration and the S3 FileSystem code on your classpath.

If you use EMR + Flink on YARN, it should work out of the box.
– Ufuk

On Tue, Apr 19, 2016 at 10:23 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> Hi Michael-Keith,
>
> you can use S3 as the checkpoint directory for the filesystem state backend.
> This means that whenever a checkpoint is performed the state data will be
> written to this directory.
>
> The same holds true for the zookeeper recovery storage directory. This
> directory will contain the submitted and not yet finished jobs as well as
> some meta data for the checkpoints. With this information it is possible to
> restore running jobs if the job manager dies.
>
> As far as I know, Flink relies on Hadoop's file system wrapper classes to
> support S3. Flink has built in support for hdfs, maprfs and the local file
> system. For everything else, Flink tries to find a Hadoop class. Therefore,
> I fear that you need at least Hadoop's s3 filesystem class in your classpath
> and a file called core-site.xml or hdfs-site.xml which is stored at a
> location specified by fs.hdfs.hdfsdefault in Flink's configuration. And in
> one of these files you have to create the xml tag to specify the class. But
> the easiest way would be to simply install Hadoop.
>
> I'm not aware of any puppet scripts but I might miss something here. If you
> should complete a puppet script, then it would definitely be a valuable
> addition to Flink :-)
>
> Cheers,
> Till
>
> On Tue, Apr 19, 2016 at 3:54 AM, Michael-Keith Bernard
> <mkbern...@opentable.com> wrote:
>>
>> Hello Flink Users!
>>
>> I'm a Flink newbie at the early stages of deploying our first Flink
>> cluster into production and I have a few questions about wiring up Flink
>> with S3:
>>
>> * We are going to use the HA configuration[1] from day one (we have
>> existing zk infrastructure already). Can S3 be used as a state backend for
>> the Job Manager? The documentation talks about using S3 as a state backend
>> for TM[2] (and in particular for streaming), but I'm wondering if it's a
>> suitable backend for the JM as well.
>>
>> * How do I configure S3 for Flink when I don't already have an existing
>> Hadoop cluster? The documentation references the Hadoop configuration
>> manifest[3], which kind of implies to me that I must already be running
>> Hadoop (or at least have a properly configured Hadoop cluster). Is there an
>> example somewhere of using S3 as a storage backend for a standalone cluster?
>>
>> * Bonus: I'm writing a Puppet module for installing/configuring/managing
>> Flink in stand alone mode with an existing zk cluster. Are there any
>> existing modules for this (I didn't find anything in the forge)? Would
>> others in the community be interested if we added our module to the forge
>> once complete?
>>
>> Thanks so much for your time and consideration. We look forward to using
>> Flink in production!
>>
>> Cheers,
>> Michael-Keith
>>
>> [1]: https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#standalone-cluster-high-availability
>>
>> [2]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#s3-simple-storage-service
>>
>> [3]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#set-s3-filesystem
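As a rough illustration of the setup Till describes above (a core-site.xml that names the S3 filesystem class, placed where Flink's configuration points), here is a minimal Python sketch that generates such a file, for example from a provisioning script. The helper name `build_core_site` is hypothetical, the credential values are placeholders, and the property names shown are an assumption based on Hadoop's S3A filesystem; please verify them against the Flink AWS documentation linked in the thread for your Hadoop version.

```python
import xml.etree.ElementTree as ET

# Assumed property names (Hadoop S3A filesystem); values are placeholders.
S3_PROPERTIES = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "fs.s3a.secret.key": "YOUR_SECRET_KEY",
}


def build_core_site(properties):
    """Return the text of a Hadoop-style <configuration> file for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        # Each setting is a <property> element with <name> and <value> children.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Write the file into the directory that Flink's configuration refers to.
    with open("core-site.xml", "w", encoding="utf-8") as handle:
        handle.write(build_core_site(S3_PROPERTIES))
```

The S3 filesystem class itself still has to be on the classpath (the hadoop-aws jar and its dependencies, as discussed earlier in the thread); this sketch only covers the configuration file.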