Re: Flink + S3

Robert Metzger Wed, 20 Apr 2016 06:34:26 -0700

Hi Michael-Keith,

Welcome to the Flink community!


Let me try answer your question regarding the "best" deployment options:
>From what I see from the mailing list, most of our users are using one of
the big hadoop distributions (including Amazon EMR) with YARN installed.
Having YARN makes things quite comfortable because its taking care of
restarting JVMs in failure cases, deployment of the flink jars, security
(if used), ...

Flink might look tightly coupled to Hadoop, because we ship it with
different Hadoop versions, but that's mostly for convenience reasons. You
don't need to install anything from the Hadoop project to run Flink.
Its just that in the past, almost all users were using Hadoop, so it was
not an issue.

I would not install YARN on the cluster just for running Flink.

How are you running Kafka in your cluster?
I think you can run Flink in a very similar way. The only difficult part is
probably the failure recovery: When a Flink JVM crashes, you want it to be
restarted by some service on your server.
I've seen users which were using OS tools like upstart to ensure the Flink
TaskManagers are always running.

Regarding the puppet module: The Apache Bigtop project (basically an open
source hadoop distro) is currently adding support for Flink:
https://issues.apache.org/jira/browse/BIGTOP-1927. They'll create deb and
rpm packages, puppet scripts and testing for Flink.

Maybe they can use your puppet code as a reference.

Independent of their effort, I think it would be great if you publish your
puppet module.

Regarding Mesos: There are plans to integrate Flink with Mesos, I don't
think it'll make it into 1.1 but 1.2 seems realistic.


Regards,
Robert



On Wed, Apr 20, 2016 at 1:35 AM, Michael-Keith Bernard <
mkbern...@opentable.com> wrote:

> Hey Till & Ufuk,
>
> We're running on self-managed EC2 instances (and we'll eventually have a
> mirror cluster in our colo). The provided documentation notes that for
> Hadoop 2.6, we'd need such-and-such version of hadoop-aws and guice on the
> CP. If I wanted to instead use Hadoop 2.7, which versions of those
> dependencies should I get? And how can I look that up myself? The pom file
> for hadoop-aws[1] doesn't mention a specific dependency on Guice, so I'm
> curious how the author of that documentation knew exactly the dependencies
> and versions required.
>
> Let me switch my questioning slightly:
>
> What is the best (most widely supported, most common, easiest to use,
> easiest to scale, etc) way to deploy Flink today? I've been operating under
> the assumption that, since we have no existing Hadoop infrastructure, the
> path of least resistance is a stand-alone cluster. However it seems like
> Flink is still relatively tightly coupled to the Hadoop platform, so maybe
> I would be better off switching to Hadoop + YARN? Our requirements are
> simple (for now):
>
> Kafka (consumer & producer), S3 (read & write), streaming- and batch-mode
> computation
>
> If the answer turns out to be that YARN is the best path forward for us,
> do you have any recommendations on how to get started building a minimal,
> but production ready Hadoop cluster suitable for Flink? Ambari looks
> amazing, so barring feedback to the contrary I'll probably be investing
> time looking at that first.
>
> Finally, any relevant book recommendations? :) I'm extremely excited about
> this project, so all the feedback I can get is highly welcome and highly
> appreciated!
>
> Cheers,
> Michael-Keith
>
> P.S. Is there planned support for Mesos as an alternative scheduler to
> YARN?
>
> [1]:
> http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.2/hadoop-aws-2.7.2.pom
>
> ________________________________________
> From: Ufuk Celebi <u...@apache.org>
> Sent: Tuesday, April 19, 2016 2:30 AM
> To: user@flink.apache.org
> Subject: Re: Flink + S3
>
> Hey Michael-Keith,
>
> are you running self-managed EC2 instances or EMR?
>
> In addition to what Till said:
>
> We tried to document this here as well:
>
> https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#provide-s3-filesystem-dependency
>
> Does this help? You don't need to really install Hadoop, but only
> provide the configuration and the S3 FileSystem code on your
> classpath.
>
> If you use EMR + Flink on YARN, it should work out of the box.
>
> – Ufuk
>
> On Tue, Apr 19, 2016 at 10:23 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
> > Hi Michael-Keith,
> >
> > you can use S3 as the checkpoint directory for the filesystem state
> backend.
> > This means that whenever a checkpoint is performed the state data will be
> > written to this directory.
> >
> > The same holds true for the zookeeper recovery storage directory. This
> > directory will contain the submitted and not yet finished jobs as well as
> > some meta data for the checkpoints. With this information it is possible
> to
> > restore running jobs if the job manager dies.
> >
> > As far as I know, Flink relies on Hadoop's file system wrapper classes to
> > support S3. Flink has built in support for hdfs, maprfs and the local
> file
> > system. For everything else, Flink tries to find a Hadoop class.
> Therefore,
> > I fear that you need at least Hadoop's s3 filesystem class in your
> classpath
> > and a file called core-site.xml or hdfs-site.xml which is stored at a
> > location specified by fs.hdfs.hdfsdefault in Flink's configuration. And
> in
> > one of these files you have to create the xml tag to specify the class.
> But
> > the easiest way would be to simply install Hadoop.
> >
> > I'm not aware of any puppet scripts but I might miss something here. If
> you
> > should complete a puppet script, then it would definitely be a valuable
> > addition to Flink :-)
> >
> > Cheers,
> > Till
> >
> > On Tue, Apr 19, 2016 at 3:54 AM, Michael-Keith Bernard
> > <mkbern...@opentable.com> wrote:
> >>
> >> Hello Flink Users!
> >>
> >> I'm a Flink newbie at the early stages of deploying our first Flink
> >> cluster into production and I have a few questions about wiring up Flink
> >> with S3:
> >>
> >> * We are going to use the HA configuration[1] from day one (we have
> >> existing zk infrastructure already). Can S3 be used as a state backend
> for
> >> the Job Manager? The documentation talks about using S3 as a state
> backend
> >> for TM[2] (and in particular for streaming), but I'm wondering if it's a
> >> suitable backend for the JM as well.
> >>
> >> * How do I configure S3 for Flink when I don't already have an existing
> >> Hadoop cluster? The documentation references the Hadoop configuration
> >> manifest[3], which kind of implies to me that I must already be running
> >> Hadoop (or at least have a properly configured Hadoop cluster). Is
> there an
> >> example somewhere of using S3 as a storage backend for a standalone
> cluster?
> >>
> >> * Bonus: I'm writing a Puppet module for installing/configuring/managing
> >> Flink in stand alone mode with an existing zk cluster. Are there any
> >> existing modules for this (I didn't find anything in the forge)? Would
> >> others in the community be interested if we added our module to the
> forge
> >> once complete?
> >>
> >> Thanks so much for your time and consideration. We look forward to using
> >> Flink in production!
> >>
> >> Cheers,
> >> Michael-Keith
> >>
> >> [1]:
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#standalone-cluster-high-availability
> >>
> >> [2]:
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#s3-simple-storage-service
> >>
> >> [3]:
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#set-s3-filesystem
> >
> >
>

Re: Flink + S3

Reply via email to