Re: Flink + S3

Michael-Keith Bernard Tue, 19 Apr 2016 16:36:03 -0700

Hey Till & Ufuk,

We're running on self-managed EC2 instances (and we'll eventually have a mirror 
cluster in our colo). The provided documentation notes that for Hadoop 2.6, 
we'd need specific versions of hadoop-aws and guice on the classpath. If I wanted 
to instead use Hadoop 2.7, which versions of those dependencies should I get? 
And how can I look that up myself? The pom file for hadoop-aws[1] doesn't 
mention a specific dependency on Guice, so I'm curious how the author of that 
documentation knew exactly the dependencies and versions required.
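
One way to start answering "how can I look that up myself" is to list what a POM declares directly. The following is a minimal sketch in Python, not an authoritative method: the sample POM and all of its coordinates (org.example, example-fs, and so on) are made up for illustration and are not the real hadoop-aws POM. It only reads the project's own <dependencies> block, so it does not resolve parent POMs, managed versions, or transitive dependencies; a dependency such as Guice may well arrive that way, and Maven's own `mvn dependency:tree` is the tool that resolves the full tree.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def declared_dependencies(pom_xml):
    """Return (groupId, artifactId, version, scope) for each direct dependency.

    Values that the POM does not state (for example a version inherited
    from a parent POM) are returned as None.
    """
    root = ET.fromstring(pom_xml.strip())
    deps = []
    # Only the project's own <dependencies>, not <dependencyManagement>.
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        fields = []
        for tag in ("groupId", "artifactId", "version", "scope"):
            node = dep.find("m:" + tag, POM_NS)
            has_text = node is not None and node.text
            fields.append(node.text.strip() if has_text else None)
        deps.append(tuple(fields))
    return deps


# Hypothetical POM, for illustration only.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>org.example</groupId>
  <artifactId>example-fs</artifactId>
  <version>1.0.0</version>
  <dependencies>
    <dependency>
      <groupId>org.example</groupId>
      <artifactId>example-common</artifactId>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>example-sdk</artifactId>
      <version>1.2.3</version>
    </dependency>
  </dependencies>
</project>
"""

if __name__ == "__main__":
    for dep in declared_dependencies(SAMPLE_POM):
        print(dep)
```

A None in the version slot is the hint that the version is managed elsewhere (typically in a parent POM), which is where the lookup would have to continue.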


Let me switch my questioning slightly:

What is the best (most widely supported, most common, easiest to use, easiest 
to scale, etc) way to deploy Flink today? I've been operating under the 
assumption that, since we have no existing Hadoop infrastructure, the path of 
least resistance is a stand-alone cluster. However, it seems like Flink is still 
relatively tightly coupled to the Hadoop platform, so maybe I would be better 
off switching to Hadoop + YARN? Our requirements are simple (for now):

Kafka (consumer & producer), S3 (read & write), streaming- and batch-mode 
computation

If the answer turns out to be that YARN is the best path forward for us, do you 
have any recommendations on how to get started building a minimal but 
production-ready Hadoop cluster suitable for Flink? Ambari looks amazing, so 
barring feedback to the contrary I'll probably be investing time looking at 
that first.

Finally, any relevant book recommendations? :) I'm extremely excited about this 
project, so all the feedback I can get is highly welcome and highly appreciated!

Cheers,
Michael-Keith

P.S. Is there planned support for Mesos as an alternative scheduler to YARN?

[1]: 
http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.2/hadoop-aws-2.7.2.pom

________________________________________
From: Ufuk Celebi <u...@apache.org>
Sent: Tuesday, April 19, 2016 2:30 AM
To: user@flink.apache.org
Subject: Re: Flink + S3

Hey Michael-Keith,

Are you running self-managed EC2 instances or EMR?

In addition to what Till said:

We tried to document this here as well:
https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#provide-s3-filesystem-dependency

Does this help? You don't really need to install Hadoop; you only need
to provide the configuration and the S3 FileSystem code on your
classpath.
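
As a rough illustration of "provide the configuration", the sketch below uses Python to render a Hadoop-style core-site.xml. Treat it as an assumption-laden example rather than the documented setup: the property names (fs.s3.impl, fs.s3a.access.key, fs.s3a.secret.key) and the S3AFileSystem class name are what I recall from the Hadoop S3A and Flink AWS documentation and should be checked against the Hadoop version in use, the credential values are placeholders, and render_hadoop_conf is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def render_hadoop_conf(properties):
    """Render a dict of settings as Hadoop <configuration> XML text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed property names; verify against your Hadoop version's docs.
S3_SETTINGS = {
    # Map the s3:// scheme to Hadoop's S3A file system implementation.
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    # Placeholder credentials; never commit real keys.
    "fs.s3a.access.key": "YOUR_ACCESS_KEY",
    "fs.s3a.secret.key": "YOUR_SECRET_KEY",
}

if __name__ == "__main__":
    print(render_hadoop_conf(S3_SETTINGS))
```

The resulting file would be saved as core-site.xml in a directory that Flink's configuration points to; the hadoop-aws jar and its dependencies still have to be on the classpath separately.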

If you use EMR + Flink on YARN, it should work out of the box.

– Ufuk

On Tue, Apr 19, 2016 at 10:23 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> Hi Michael-Keith,
>
> you can use S3 as the checkpoint directory for the filesystem state backend.
> This means that whenever a checkpoint is performed the state data will be
> written to this directory.
>
> The same holds true for the zookeeper recovery storage directory. This
> directory will contain the submitted and not yet finished jobs as well as
> some metadata for the checkpoints. With this information it is possible to
> restore running jobs if the job manager dies.
>
> As far as I know, Flink relies on Hadoop's file system wrapper classes to
> support S3. Flink has built-in support for hdfs, maprfs and the local file
> system. For everything else, Flink tries to find a Hadoop class. Therefore,
> I fear that you need at least Hadoop's s3 filesystem class in your classpath
> and a file called core-site.xml or hdfs-site.xml which is stored at a
> location specified by fs.hdfs.hdfsdefault in Flink's configuration. And in
> one of these files you have to create the xml tag to specify the class. But
> the easiest way would be to simply install Hadoop.
>
> I'm not aware of any puppet scripts but I might be missing something here. If you
> should complete a puppet script, then it would definitely be a valuable
> addition to Flink :-)
>
> Cheers,
> Till
>
> On Tue, Apr 19, 2016 at 3:54 AM, Michael-Keith Bernard
> <mkbern...@opentable.com> wrote:
>>
>> Hello Flink Users!
>>
>> I'm a Flink newbie at the early stages of deploying our first Flink
>> cluster into production and I have a few questions about wiring up Flink
>> with S3:
>>
>> * We are going to use the HA configuration[1] from day one (we have
>> existing zk infrastructure already). Can S3 be used as a state backend for
>> the Job Manager? The documentation talks about using S3 as a state backend
>> for TM[2] (and in particular for streaming), but I'm wondering if it's a
>> suitable backend for the JM as well.
>>
>> * How do I configure S3 for Flink when I don't already have an existing
>> Hadoop cluster? The documentation references the Hadoop configuration
>> manifest[3], which kind of implies to me that I must already be running
>> Hadoop (or at least have a properly configured Hadoop cluster). Is there an
>> example somewhere of using S3 as a storage backend for a standalone cluster?
>>
>> * Bonus: I'm writing a Puppet module for installing/configuring/managing
>> Flink in standalone mode with an existing zk cluster. Are there any
>> existing modules for this (I didn't find anything in the forge)? Would
>> others in the community be interested if we added our module to the forge
>> once complete?
>>
>> Thanks so much for your time and consideration. We look forward to using
>> Flink in production!
>>
>> Cheers,
>> Michael-Keith
>>
>> [1]:
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#standalone-cluster-high-availability
>>
>> [2]:
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#s3-simple-storage-service
>>
>> [3]:
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#set-s3-filesystem
>
>
