Hey Michael-Keith, are you running self-managed EC2 instances or EMR?
In addition to what Till said: we tried to document this here as well:
https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#provide-s3-filesystem-dependency

Does this help? You don't really need to install Hadoop, but only provide the configuration and the S3 FileSystem code on your classpath. If you use EMR + Flink on YARN, it should work out of the box.

– Ufuk

On Tue, Apr 19, 2016 at 10:23 AM, Till Rohrmann <trohrm...@apache.org> wrote:
> Hi Michael-Keith,
>
> you can use S3 as the checkpoint directory for the filesystem state backend.
> This means that whenever a checkpoint is performed, the state data will be
> written to this directory.
>
> The same holds true for the ZooKeeper recovery storage directory. This
> directory will contain the submitted but not yet finished jobs, as well as
> some metadata for the checkpoints. With this information it is possible to
> restore running jobs if the JobManager dies.
>
> As far as I know, Flink relies on Hadoop's file system wrapper classes to
> support S3. Flink has built-in support for HDFS, MapRFS, and the local file
> system. For everything else, Flink tries to find a Hadoop class. Therefore,
> I fear that you need at least Hadoop's S3 filesystem class on your classpath,
> plus a file called core-site.xml or hdfs-site.xml stored at the location
> specified by fs.hdfs.hdfsdefault in Flink's configuration. In one of these
> files you have to create the XML tag that specifies the filesystem class.
> The easiest way, though, would be to simply install Hadoop.
>
> I'm not aware of any Puppet scripts, but I might be missing something here.
> If you complete a Puppet script, it would definitely be a valuable
> addition to Flink :-)
>
> Cheers,
> Till
>
> On Tue, Apr 19, 2016 at 3:54 AM, Michael-Keith Bernard
> <mkbern...@opentable.com> wrote:
>>
>> Hello Flink Users!
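To make Till's steps concrete, a minimal sketch of the two files he describes follows. This assumes the Hadoop-bundled S3N filesystem (`org.apache.hadoop.fs.s3native.NativeS3FileSystem`); the file path, bucket, and credential values are placeholders, and the exact key names should be checked against the linked AWS setup page for your Flink version. First, point Flink at the Hadoop configuration file in `flink-conf.yaml`:

```yaml
# flink-conf.yaml: location of the Hadoop configuration file
# (key name as referenced in Till's mail; the path is a placeholder)
fs.hdfs.hdfsdefault: /etc/flink/core-site.xml
```

Then register the S3 filesystem class and credentials in that `core-site.xml`:

```xml
<!-- core-site.xml: register the S3N filesystem wrapper and credentials -->
<configuration>
  <property>
    <name>fs.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With this in place, `s3n://bucket/path` URIs become usable anywhere Flink accepts a filesystem path, without a full Hadoop installation.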
>>
>> I'm a Flink newbie at the early stages of deploying our first Flink
>> cluster into production, and I have a few questions about wiring up Flink
>> with S3:
>>
>> * We are going to use the HA configuration[1] from day one (we have
>> existing ZooKeeper infrastructure already). Can S3 be used as a state
>> backend for the JobManager? The documentation talks about using S3 as a
>> state backend for TaskManagers[2] (in particular for streaming), but I'm
>> wondering whether it's a suitable backend for the JobManager as well.
>>
>> * How do I configure S3 for Flink when I don't already have an existing
>> Hadoop cluster? The documentation references the Hadoop configuration
>> manifest[3], which implies to me that I must already be running Hadoop
>> (or at least have a properly configured Hadoop cluster). Is there an
>> example somewhere of using S3 as a storage backend for a standalone
>> cluster?
>>
>> * Bonus: I'm writing a Puppet module for installing/configuring/managing
>> Flink in standalone mode with an existing ZooKeeper cluster. Are there
>> any existing modules for this (I didn't find anything in the Forge)?
>> Would others in the community be interested if we added our module to
>> the Forge once complete?
>>
>> Thanks so much for your time and consideration. We look forward to using
>> Flink in production!
>>
>> Cheers,
>> Michael-Keith
>>
>> [1]: https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#standalone-cluster-high-availability
>> [2]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#s3-simple-storage-service
>> [3]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#set-s3-filesystem
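Tying the answers above back to these questions: once the S3 filesystem is on the classpath, both the checkpoint directory and the ZooKeeper recovery storage directory can point at S3 in a standalone HA cluster, so S3 serves JobManager recovery metadata as well as TaskManager state. A `flink-conf.yaml` sketch, assuming Flink 1.0-era `recovery.*` key names (renamed to `high-availability.*` in later releases) and placeholder bucket and ZooKeeper host names:

```yaml
# flink-conf.yaml sketch (Flink 1.0-era keys; bucket/host names are placeholders)

# Standalone HA via the existing ZooKeeper quorum
recovery.mode: zookeeper
recovery.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
# JobManager recovery metadata (submitted jobs, checkpoint pointers) goes to S3
recovery.zookeeper.storageDir: s3n://my-flink-bucket/recovery

# Filesystem state backend checkpointing to S3
state.backend: filesystem
state.backend.fs.checkpointdir: s3n://my-flink-bucket/checkpoints
```

Note that ZooKeeper itself only stores pointers; the bulk of the recovery data lives in the S3 `storageDir`, which is why an existing ZooKeeper ensemble plus an S3 bucket is enough for HA without HDFS.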