Re: Flink+YARN HDFS replication factor

Till Rohrmann Wed, 29 Jan 2020 06:56:35 -0800

Hi Piper,

In general, Flink does not store transient data such as event data on HDFS.
Event data (the data sent between TaskManagers for processing) is kept only
in memory; if it grows too large, some operators spill it to local disk.

What Flink stores on HDFS (if it is configured to do so) is the state data
that is part of the job's checkpoints. Flink also stores job information,
such as the JobGraph and the corresponding blobs (JARs and job artifacts),
on HDFS if configured to do so.
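
As a rough sketch of what "configured this way" could look like, here is a
flink-conf.yaml fragment. The option names are standard Flink options; the
HDFS paths and the ZooKeeper address are placeholders I made up for
illustration, so adjust them to your setup.

```yaml
# Checkpointed state goes to HDFS (paths below are hypothetical).
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
state.savepoints.dir: hdfs:///flink/savepoints

# With high availability enabled, job metadata such as the JobGraph and
# the blobs (JARs, job artifacts) is persisted under the storage directory.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host:2181   # placeholder address
high-availability.storageDir: hdfs:///flink/ha
```

None of these options sets a replication factor. How often the files under
these paths are replicated is decided by HDFS itself (for example through
dfs.replication in hdfs-site.xml).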

Cheers,
Till

On Wed, Jan 29, 2020 at 7:07 AM Piper Piper <piperfl...@gmail.com> wrote:

> Hello,
>
> When using Flink+YARN (with HDFS) and having a long running Flink session
> (mode) cluster with a Flink client submitting jobs, the HDFS could have a
> replication factor greater than 1 (example 3).
>
> So, I would like to know when and how any of the data (like event-data or
> batch-data) or code (like JAR) in a Flink job is saved to the HDFS and is
> replicated in the entire YARN cluster of nodes?
>
> For example, in streaming applications, would all the event-data only be
> in memory (RAM) until it reaches the DAG's sink and then must be saved into
> HDFS?
>
> Thank you,
>
> Piper
>
