The same applies to Flink. Transient data will only be stored on local
disks.

Cheers,
Till

On Thu, Jan 30, 2020 at 9:10 PM Piper Piper <piperfl...@gmail.com> wrote:

> Please disregard my previous email. I found the answer online.
>
> I thought writing data to local disk automatically meant the data would be
> persisted to HDFS. However, Spark writes data (in between shuffles) to
> local disk only.
>
> Thanks
>
> On Thu, Jan 30, 2020, 2:00 PM Piper Piper <piperfl...@gmail.com> wrote:
>
>> Hi Till,
>>
>> Thank you for the information!
>>
>> In the case of wide transformations, Spark stores input data on disk
>> between shuffles. So, I was wondering whether Flink does that as well (even
>> for windows of streaming data), and whether that "storing to disk" is
>> persisted to HDFS and honors the replication factor.
>>
>> Best,
>>
>> Pankaj
>>
>> On Wed, Jan 29, 2020 at 9:56 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Piper,
>>>
>>> in general, Flink does not store transient data such as event data on
>>> HDFS. Event data (data that is sent between the TaskManagers for
>>> processing) is kept only in memory and, if it grows too large, is spilled
>>> by some operators to local disk.
>>>
>>> What Flink does store on HDFS (if it is configured to do so) is the
>>> state data that is part of the job's checkpoints. Moreover, Flink stores
>>> the job information such as the JobGraph and the corresponding blobs (JARs
>>> and job artifacts) on HDFS if configured so.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Jan 29, 2020 at 7:07 AM Piper Piper <piperfl...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> When using Flink+YARN (with HDFS) and running a long-running Flink
>>>> session-mode cluster with a Flink client submitting jobs, HDFS could
>>>> have a replication factor greater than 1 (for example, 3).
>>>>
>>>> So, I would like to know when and how any of the data (like event data
>>>> or batch data) or code (like JARs) in a Flink job is saved to HDFS and
>>>> replicated across the nodes of the YARN cluster.
>>>>
>>>> For example, in streaming applications, would all the event-data only
>>>> be in memory (RAM) until it reaches the DAG's sink and then must be saved
>>>> into HDFS?
>>>>
>>>> Thank you,
>>>>
>>>> Piper
>>>>
>>>
