The same applies to Flink. Transient data will only be stored on local disks.
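
As a rough illustration of the split between local and HDFS storage, a flink-conf.yaml fragment might look like the sketch below. The key names are the ones I believe Flink used around the time of this thread (newer releases have renamed some of them), and every path and host name is a made-up placeholder, so please check the configuration docs for your version before relying on it.

```yaml
# Sketch only: all paths and host names below are hypothetical placeholders.

# Checkpointed state is what ends up on HDFS (and is replicated by HDFS).
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints

# With high availability enabled, job metadata (JobGraph, blobs) is also
# written to a durable directory, here assumed to be on HDFS.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181
high-availability.storageDir: hdfs:///flink/ha

# Transient/spilled data goes to local directories only, never to HDFS.
# On YARN this normally defaults to the container's local dirs.
io.tmp.dirs: /tmp/flink
```

Only the hdfs:// locations are subject to the HDFS replication factor; whatever is written under io.tmp.dirs stays on the local disk of the node.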

Cheers,
Till

On Thu, Jan 30, 2020 at 9:10 PM Piper Piper <piperfl...@gmail.com> wrote:

> Please disregard my previous email. I found the answer online.
>
> I thought writing data to local disk automatically meant the data would be
> persisted to HDFS. However, Spark writes data (in between shuffles) to
> local disk only.
>
> Thanks
>
> On Thu, Jan 30, 2020, 2:00 PM Piper Piper <piperfl...@gmail.com> wrote:
>
>> Hi Till,
>>
>> Thank you for the information!
>>
>> In case of wide transformations, Spark stores input data onto disk
>> between shuffles. So, I was wondering if Flink does that as well (even
>> for windows of streaming data), and whether that "storing to disk" is
>> persisted to HDFS and honors the replication factor.
>>
>> Best,
>>
>> Pankaj
>>
>> On Wed, Jan 29, 2020 at 9:56 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hi Piper,
>>>
>>> in general, Flink does not store transient data such as event data on
>>> HDFS. Event data (data which is sent between the TaskManagers to
>>> process it) is only kept in memory and, if it becomes too big, is
>>> spilled by some operators to local disk.
>>>
>>> What Flink stores on HDFS (given it is configured this way) is the
>>> state data which is part of the job's checkpoints. Moreover, Flink
>>> stores the job information such as the JobGraph and the corresponding
>>> blobs (Jars and job artifacts) on HDFS if configured so.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Jan 29, 2020 at 7:07 AM Piper Piper <piperfl...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> When using Flink+YARN (with HDFS) and having a long-running Flink
>>>> session (mode) cluster with a Flink client submitting jobs, the HDFS
>>>> could have a replication factor greater than 1 (for example, 3).
>>>>
>>>> So, I would like to know when and how any of the data (like
>>>> event-data or batch-data) or code (like JAR) in a Flink job is saved
>>>> to the HDFS and is replicated in the entire YARN cluster of nodes?
>>>>
>>>> For example, in streaming applications, would all the event-data only
>>>> be in memory (RAM) until it reaches the DAG's sink and then must be
>>>> saved into HDFS?
>>>>
>>>> Thank you,
>>>>
>>>> Piper
>>>>