Thank you very much. You've been very helpful. Since my intermediate results are large, I suspect that io.tmp.dirs must literally be on the local file system. Thus, since I use EMR, I'll need to configure EBS to support more data.
On Tue, May 18, 2021 at 11:08 PM Yun Gao <yungao...@aliyun.com> wrote: > Hi Marco, > > With BATCH mode, all the ALL_TO_ALL edges would be marked as blocking > and would use intermediate file to transfer data. Flink now support hash > shuffle > and sort shuffle for blocking edges[1], both of them stores the > intermediate files in > the directories configured by io.tmp.dirs[2]. > > > [1] > https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/batch/blocking_shuffle/ > [2] > https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#io-tmp-dirs > > ------------------Original Mail ------------------ > *Sender:*Marco Villalobos <mvillalo...@kineteque.com> > *Send Date:*Wed May 19 09:50:45 2021 > *Recipients:*user <user@flink.apache.org> > *Subject:*DataStream Batch Execution Mode and large files. > >> Hi, >> >> I am using the DataStream API in Batch Execution Mode, and my "source" is >> an s3 Buckets with about 500 GB of data spread across many files. >> >> Where does Flink stored the results of processed / produced data between >> tasks? >> >> There is no way that 500GB will fit in memory. So I am very curious how >> that happens. >> >> Can somebody please explain? >> >> Thank you. >> >> Marco A. Villalobos >> >