Hi,

The BucketingSink closes files once they reach a certain size (batchSize) or have not been written to for a certain amount of time (inactiveBucketThreshold). While being written to, files are in an in-progress state and are only moved to a completed state once they have been closed. When that happens, other systems can pick up the file and process it. Processing a file that has not been closed yet would cause many problems.
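For illustration, a minimal configuration sketch (the path, thresholds, and the 'events' stream are placeholder assumptions; the setters themselves are part of the BucketingSink API):

  import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

  BucketingSink<String> sink = new BucketingSink<>("hdfs:///base/path");
  sink.setBatchSize(1024L * 1024 * 128);           // roll a part file once it reaches 128 MB
  sink.setInactiveBucketThreshold(60 * 1000L);     // close buckets that were idle for 60 s
  sink.setInactiveBucketCheckInterval(30 * 1000L); // how often inactivity is checked
  events.addSink(sink);                            // 'events' is an assumed DataStream<String>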
However, closing files on every checkpoint would likely result in many small files, which HDFS doesn't handle well. You can of course take the BucketingSink code and adapt it to your use case (a rough sketch follows below the quoted message).

Best, Fabian

2018-03-20 2:13 GMT+01:00 XilangYan <xilang....@gmail.com>:

> The behavior of BucketingSink is not exactly what we want.
> If I understood correctly, when a checkpoint is requested, BucketingSink
> will flush the writer to make sure no data is lost, but it will not close
> the file, nor roll a new file after the checkpoint.
> In the case of HDFS, if the file length is not updated at the name node
> (by closing the file or updating the file length explicitly), MR and other
> data analysis tools will not read the new data. This is not what we want.
> I also want to open a new file for each checkpoint period to make sure the
> HDFS file is persistent, because we have hit some bugs in the flush/append
> HDFS use case.
>
> Is there any way to let BucketingSink roll files on each checkpoint?
> Thanks in advance.
>
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
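As for adapting the code, here is a rough sketch of what I mean, assuming you copy the BucketingSink source into your project (its state map, BucketState, and closeCurrentPartFile() are private, so a plain subclass won't reach them) and change snapshotState() to close open part files instead of only flushing them:

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
      synchronized (state.bucketStates) {
          for (Map.Entry<String, BucketState<T>> entry : state.bucketStates.entrySet()) {
              BucketState<T> bucketState = entry.getValue();
              if (bucketState.isWriterOpen) {
                  // Roll the in-progress part file to pending instead of only
                  // flushing it; it is finalized when the checkpoint completes,
                  // so HDFS readers then see its full length.
                  closeCurrentPartFile(bucketState);
              }
          }
      }
      // ...keep the rest of the original snapshotState() logic (recording
      // pending files per checkpoint, etc.) unchanged.
  }

Keep in mind this trades the file-visibility problem for the small-files problem mentioned above, so you'd want reasonably long checkpoint intervals.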