Hi,

The BucketingSink closes files once they reach a certain size (batchSize) or have not been written to for a certain amount of time (inactiveBucketThreshold). While being written to, files are in an in-progress state and are only moved to a completed state once they have been closed. When that happens, other systems can pick up the file and process it. Processing a file that has not been closed yet would cause many problems.
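For illustration, a minimal configuration sketch (the path, thresholds, and the 'events' stream are placeholder assumptions; the setters themselves are part of the BucketingSink API):

  import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

  BucketingSink<String> sink = new BucketingSink<>("hdfs:///base/path");
  sink.setBatchSize(1024L * 1024 * 128);           // roll a part file once it reaches 128 MB
  sink.setInactiveBucketThreshold(60 * 1000L);     // close buckets that were idle for 60 s
  sink.setInactiveBucketCheckInterval(30 * 1000L); // how often inactivity is checked
  events.addSink(sink);                            // 'events' is an assumed DataStream<String>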
However, closing files on every checkpoint would likely result in many small files, which HDFS doesn't handle well. You can of course take the BucketingSink code and adapt it to your use case (a rough sketch follows below the quoted message).

Best, Fabian

2018-03-20 2:13 GMT+01:00 XilangYan <xilang....@gmail.com>:

> The behavior of BucketingSink is not exactly what we want.
> If I understood correctly, when a checkpoint is requested, BucketingSink
> will flush the writer to make sure no data is lost, but it will not close
> the file, nor roll a new file after the checkpoint.
> In the case of HDFS, if the file length is not updated at the name node
> (by closing the file or updating the file length explicitly), MR and other
> data analysis tools will not read the new data. This is not what we want.
> I also want to open a new file for each checkpoint period to make sure the
> HDFS file is persistent, because we have hit some bugs in the flush/append
> HDFS use case.
>
> Is there any way to let BucketingSink roll files on each checkpoint?
> Thanks in advance.
>
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
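As for adapting the code, here is a rough sketch of what I mean, assuming you copy the BucketingSink source into your project (its state map, BucketState, and closeCurrentPartFile() are private, so a plain subclass won't reach them) and change snapshotState() to close open part files instead of only flushing them:

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
      synchronized (state.bucketStates) {
          for (Map.Entry<String, BucketState<T>> entry : state.bucketStates.entrySet()) {
              BucketState<T> bucketState = entry.getValue();
              if (bucketState.isWriterOpen) {
                  // Roll the in-progress part file to pending instead of only
                  // flushing it; it is finalized when the checkpoint completes,
                  // so HDFS readers then see its full length.
                  closeCurrentPartFile(bucketState);
              }
          }
      }
      // ...keep the rest of the original snapshotState() logic (recording
      // pending files per checkpoint, etc.) unchanged.
  }

Keep in mind this trades the file-visibility problem for the small-files problem mentioned above, so you'd want reasonably long checkpoint intervals.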