> On 05 Feb 2016, at 08:56, Jeyhun Karimov <je.kari...@gmail.com> wrote:
>
> For example, I will do aggregate operations with other windows (n-window
> aggregations) that are already outputted.
> I tried your suggestion and used the filesystem sink, outputting to HDFS.
> I got k files in the HDFS directory, where k is the parallelism (I used a
> single machine).
> These files keep growing (new records are appended) as the stream continues.
> Because the output files are not closed and their size changes regularly,
> would this cause problems when processing the data with the DataSet API,
> Hadoop, or another library?
I think you have used the plain file sink, whereas Robert was referring to the rolling HDFS file sink [1]. That sink buckets your data into different directories like this:

/base/path/{date-time}/part-{parallel-task}-{count}

– Ufuk

[1] https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/connectors/hdfs.html
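To make this concrete, here is a minimal sketch of wiring up the rolling sink, based on the connector docs linked above. It assumes the flink-connector-filesystem dependency is on the classpath; the bucket format string and batch size are illustrative values, not recommendations.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;

public class RollingSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical source; replace with your actual stream.
        DataStream<String> stream = env.socketTextStream("localhost", 9999);

        RollingSink<String> sink = new RollingSink<>("hdfs:///base/path");
        // Start a new bucket directory per hour, e.g. /base/path/2016-02-05--08/
        sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH"));
        // Roll to a new part file at ~400 MB; closed part files are then
        // safe to read with the DataSet API or other Hadoop tooling.
        sink.setBatchSize(1024L * 1024L * 400L);

        stream.addSink(sink);
        env.execute("Rolling HDFS sink example");
    }
}
```

The key point for your question: only the part file currently being written stays open; once it is rolled, it is finalized and will no longer change, so downstream batch jobs can safely pick up the closed files.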