Re: HDFS streaming source concerns

2022-04-19 Thread Adrian Bednarz
Hello, We are actually working on a similar problem against S3. The point about checkpointing got me wondering whether a checkpoint would indeed succeed with a large backlog of files. I always imagined that SplitEnumerator lists all available files and SourceReader is responsible for reading those files aft

Re: HDFS streaming source concerns

2022-04-08 Thread Roman Khachatryan
Hi Carlos, AFAIK, Flink FileSource is capable of checkpointing while reading the files (at least in Streaming Mode). As for the watermarks, I think FLIP-182 [1] could solve the problem; however, it's currently under development. I'm also pulling in Arvid and Fabian who are more familiar with the

HDFS streaming source concerns

2022-04-06 Thread Carlos Downey
Hi, We have an in-house platform that we want to integrate with external clients via HDFS. They have lots of existing files and they continuously add more data to HDFS. Ideally, we would like to have a Flink job that takes care of ingesting data, as one of the requirements is to execute SQL on top