Hi Carlos, AFAIK, Flink FileSource is capable of checkpointing while reading the files (at least in Streaming Mode). As for the watermarks, I think FLIP-182 [1] could solve the problem; however, it's currently under development.
I'm also pulling in Arvid and Fabian who are more familiar with the subject. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-182%3A+Support+watermark+alignment+of+FLIP-27+Sources Regards, Roman On Wed, Apr 6, 2022 at 4:17 PM Carlos Downey <carlos.downey...@gmail.com> wrote: > > Hi, > > We have an in-house platform that we want to integrate with external clients > via HDFS. They have lots of existing files and they continuously put more > data to HDFS. Ideally, we would like to have a Flink job that takes care of > ingesting data as one of the requirements is to execute SQL on top of these > files. We looked at existing FileSource implementation but we believe this > will not be well suited for this use case. > - firstly, we'd have to ingest all files initially present on HDFS before > completing first checkpoint - this is unacceptable for us as we would have to > reprocess all the files again in case of early job failure. Not to mention > the state blowing up for aggregations. > - secondly, we see now way to establish valid watermark strategy. This is a > major pain point that we can't find the right answer for. We don't want to > assume too much about the data itself. In general, the only solutions we see > require some sort of synchronization across subtasks. On the other hand, the > simplest strategy is to delay the watermark. In that case though we are > afraid of accidentally dropping events. > > Given this, we think about implementing our own file source, have someone in > the community already tried solving similar problem? If not, any suggestions > about the concerns we raised would be valuable.