Hi,

We have an in-house platform that we want to integrate with external clients via HDFS. They have many existing files and they continuously write more data to HDFS. Ideally, we would like a Flink job to take care of ingestion, since one of the requirements is to execute SQL on top of these files. We looked at the existing FileSource implementation, but we believe it is not well suited to this use case:

- Firstly, we would have to ingest all files initially present on HDFS before completing the first checkpoint. This is unacceptable for us, as an early job failure would force us to reprocess all the files again, not to mention the state blowing up for aggregations.
- Secondly, we see no way to establish a valid watermark strategy. This is a major pain point that we can't find the right answer for. We don't want to assume too much about the data itself, and in general the only solutions we see require some sort of synchronization across subtasks. The simplest strategy, on the other hand, is to delay the watermark; in that case, though, we are afraid of accidentally dropping events.
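To make the second concern concrete, the "delay the watermark" variant we have in mind looks roughly like the sketch below. It uses the stock FileSource with continuous monitoring plus a bounded-out-of-orderness watermark; the HDFS path, the delays, and the timestamp parsing are made-up placeholders, and this does nothing to address the first (checkpointing) concern:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HdfsIngestSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Continuously monitor the directory for new files instead of
        // doing a one-shot bounded read.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(
                        new TextLineInputFormat(),
                        new Path("hdfs:///data/incoming"))   // placeholder path
                .monitorContinuously(Duration.ofSeconds(30))
                .build();

        // "Delayed" watermark: tolerate records up to 10 minutes out of
        // order. Anything arriving later than that is still at risk of
        // being dropped by event-time windows, which is exactly our worry.
        WatermarkStrategy<String> watermarks = WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofMinutes(10))
                .withTimestampAssigner(
                        (line, ts) -> extractTimestamp(line))   // hypothetical parser
                .withIdleness(Duration.ofMinutes(1));           // avoid stalled watermarks
                                                                // on idle subtasks

        env.fromSource(source, watermarks, "hdfs-files")
           .print();

        env.execute("hdfs-ingest-sketch");
    }

    // Placeholder: in reality we would parse the event time out of the record.
    private static long extractTimestamp(String line) {
        return System.currentTimeMillis();
    }
}
```

The larger the out-of-orderness bound, the less likely we drop late files, but the longer every window and SQL aggregation has to wait before emitting results.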
Given this, we are thinking about implementing our own file source. Has someone in the community already tried to solve a similar problem? If not, any suggestions about the concerns we raised would be valuable.