Hello,

We are working on a similar problem against S3. The checkpointing concern made me wonder whether a checkpoint would actually succeed with a large backlog of files. I always imagined that the SplitEnumerator lists all available files and the SourceReader is then responsible for reading them. Under that model, I assumed checkpoint barriers would be blocked until the initial backlog of files is fully consumed.
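
To make the question concrete, here is a toy sketch of that enumerator/reader model in plain Java. Nothing in it is Flink API: ToyFileSourceModel, Split, Snapshot, Enumerator and Reader are names made up for illustration, and the records are held in memory. It assumes the reader hands back one record per call, so a snapshot can be taken between records, and it shows what such a snapshot would have to contain for a checkpoint to complete mid-backlog: the splits not yet handed out, plus the reader's offset in its current split.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: none of these types are Flink classes.
class ToyFileSourceModel {

    // One "file", with its records already loaded for simplicity.
    static final class Split {
        final String path;
        final List<String> records;
        Split(String path, List<String> records) {
            this.path = path;
            this.records = records;
        }
    }

    // What a checkpoint would have to record in order to resume mid-backlog.
    static final class Snapshot {
        final List<Split> pendingSplits; // discovered but not yet handed out
        final Split currentSplit;        // null if nothing has been read yet
        final int offset;                // next record to read in currentSplit
        Snapshot(List<Split> pendingSplits, Split currentSplit, int offset) {
            this.pendingSplits = pendingSplits;
            this.currentSplit = currentSplit;
            this.offset = offset;
        }
    }

    // Enumerator role: knows the whole backlog and hands out one split at a time.
    static final class Enumerator {
        private final Deque<Split> pending = new ArrayDeque<>();
        Enumerator(List<Split> discovered) { pending.addAll(discovered); }
        Split nextSplit() { return pending.poll(); } // null once the backlog is empty
        List<Split> pendingSplits() { return new ArrayList<>(pending); }
    }

    // Reader role: returns one record per call, so the caller regains control
    // between records and can take a snapshot at that point.
    static final class Reader {
        private final Enumerator enumerator;
        private Split current;
        private int offset;

        Reader(Enumerator enumerator) { this.enumerator = enumerator; }

        // Rebuilds a reader (and its enumerator) from a snapshot.
        static Reader restore(Snapshot s) {
            Reader r = new Reader(new Enumerator(s.pendingSplits));
            r.current = s.currentSplit;
            r.offset = s.offset;
            return r;
        }

        // Returns the next record, or null when every split has been consumed.
        String pollNext() {
            while (current == null || offset >= current.records.size()) {
                current = enumerator.nextSplit();
                offset = 0;
                if (current == null) {
                    return null;
                }
            }
            return current.records.get(offset++);
        }

        Snapshot snapshot() {
            return new Snapshot(enumerator.pendingSplits(), current, offset);
        }
    }

    // Reads until the reader reports that nothing is left.
    static List<String> drain(Reader reader) {
        List<String> out = new ArrayList<>();
        for (String rec = reader.pollNext(); rec != null; rec = reader.pollNext()) {
            out.add(rec);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Split> backlog = List.of(
                new Split("a.txt", List.of("a1", "a2", "a3")),
                new Split("b.txt", List.of("b1", "b2")));
        Reader reader = new Reader(new Enumerator(backlog));
        reader.pollNext(); // a1
        reader.pollNext(); // a2
        Snapshot checkpoint = reader.snapshot(); // taken with most of the backlog unread
        Reader restored = Reader.restore(checkpoint);
        System.out.println(drain(restored)); // prints [a3, b1, b2]
    }
}
```

In this toy version the snapshot is taken after two records, and the restored reader continues with a3 instead of starting over. Whether the real FileSource behaves this way is exactly what I am asking about.
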
In the Flink UI, though, I can see that checkpoints complete on time when I run a small experiment against the local FS. Can somebody comment on how this works under the hood?

Adrian

On Fri, Apr 8, 2022 at 8:20 PM Roman Khachatryan <ro...@apache.org> wrote:

> Hi Carlos,
>
> AFAIK, Flink FileSource is capable of checkpointing while reading the files (at least in Streaming Mode).
> As for the watermarks, I think FLIP-182 [1] could solve the problem; however, it's currently under development.
>
> I'm also pulling in Arvid and Fabian who are more familiar with the subject.
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-182%3A+Support+watermark+alignment+of+FLIP-27+Sources
>
> Regards,
> Roman
>
> On Wed, Apr 6, 2022 at 4:17 PM Carlos Downey <carlos.downey...@gmail.com> wrote:
> >
> > Hi,
> >
> > We have an in-house platform that we want to integrate with external clients via HDFS. They have lots of existing files and they continuously put more data on HDFS. Ideally, we would like to have a Flink job that takes care of ingesting the data, as one of the requirements is to execute SQL on top of these files. We looked at the existing FileSource implementation, but we believe it will not be well suited for this use case.
> > - firstly, we'd have to ingest all files initially present on HDFS before completing the first checkpoint - this is unacceptable for us, as we would have to reprocess all the files again in case of an early job failure. Not to mention the state blowing up for aggregations.
> > - secondly, we see no way to establish a valid watermark strategy. This is a major pain point that we can't find the right answer for. We don't want to assume too much about the data itself. In general, the only solutions we see require some sort of synchronization across subtasks. On the other hand, the simplest strategy is to delay the watermark. In that case, though, we are afraid of accidentally dropping events.
> >
> > Given this, we are thinking about implementing our own file source. Has anyone in the community already tried solving a similar problem? If not, any suggestions about the concerns we raised would be valuable.