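Hi Kirti,

For the watermark problem, I think the description in the document mainly refers to out-of-order data across multiple files. Because the watermark advances eagerly within a file, records in the next file may carry timestamps earlier than the current watermark. This results in a large number of late events [1], which in turn generate a large number of retract events, and late events that arrive beyond the allowed lateness will be discarded.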
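
To make the effect concrete, here is a minimal plain-Python sketch (not Flink API). It assumes a simplified per-event rule: an event is late when its timestamp is not greater than the current watermark, and it is dropped when it falls behind the watermark by the allowed lateness or more. The names `replay`, `Counts`, `file_a` and `file_b` are hypothetical and only serve the illustration; Flink's actual window operator decides lateness per window rather than per event.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    on_time: int = 0
    late_accepted: int = 0  # late, but within allowed lateness (would cause an update)
    dropped: int = 0        # behind the watermark by the allowed lateness or more


def replay(files, max_out_of_orderness=0, allowed_lateness=0):
    """Read files one after another, advancing the watermark eagerly."""
    counts = Counts()
    watermark = float("-inf")
    for events in files:
        for ts in events:
            if ts > watermark:
                counts.on_time += 1
            elif ts > watermark - allowed_lateness:
                counts.late_accepted += 1
            else:
                counts.dropped += 1
            # Simplified bounded-out-of-orderness watermark:
            # highest timestamp seen so far minus a fixed bound.
            watermark = max(watermark, ts - max_out_of_orderness)
    return counts


# Two hypothetical files whose time ranges overlap.
file_a = [100, 110, 120, 130]
file_b = [105, 115, 125, 135]

counts = replay([file_a, file_b], max_out_of_orderness=0, allowed_lateness=10)
print(counts)
```

With these inputs, reading file_a first pushes the watermark to 130, so from file_b the events 105 and 115 are dropped, 125 is late but still accepted, and only 135 is on time. Feeding the same eight events as one sorted file yields no late events at all, which is why the problem shows up mainly with large backlogs of files.
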
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/#lateness

Best,
Shammon FY

On Thu, Apr 13, 2023 at 8:27 PM Kirti Dhar Upadhyay K via user <user@flink.apache.org> wrote:

> Hi,
>
> I am using Data stream file source connector in one of my use case.
> I was going through the documentation where I found below limitations:
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations
>
> 1. Watermarking does not work very well for large backlogs of files.
>    This is because watermarks eagerly advance within a file, and the next file
>    might contain data later than the watermark.
>
> *Queries:*
> Is there any FLIP/design document to better understand the impact of these
> limitations?
> Also, is there any work ongoing on these limitations for future Flink
> releases, if yes, please redirect to any related document?
>
> 1. For Unbounded File Sources, the enumerator currently remembers
>    paths of all already processed files, which is a state that can, in some
>    cases, grow rather large.
>
> *Query:*
> What all data per file is part of checkpointing state by file source?
>
> Appreciate any help!
>
> Regards,
> Kirti Dhar