Hi Guowei, Thanks for your answer. Regarding your question, *> Currently there is no such public interface ,which you could extend to implement your own strategy. Would you like to share the specific problem you currently meet?*
The GCS bucket that we are trying to read from is periodically populated with parquet files by another service. This can be daily or even hourly. For an already pre-populated bucket, we would like to read the files created from, say, day *T* till day *T+10*. Order matters here and hence we would like to read the oldest files first, and then the new ones. Would you know how I can enforce a reading order here ? Thanks, Meghajit On Thu, Jan 20, 2022 at 2:29 PM Guowei Ma <guowei....@gmail.com> wrote: > Hi, Meghajit > > 1. From the implementation [1] the order of split depends on the > implementation of the FileSystem. > > 2. From the implementation [2] the order of the file also depends on the > implementation of the FileSystem. > > 3. Currently there is no such public interface ,which you could extend to > implement your own strategy. Would you like to share the specific problem > you currently meet? > > 3. `FileSource` supports checkpoints. I think the watermark is a general > mechanism so you could read the related documentation[3]. > > [1] > https://github.com/apache/flink/blob/355b165859aebaae29b6425023d352246caa0613/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141 > > [2] > https://github.com/apache/flink/blob/d33c39d974f08a5ac520f220219ecb0796c9448c/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L102 > > [3] > https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/ > Best, > Guowei > > > On Wed, Jan 19, 2022 at 6:06 PM Meghajit Mazumdar < > meghajit.mazum...@gojek.com> wrote: > >> Hello, >> >> We are using FileSource >> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/> >> to process Parquet Files and had a few doubts around it. Would really >> appreciate if somebody can help answer them: >> >> 1. For a given file, does FileSource read the contents inside it in order >> ? In other words, what is the order in which the file splits are generated >> from the contents of the file ? >> >> 2. We want to provide a GCS Bucket URL to the FileSource so that it can >> read parquet files from there. The bucket has multiple parquet files. >> Wanted to know, what is the order in which the files will be picked and >> processed by this FileSource ? Can we provide an order strategy ourselves, >> say, process according to creation time ? >> >> 3. Is it possible/good practice to apply checkpointing and watermarking >> for a bounded source like FileSource ? >> >> -- >> *Regards,* >> *Meghajit* >> > -- *Regards,* *Meghajit*