Hi, Meghajit Thanks Meghajit for sharing your user case. I found a workaround way that you could try to name your file in a timestamp style. More details could be found here[1]. Another little concern is that Flink is a distributed system, which means that we could not assume any order even if we list the file in the created order.
[1] https://stackoverflow.com/questions/49045725/gsutil-gcloud-storage-file-listing-sorted-date-descending Best, Guowei On Thu, Jan 20, 2022 at 11:11 PM Meghajit Mazumdar < meghajit.mazum...@gojek.com> wrote: > Hi Guowei, > > Thanks for your answer. Regarding your question, > *> Currently there is no such public interface ,which you could extend to > implement your own strategy. Would you like to share the specific problem > you currently meet?* > > The GCS bucket that we are trying to read from is periodically populated > with parquet files by another service. This can be daily or even hourly. > For an already pre-populated bucket, we would like to read the files > created from, say, day *T* till day *T+10*. Order matters here and hence > we would like to read the oldest files first, and then the new ones. Would > you know how I can enforce a reading order here ? > > Thanks, > Meghajit > > > > > On Thu, Jan 20, 2022 at 2:29 PM Guowei Ma <guowei....@gmail.com> wrote: > >> Hi, Meghajit >> >> 1. From the implementation [1] the order of split depends on the >> implementation of the FileSystem. >> >> 2. From the implementation [2] the order of the file also depends on the >> implementation of the FileSystem. >> >> 3. Currently there is no such public interface ,which you could extend to >> implement your own strategy. Would you like to share the specific problem >> you currently meet? >> >> 3. `FileSource` supports checkpoints. I think the watermark is a general >> mechanism so you could read the related documentation[3]. >> >> [1] >> https://github.com/apache/flink/blob/355b165859aebaae29b6425023d352246caa0613/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141 >> >> [2] >> https://github.com/apache/flink/blob/d33c39d974f08a5ac520f220219ecb0796c9448c/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L102 >> >> [3] >> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/ >> Best, >> Guowei >> >> >> On Wed, Jan 19, 2022 at 6:06 PM Meghajit Mazumdar < >> meghajit.mazum...@gojek.com> wrote: >> >>> Hello, >>> >>> We are using FileSource >>> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/> >>> to process Parquet Files and had a few doubts around it. Would really >>> appreciate if somebody can help answer them: >>> >>> 1. For a given file, does FileSource read the contents inside it in >>> order ? In other words, what is the order in which the file splits are >>> generated from the contents of the file ? >>> >>> 2. We want to provide a GCS Bucket URL to the FileSource so that it can >>> read parquet files from there. The bucket has multiple parquet files. >>> Wanted to know, what is the order in which the files will be picked and >>> processed by this FileSource ? Can we provide an order strategy ourselves, >>> say, process according to creation time ? >>> >>> 3. Is it possible/good practice to apply checkpointing and watermarking >>> for a bounded source like FileSource ? >>> >>> -- >>> *Regards,* >>> *Meghajit* >>> >> > > -- > *Regards,* > *Meghajit* >