Re: FileSource Usage

Guowei Ma Thu, 20 Jan 2022 18:58:22 -0800

Hi, Meghajit

Thanks Meghajit for sharing your user case.
I found a workaround way that you could try to name your file in a
timestamp style. More details could be found here[1].
Another little concern is that Flink is a distributed system, which means
that we could not assume any order even if we list the file in the created
order.


[1]
https://stackoverflow.com/questions/49045725/gsutil-gcloud-storage-file-listing-sorted-date-descending
Best,
Guowei


On Thu, Jan 20, 2022 at 11:11 PM Meghajit Mazumdar <
meghajit.mazum...@gojek.com> wrote:

> Hi Guowei,
>
> Thanks for your answer. Regarding your question,
> *> Currently there is no such public interface ,which you could extend to
> implement your own strategy. Would you like to share the specific problem
> you currently meet?*
>
> The GCS bucket that we are trying to read from is periodically populated
> with parquet files by another service. This can be daily or even hourly.
> For an already pre-populated bucket, we would like to read the files
> created from, say, day *T* till day *T+10*.  Order matters here and hence
> we would like to read the oldest files first, and then the new ones.  Would
> you know how I can enforce a reading order here ?
>
> Thanks,
> Meghajit
>
>
>
>
> On Thu, Jan 20, 2022 at 2:29 PM Guowei Ma <guowei....@gmail.com> wrote:
>
>> Hi, Meghajit
>>
>> 1. From the implementation [1] the order of split depends on the
>> implementation of the FileSystem.
>>
>> 2. From the implementation [2] the order of the file also depends on the
>> implementation of the FileSystem.
>>
>> 3. Currently there is no such public interface ,which you could extend to
>> implement your own strategy. Would you like to share the specific problem
>> you currently meet?
>>
>> 3. `FileSource` supports checkpoints. I think the watermark is a general
>> mechanism so you could read the related documentation[3].
>>
>> [1]
>> https://github.com/apache/flink/blob/355b165859aebaae29b6425023d352246caa0613/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141
>>
>> [2]
>> https://github.com/apache/flink/blob/d33c39d974f08a5ac520f220219ecb0796c9448c/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L102
>>
>> [3]
>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
>> Best,
>> Guowei
>>
>>
>> On Wed, Jan 19, 2022 at 6:06 PM Meghajit Mazumdar <
>> meghajit.mazum...@gojek.com> wrote:
>>
>>> Hello,
>>>
>>> We are using FileSource
>>> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/>
>>> to process Parquet Files and had a few doubts around it. Would really
>>> appreciate if somebody can help answer them:
>>>
>>> 1. For a given file, does FileSource read the contents inside it in
>>> order ? In other words, what is the order in which the file splits are
>>> generated from the contents of the file ?
>>>
>>> 2. We want to provide a GCS Bucket URL to the FileSource so that it can
>>> read parquet files from there. The bucket has multiple parquet files.
>>> Wanted to know, what is the order in which the files will be picked and
>>> processed by this FileSource ? Can we provide an order strategy ourselves,
>>> say, process according to creation time ?
>>>
>>> 3. Is it possible/good practice to apply checkpointing and watermarking
>>> for a bounded source like FileSource ?
>>>
>>> --
>>> *Regards,*
>>> *Meghajit*
>>>
>>
>
> --
> *Regards,*
> *Meghajit*
>

Re: FileSource Usage

Reply via email to