Re: FileSource Usage

Meghajit Mazumdar Thu, 20 Jan 2022 07:11:22 -0800

Hi Guowei,

Thanks for your answer. Regarding your question:
*> Currently there is no such public interface that you could extend to
implement your own strategy. Would you like to share the specific problem
you currently meet?*


The GCS bucket that we are trying to read from is periodically populated
with Parquet files by another service, either daily or hourly.
For an already populated bucket, we would like to read the files
created from, say, day *T* through day *T+10*. Order matters here, so
we would like to read the oldest files first and then the newer ones. Do
you know how I can enforce a reading order here?
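
To make the requirement concrete, the ordering we are after could be sketched as below. This is only an illustration of the desired behaviour, not Flink API: `OldestFirst`, `FileEntry` and `order` are hypothetical names, the creation times are assumed to come from the bucket listing, and the window is assumed to be half-open (`from` inclusive, `until` exclusive). It does not by itself make `FileSource` read files in this order.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OldestFirst {

    /** Hypothetical descriptor: an object path plus its creation time. */
    public static final class FileEntry {
        public final String path;
        public final Instant createdAt;

        public FileEntry(String path, Instant createdAt) {
            this.path = path;
            this.createdAt = createdAt;
        }
    }

    /**
     * Keeps the entries created in [from, until) and returns them oldest first.
     * Ties on creation time are broken by path so the result is deterministic.
     */
    public static List<FileEntry> order(List<FileEntry> entries, Instant from, Instant until) {
        List<FileEntry> selected = new ArrayList<>();
        for (FileEntry e : entries) {
            // Assumption: the window is half-open, i.e. 'until' itself is excluded.
            if (!e.createdAt.isBefore(from) && e.createdAt.isBefore(until)) {
                selected.add(e);
            }
        }
        selected.sort(Comparator.comparing((FileEntry e) -> e.createdAt)
                .thenComparing(e -> e.path));
        return selected;
    }
}
```

With day *T* as `from` and the day after *T+10* as `until`, the returned list would be the files in the order we would like them to be processed.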

Thanks,
Meghajit




On Thu, Jan 20, 2022 at 2:29 PM Guowei Ma <guowei....@gmail.com> wrote:

> Hi, Meghajit
>
> 1. From the implementation [1], the order of the splits depends on the
> implementation of the FileSystem.
>
> 2. From the implementation [2], the order of the files also depends on the
> implementation of the FileSystem.
>
> 3. Currently there is no such public interface that you could extend to
> implement your own strategy. Would you like to share the specific problem
> you currently meet?
>
> 4. `FileSource` supports checkpoints. I think the watermark is a general
> mechanism, so you could read the related documentation [3].
>
> [1]
> https://github.com/apache/flink/blob/355b165859aebaae29b6425023d352246caa0613/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141
>
> [2]
> https://github.com/apache/flink/blob/d33c39d974f08a5ac520f220219ecb0796c9448c/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/NonSplittingRecursiveEnumerator.java#L102
>
> [3]
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/
> Best,
> Guowei
>
>
> On Wed, Jan 19, 2022 at 6:06 PM Meghajit Mazumdar <
> meghajit.mazum...@gojek.com> wrote:
>
>> Hello,
>>
>> We are using FileSource
>> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/>
>> to process Parquet files and had a few questions about it. We would really
>> appreciate it if somebody could help answer them:
>>
>> 1. For a given file, does FileSource read its contents in order?
>> In other words, in what order are the file splits generated
>> from the contents of the file?
>>
>> 2. We want to provide a GCS bucket URL to the FileSource so that it can
>> read Parquet files from there. The bucket has multiple Parquet files.
>> In what order will the files be picked up and
>> processed by this FileSource? Can we provide an ordering strategy ourselves,
>> say, process according to creation time?
>>
>> 3. Is it possible, and is it good practice, to apply checkpointing and
>> watermarking to a bounded source like FileSource?
>>
>> --
>> *Regards,*
>> *Meghajit*
>>
>

-- 
*Regards,*
*Meghajit*
