RE: SplitEnumerator and SourceReader

Kirti Dhar Upadhyay K via user Thu, 20 Apr 2023 05:55:34 -0700

Thanks a lot, Martijn, for the quick response.

For point 3, I may have been confused by the link below:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface#FLIP27:RefactorSourceInterface-where_run_enumerator


Anyway, thanks for clarifying everything.

Just a further question on
“Yes, because the enumerator needs to remember the paths of all currently 
processed files. Depending on the use case, that can grow to be big. This is 
documented at 
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations”

Is there any recommendation for working within this limitation, for example on 
file size, number of files, or the choice of checkpointing state backend?

Regards,
Kirti Dhar

From: Martijn Visser <martijnvis...@apache.org>
Sent: 20 April 2023 18:14
To: Kirti Dhar Upadhyay K <kirti.k.dhar.upadh...@ericsson.com>
Cc: user@flink.apache.org
Subject: Re: SplitEnumerator and SourceReader

Hi Kirti Dhar,

1. The SourceReader downloads the file that is assigned to it by the 
SplitEnumerator.
2. This depends on the format; a BulkFormat like Parquet or ORC can be read in 
batches of records at a time.
3. The SplitEnumerator runs on the JobManager, not on a TaskManager. Have you 
read something different in the documentation?
4. Yes, because the enumerator needs to remember the paths of all currently 
processed files. Depending on the use case, that can grow to be big. This is 
documented at 
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#current-limitations
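
To make points 1, 3 and 4 concrete, here is a minimal toy model in plain Java. 
It is only a sketch under assumptions: the classes ToyEnumerator and ToyReader 
and the s3://bucket/... paths are hypothetical and are not Flink's actual 
SplitEnumerator or SourceReader interfaces. It shows a single enumerator that 
only handles paths and remembers every path it has seen, and a reader that is 
the only place where a file would be opened.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;

// Toy model only: hypothetical classes, NOT Flink's SplitEnumerator/SourceReader API.
class SplitSketch {

    /** Single instance per job (the JobManager side); handles paths, never file contents. */
    static class ToyEnumerator {
        // Grows with every discovered file; this is the state that must be checkpointed.
        private final Set<String> alreadySeenPaths = new HashSet<>();
        private final Queue<String> pendingSplits = new ArrayDeque<>();

        /** Simulates one directory listing; only unseen paths become new splits. */
        void discover(List<String> listedPaths) {
            for (String path : listedPaths) {
                if (alreadySeenPaths.add(path)) {
                    pendingSplits.add(path);
                }
            }
        }

        /** Hands the next unassigned path to a reader that asks for work. */
        Optional<String> nextSplit() {
            return Optional.ofNullable(pendingSplits.poll());
        }

        /** Number of paths the enumerator has to remember. */
        int trackedPathCount() {
            return alreadySeenPaths.size();
        }
    }

    /** Parallel instances (the TaskManager side); the file would be fetched and read here. */
    static class ToyReader {
        final List<String> filesRead = new ArrayList<>();

        void process(String path) {
            // A real reader would open the file (for example from S3) at this point.
            filesRead.add(path);
        }
    }

    public static void main(String[] args) {
        ToyEnumerator enumerator = new ToyEnumerator();
        ToyReader reader = new ToyReader();

        enumerator.discover(List.of("s3://bucket/a.csv", "s3://bucket/b.csv"));
        // The second listing repeats a.csv, which is skipped because it was seen before.
        enumerator.discover(List.of("s3://bucket/a.csv", "s3://bucket/c.csv"));

        Optional<String> split;
        while ((split = enumerator.nextSplit()).isPresent()) {
            reader.process(split.get());
        }
        System.out.println("files read: " + reader.filesRead.size());
        System.out.println("paths tracked: " + enumerator.trackedPathCount());
    }
}
```

In this sketch the set of seen paths never shrinks, which is why the enumerator 
state can grow large for a long-running continuous source.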

Best regards,

Martijn



On Thu, Apr 20, 2023 at 2:30 PM Kirti Dhar Upadhyay K via user 
<user@flink.apache.org> wrote:
Hi Community,

I recently started using the file source of Flink 1.17.x.
I was going through the FLIP-27 documentation and, as far as I understand, the 
SplitEnumerator lists files (splits) and assigns them to SourceReaders. A single 
instance of the SplitEnumerator runs, whereas the SourceReaders can run in 
parallel. I have the following questions about this:


  1.  Who actually downloads the file (let’s say the file is on S3)? Does the 
SplitEnumerator download the files and then assign the splits to the 
SourceReaders, or does it only list the files and pass each file path in a split 
to a SourceReader, which then downloads and processes the file?

  2.  Is the complete file downloaded in one go, or is chunked downloading also 
possible?

  3.  I understood that the SplitEnumerator can run on the JobManager or on a 
single TaskManager instance. How can a user configure where it runs?

  4.  Is there any memory footprint impact if the FileSource is running in 
streaming mode (continuous streaming)?


Thanks for any help!

Regards,
Kirti Dhar
