Hi Vishal,

With readFile, files are first collected and then sorted [1]. The same is
true for the new FileSource. There, you could plug in your own FileEnumerator
to emit files in chunks, but then the job would need to continuously pull for
more files and could not run in batch mode.
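
Purely as an illustration of that plug-in point (not a fix for the behavior
itself), here is a minimal FileEnumerator sketch for the new FileSource. It
delegates the recursive listing to the stock NonSplittingRecursiveEnumerator
and then orders the splits by path; note that it still returns all splits in
one collection, i.e. the whole listing happens up front. The class name
SortedFileEnumerator is made up for the example.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.connector.file.src.enumerate.FileEnumerator;
import org.apache.flink.connector.file.src.enumerate.NonSplittingRecursiveEnumerator;
import org.apache.flink.core.fs.Path;

/**
 * Illustrative only: delegate the recursive listing to the stock
 * NonSplittingRecursiveEnumerator, then order the resulting splits by path.
 * All splits are still produced as a single collection, i.e. the full
 * listing happens before any split reaches a reader.
 */
public class SortedFileEnumerator implements FileEnumerator {

    private final NonSplittingRecursiveEnumerator delegate =
            new NonSplittingRecursiveEnumerator();

    @Override
    public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
            throws IOException {
        List<FileSourceSplit> splits =
                new ArrayList<>(delegate.enumerateSplits(paths, minDesiredSplits));
        splits.sort(Comparator.comparing(split -> split.path().toString()));
        return splits;
    }
}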

We are happy to receive any patch for that behavior (for the new source).
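
For reference, wiring such an enumerator into the new FileSource roughly
looks like the sketch below (class names depend on your Flink version, e.g.
the line-based reader format is TextLineInputFormat in recent releases; the
bucket path is hypothetical):

import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

public class BoundedS3FileSourceExample {

    public static void main(String[] args) {
        // Bounded (PROCESS_ONCE-style) source: the enumerator lists the
        // bucket once and the job finishes after all splits are read.
        FileSource<String> source =
                FileSource.forRecordStreamFormat(
                                new TextLineInputFormat(),    // pick a format matching your data
                                new Path("s3a://my-bucket/")) // hypothetical bucket path
                        .setFileEnumerator(SortedFileEnumerator::new) // FileEnumerator.Provider
                        .processStaticFileSet() // enumerate once, no continuous discovery
                        .build();
    }
}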

[1]
https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/ContinuousFileMonitoringFunction.java#L259-L261

On Mon, Apr 4, 2022 at 12:07 AM Vishal Santoshi <vishal.santo...@gmail.com>
wrote:

> Folks,
>         I am doing a simple batch job that uses readFile() with
> "s3a://[bucket_name]" as the path with setNestedFileEnumeration(true). I am
> a little curious about a few things.
>
> In batch mode, which I think is turned on by
> FileProcessingMode.PROCESS_ONCE, does the source list all the S3
> objects in the bucket to create input splits *before* it calls downstream
> operators?
>
> Thanks.
