Re: Source with S3 bucket with millions ( billions ) of object ( small files )

Vishal Santoshi Mon, 04 Apr 2022 12:09:03 -0700

Thanks for the clarification.

    My experiments have been in line with what you have suggested.


Regards.

On Mon, Apr 4, 2022 at 5:30 AM Arvid Heise <ar...@apache.org> wrote:

> Hi Vishal,
>
> With readFile, files are first collected and then sorted [1]. The same is
> true for the new FileSource. There, you could plug in your own Enumerator to
> output files in chunks, but then you need to continuously pull more files
> and can't use batch mode.
>
> We are happy to receive any patch for that behavior (for the new source).
>
> [1]
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/ContinuousFileMonitoringFunction.java#L259-L261
>
> On Mon, Apr 4, 2022 at 12:07 AM Vishal Santoshi <vishal.santo...@gmail.com>
> wrote:
>
>> Folks,
>>         I am doing a simple batch job that uses readFile() with
>> "s3a://[bucket_name]" as the path with setNestedFileEnumeration(true). I am
>> a little curious about a few things.
>>
>> In batch mode, which I think is turned on by
>> FileProcessingMode.PROCESS_ONCE, does the source list all the S3
>> objects in the bucket to create input splits *before* it calls
>> downstream operators?
>>
>> Thanks.
>>
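
Note on the thread above: the difference between "collect everything, then
sort" and "output files in chunks" can be illustrated with a minimal
plain-Java sketch. This is not Flink code and not the FileSource enumerator
interface. The class name ChunkedLister, the chunkSize parameter, and the
idea of feeding it an iterator over a paginated listing are all assumptions
made for illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical helper (not part of Flink): hands out paths in fixed-size
 * chunks, pulling from the underlying listing only when a chunk is requested,
 * instead of materializing and sorting the full listing up front.
 */
final class ChunkedLister implements Iterator<List<String>> {
    private final Iterator<String> listing; // assumed: a lazy, paginated listing
    private final int chunkSize;

    ChunkedLister(Iterator<String> listing, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.listing = listing;
        this.chunkSize = chunkSize;
    }

    @Override
    public boolean hasNext() {
        return listing.hasNext();
    }

    @Override
    public List<String> next() {
        if (!listing.hasNext()) {
            throw new NoSuchElementException();
        }
        // Pull at most chunkSize entries; the rest of the listing stays untouched.
        List<String> chunk = new ArrayList<>(chunkSize);
        while (chunk.size() < chunkSize && listing.hasNext()) {
            chunk.add(listing.next());
        }
        return chunk;
    }

    public static void main(String[] args) {
        Iterator<String> listing =
                List.of("a.txt", "b.txt", "c.txt", "d.txt", "e.txt").iterator();
        ChunkedLister chunks = new ChunkedLister(listing, 2);
        while (chunks.hasNext()) {
            System.out.println(chunks.next());
        }
    }
}
```

The trade-off matches what Arvid describes: memory stays bounded by the chunk
size, but the consumer has to keep asking for more, and no global ordering of
the files is available because the full listing is never held at once.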
