Re: Source with S3 bucket with millions ( billions ) of object ( small files )

2022-04-04 Thread Vishal Santoshi
Thanks for the clarification. My experiments have been in line with what you have suggested. Regards.

Re: Source with S3 bucket with millions ( billions ) of object ( small files )

2022-04-04 Thread Arvid Heise
Hi Vishal, with readFile, files are first collected and then sorted [1]. The same is true for the new FileSource. Here, you could plug in your own Enumerator to output files in chunks, but then you need to continuously pull more and can't use batch mode. We are happy to receive any patch for that b
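The collect-then-sort behavior and the chunked-enumerator idea mentioned above can be illustrated with a self-contained sketch. This is plain Java, not Flink's actual `FileEnumerator` interface; the class and method names (`ChunkedEnumeration`, `chunkSplits`) are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration of the enumeration behavior described above: every path is
// collected and sorted up front, then optionally handed out in fixed-size
// chunks, which is roughly what a custom enumerator would do. This is a
// simplified model, not the real Flink FileEnumerator API.
public class ChunkedEnumeration {

    // Collect-then-sort: mirrors how readFile/FileSource first gathers
    // the full file listing before any processing starts.
    public static List<String> collectAndSort(List<String> discoveredPaths) {
        List<String> sorted = new ArrayList<>(discoveredPaths);
        Collections.sort(sorted);
        return sorted;
    }

    // Hand the sorted paths out in fixed-size chunks, the way a custom
    // enumerator could feed splits incrementally instead of all at once.
    public static List<List<String>> chunkSplits(List<String> sortedPaths, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < sortedPaths.size(); i += chunkSize) {
            chunks.add(new ArrayList<>(
                sortedPaths.subList(i, Math.min(i + chunkSize, sortedPaths.size()))));
        }
        return chunks;
    }
}
```

The trade-off the message points at: chunked hand-out avoids materializing billions of splits at once, but the source must keep pulling chunks, which is why it conflicts with a one-shot batch mode.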

Re: Source with S3 bucket with millions ( billions ) of object ( small files )

2022-04-04 Thread Roman Grebennikov
Hi, in the unified stream/batch FileSource there is a processStaticFileSet() method to enumerate all the splits only once and make the Source complete when it's finished. In my own experience using processStaticFileSet with large S3 buckets, the enumeration seems to happen on the jobmanager
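The one-shot "static file set" behavior described above can be modeled with a small self-contained simulation. This is plain Java illustrating the mechanism, not Flink's FileSource internals; the class name `StaticFileSetSimulation` is made up for this sketch:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simulates static-file-set enumeration: the full split list is built once
// and held in coordinator memory (the jobmanager, in Flink's case), and the
// source is complete when the queue drains. With millions or billions of
// objects, this single up-front list is the likely memory hot spot.
public class StaticFileSetSimulation {
    private final Deque<String> pendingSplits;

    public StaticFileSetSimulation(List<String> allPaths) {
        // One-shot enumeration: everything is materialized here, up front.
        this.pendingSplits = new ArrayDeque<>(allPaths);
    }

    // Returns the next split, or null when the static set is exhausted,
    // at which point the (simulated) source finishes.
    public String nextSplit() {
        return pendingSplits.poll();
    }

    public boolean isFinished() {
        return pendingSplits.isEmpty();
    }
}
```

This is why enumeration happening on the jobmanager matters for very large buckets: the coordinator must hold the entire listing before workers can make progress.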

Source with S3 bucket with millions ( billions ) of object ( small files )

2022-04-03 Thread Vishal Santoshi
Folks, I am doing a simple batch job that uses readFile() with "s3a://[bucket_name]" as the path and setNestedFileEnumeration(true). I am a little curious about a few things. In batch mode, which I think is turned on by FileProcessingMode.PROCESS_ONCE, does the source list all the S3 o
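What setNestedFileEnumeration(true) implies can be sketched with a recursive walk that lists every file under a root before any record is read. This uses a local directory in place of an S3 bucket, and `listAllFiles` is a hypothetical helper for illustration, not a Flink method:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Rough model of nested file enumeration: a recursive walk collects every
// regular file under the root (including nested "directories", which on S3
// are key prefixes) before processing begins. With billions of small
// objects, this complete up-front listing is the concern raised above.
public class NestedEnumeration {
    public static List<Path> listAllFiles(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```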