Hi,

this is a valid approach. However, it might suffer from an unbalanced load if the reader tasks process files at different speeds (or if the files vary in size), because each task has to process the same number of files.
An alternative would be to implement your own InputFormat. The input format would create an InputSplit for each file to read. At runtime, the JobManager fetches the list of all InputSplits and lazily distributes them to the running source tasks. This automatically balances the load, because faster source tasks will process more splits than slower ones. A rough sketch of such an input format is at the bottom of this mail.

Best, Fabian

2017-08-31 0:24 GMT+02:00 ShB <shon.balakris...@gmail.com>:
> Hi Fabian,
>
> Thank you so much for your quick response, I appreciate it.
>
> Since I'm working with a very large number of small files, I don't
> necessarily need to read each individual file in parallel.
>
> I need to read my large list of files in parallel - that is, split up my
> list of files into smaller subsets and have each task manager read a
> subset of them.
>
> I implemented it like this:
> env.fromCollection(fileList).rebalance().flatMap(new ReadFiles());
> where ReadFiles is a flatMap function that reads each of the files from
> S3 using the AWS S3 Java SDK and parses and emits each of the protobufs.
>
> Is this implementation an efficient way of solving this problem?
>
> Is there a more performant way of reading a large number of files from
> S3 in a distributed manner, perhaps with env.readFile()?
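For reference, here is a minimal, untested sketch of what such an input format could look like. MyProto stands in for your generated protobuf class, and the sketch assumes each file contains a sequence of length-delimited protobuf messages; setting unsplittable = true makes FileInputFormat create exactly one InputSplit per file.

import java.io.IOException;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Sketch: one split per file, each file parsed as a stream of
// length-delimited protobuf messages (MyProto is a placeholder).
public class ProtoFileInputFormat extends FileInputFormat<MyProto> {

    private boolean end;

    public ProtoFileInputFormat(Path path) {
        super(path);
        // One InputSplit per whole file: the JobManager hands splits to
        // the source tasks lazily, so faster tasks receive more files.
        this.unsplittable = true;
    }

    @Override
    public void open(FileInputSplit split) throws IOException {
        super.open(split); // opens this.stream on the split's file
        end = false;
    }

    @Override
    public boolean reachedEnd() {
        return end;
    }

    @Override
    public MyProto nextRecord(MyProto reuse) throws IOException {
        // parseDelimitedFrom returns null once the stream is exhausted.
        MyProto next = MyProto.parseDelimitedFrom(stream);
        if (next == null) {
            end = true;
        }
        return next;
    }
}

You would then create the source like this (the S3 path is a placeholder; any file system Flink can access works the same way):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<MyProto> records =
    env.createInput(new ProtoFileInputFormat(new Path("s3://your-bucket/protos/")));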