Hi,
readFile() expects a FileInputFormat, i.e., your custom InputFormat would
need to extend FileInputFormat.
In general, an InputFormat decides what to read when generating
InputSplits. In your case, the createInputSplits() method should return one
InputSplit for each file it wants to read.
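Leaving Flink's actual FileInputFormat/InputSplit classes aside, the core of the "one split per file" idea can be sketched with plain Java types. FileSplit and the method name below are illustrative stand-ins, not Flink's API:

```java
import java.util.List;

// Minimal sketch of the "one split per file" idea behind createInputSplits().
// FileSplit is an illustrative stand-in, not Flink's actual InputSplit class.
public class OneSplitPerFile {

    // Illustrative split descriptor: which file a reader task should consume.
    public record FileSplit(int splitNumber, String path) {}

    // Given the list of files the format wants to read, emit one split each.
    public static FileSplit[] createInputSplits(List<String> paths) {
        FileSplit[] splits = new FileSplit[paths.size()];
        for (int i = 0; i < paths.size(); i++) {
            splits[i] = new FileSplit(i, paths.get(i));
        }
        return splits;
    }

    public static void main(String[] args) {
        for (FileSplit s : createInputSplits(
                List.of("s3://my-bucket/a.pb", "s3://my-bucket/b.pb"))) {
            System.out.println(s.splitNumber() + " -> " + s.path());
        }
    }
}
```

In Flink itself, the runtime then hands these splits out to the parallel reader tasks, so a specific list of files is read simply by emitting splits only for those files.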
Hi Fabian,
Thanks for your response.
If I implemented my own InputFormat, how would I read a specific list of
files from S3?
Assuming I need to use readFile(), below would read all of the files from
the specified S3 bucket or path:
env.readFile(MyInputFormat, "s3://my-bucket/")
Is there a way to read only a specific list of files instead?
Hi,
this is a valid approach.
It might suffer from unbalanced load if the reader tasks process the files
at different speeds (or the files vary in size), because each task has to
process the same number of files.
An alternative would be to implement your own InputFormat.
The input format would create one InputSplit per file.
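The load-imbalance point above can be made concrete with a small sketch: rather than giving every reader the same *number* of files, assign each file to the currently lightest reader by cumulative byte size. The class and method names are illustrative, not Flink API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of size-aware balancing: assign each file to the least-loaded
// reader task by cumulative bytes, instead of equal file counts.
// Names here are illustrative, not Flink API.
public class BalancedAssignment {

    // fileSizes: path -> size in bytes; returns one list of paths per task.
    public static List<List<String>> assign(Map<String, Long> fileSizes, int parallelism) {
        List<List<String>> tasks = new ArrayList<>();
        long[] load = new long[parallelism];
        for (int i = 0; i < parallelism; i++) tasks.add(new ArrayList<>());

        // Largest files first, each assigned greedily to the lightest task.
        fileSizes.entrySet().stream()
            .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
            .forEach(e -> {
                int lightest = 0;
                for (int i = 1; i < parallelism; i++) {
                    if (load[i] < load[lightest]) lightest = i;
                }
                tasks.get(lightest).add(e.getKey());
                load[lightest] += e.getValue();
            });
        return tasks;
    }
}
```

Note that Flink's own lazy split assignment achieves a similar effect at runtime: splits are handed to readers as they finish, so a fast task simply pulls more splits.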
Hi Fabian,
Thank you so much for your quick response, I appreciate it.
Since I'm working with a very large number of small files, I don't
necessarily need to read each file in parallel.
I need to read my large list of files in parallel - that is, split up my
list of files into smaller chunks and read each chunk in parallel.
> be read from s3 in a distributed manner?
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Distributed-reading-and-parsing-of-protobuf-files-from-S3-in-Apache-Flink-tp14480.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
files be read from s3 using the Flink
DataSet API (like env.readFile)? How can these custom binary files be read
from s3 in a distributed manner?