Hi,
readFile() expects a FileInputFormat, i.e., your custom InputFormat would
need to extend FileInputFormat.
In general, an InputFormat decides what to read when it generates its
InputSplits. In your case, the createInputSplits() method should return one
InputSplit for each file it wants to read.
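A minimal sketch of such a format could look like the following (the class
name FileListInputFormat and the choice of TextInputFormat as the base class
are just assumptions for illustration, not tested code):

import java.io.IOException;
import java.util.List;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// Hypothetical sketch: read only an explicit list of files, one split per file.
public class FileListInputFormat extends TextInputFormat {

    private final List<String> filePaths;

    public FileListInputFormat(List<String> filePaths) {
        super(new Path(filePaths.get(0))); // base path is not used for split creation
        this.filePaths = filePaths;
    }

    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        FileInputSplit[] splits = new FileInputSplit[filePaths.size()];
        for (int i = 0; i < filePaths.size(); i++) {
            Path file = new Path(filePaths.get(i));
            long length = file.getFileSystem().getFileStatus(file).getLen();
            // one split that covers the whole file
            splits[i] = new FileInputSplit(i, file, 0, length, null);
        }
        return splits;
    }
}

Since Flink hands input splits to the reader tasks on request at runtime,
faster readers simply pick up more of the per-file splits, which also helps
with the load-balancing concern mentioned later in the thread.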
Hi Fabian,
Thanks for your response.
If I implemented my own InputFormat, how would I read a specific list of
files from S3?
Assuming I need to use readFile(), the call below would read all of the
files from the specified S3 bucket or path:
env.readFile(new MyInputFormat(), "s3://my-bucket/")
Is there a way to read only a specific list of files from that bucket,
rather than everything under the path?
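For example, what I'd like is to hand the list straight to the environment,
roughly along these lines (file names made up, and FileListInputFormat
standing for whatever custom format I would have to write):

import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadFileListJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // made-up file names, just to show the intent
        List<String> files = Arrays.asList(
            "s3://my-bucket/part-0001",
            "s3://my-bucket/part-0002");

        // createInput() takes the format as-is, so the format itself can
        // decide which files to read when it creates its input splits
        DataSet<String> lines = env.createInput(new FileListInputFormat(files));

        lines.print();
    }
}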
Hi,
this is a valid approach.
It might suffer from unbalanced load if the reader tasks process the files
at different speeds (or the files vary in size), because each task has to
process the same number of files.
An alternative would be to implement your own InputFormat.
The input format would create the input splits itself, e.g., one InputSplit
per file.
Hi Fabian,
Thank you so much for your quick response, I appreciate it.
Since I'm working with a very large number of files of small sizes, I don't
necessarily need to read each file in parallel.
I need to read my large list of files in parallel - that is, split up my
list of files into smaller batches and have each parallel task read one
batch of files.
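Roughly, what I have in mind is something like the sketch below (the file
names and the line-by-line flatMap reading are just made up to show the
idea):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.Path;
import org.apache.flink.util.Collector;

public class ReadManySmallFiles {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // made-up list of small files; in practice this would be the large file list
        List<String> files = Arrays.asList(
            "s3://my-bucket/a.txt",
            "s3://my-bucket/b.txt");

        DataSet<String> lines = env
            .fromCollection(files)
            // distribute the paths round-robin, so every task gets roughly
            // the same number of paths, regardless of file size
            .rebalance()
            .flatMap(new FlatMapFunction<String, String>() {
                @Override
                public void flatMap(String path, Collector<String> out) throws Exception {
                    Path file = new Path(path);
                    // each task reads its files completely, one after the other
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(file.getFileSystem().open(file)))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            out.collect(line);
                        }
                    }
                }
            });

        lines.print();
    }
}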
Hi,
it depends on the file format whether a file can be read in parallel or
not. Basically, you have to be able to identify valid offsets from which
you can start reading.
There are a few techniques like fixed-size blocks with padding or a footer
section with split offsets, but if the file is already written without such
provisions, there is no safe way to split it.
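As a toy example of the fixed-size case (all numbers made up): if every
record has the same length, every multiple of the record length is a valid
start offset, so the split boundaries can be computed up front:

public class FixedSizeSplits {
    public static void main(String[] args) {
        long recordSize = 128;        // assumed fixed record length in bytes
        long fileLength = 1_000_000;  // assumed total file size in bytes
        int numSplits = 4;

        long numRecords = fileLength / recordSize;
        long recordsPerSplit = (numRecords + numSplits - 1) / numSplits; // ceiling

        for (int i = 0; i < numSplits; i++) {
            long startRecord = (long) i * recordsPerSplit;
            long endRecord = Math.min(startRecord + recordsPerSplit, numRecords);
            if (startRecord < numRecords) {
                long start = startRecord * recordSize;          // always a record boundary
                long length = (endRecord - startRecord) * recordSize;
                System.out.printf("split %d: offset=%d, length=%d%n", i, start, length);
            }
        }
    }
}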