Hi,

In Hadoop MapReduce there is the notion of "splittable" in the
FileInputFormat. This allows a single input file to be fed into
multiple separate mapper instances that each read a part of the data.
A lot has been documented (e.g. plain text is splittable per line,
gzipped text is not splittable) and designed into the various file
formats (like Avro and Parquet) to allow splittability.

The goal is that reading and parsing files can be done by multiple
cpus/systems in parallel.
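To make concrete what I mean: the idea is that a splittable file is cut into
byte ranges, one per split, while an unsplittable one (like gzip) becomes a
single split. A rough sketch in plain Java (illustrative only, no Hadoop
dependencies; the names and the fixed split size are my own, not Hadoop's
actual API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // A split is simply a (start, length) byte range within the file.
    record Split(long start, long length) {}

    static List<Split> computeSplits(long fileLength, long splitSize,
                                     boolean splittable) {
        List<Split> splits = new ArrayList<>();
        if (!splittable) {
            // e.g. a gzipped text file: the whole file is one split,
            // read by a single mapper.
            splits.add(new Split(0, fileLength));
            return splits;
        }
        // Splittable file: one byte range per split, so several
        // mappers can read different parts of the file in parallel.
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(computeSplits(300, 128, true));  // 3 ranges
        System.out.println(computeSplits(300, 128, false)); // 1 range
    }
}
```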

How is this handled in Flink?
Can Flink read a single file in parallel?
How does Flink handle splittability for the various file formats?


The reason I ask is that I want to see whether I can port this
(currently Hadoop-specific) hobby project of mine to Flink:
https://github.com/nielsbasjes/splittablegzip

Thanks.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
