AFAIK Flink has a notion of "splittable" similar to Hadoop's. Furthermore, for a 
custom FileInputFormat you can set the attribute unsplittable = true if your 
file format cannot be split.
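As a minimal sketch (assuming Flink's `org.apache.flink.api.common.io.FileInputFormat`; the class name and record-reading bodies here are placeholders, not a real format):

```java
import org.apache.flink.api.common.io.FileInputFormat;
import java.io.IOException;

// Hedged sketch: a custom Flink input format that declares its files
// unsplittable, so each file is handed to exactly one parallel reader
// instance rather than being divided into byte-range splits.
public class MyUnsplittableFormat extends FileInputFormat<String> {

    public MyUnsplittableFormat() {
        // Protected field inherited from FileInputFormat:
        // tells Flink never to split files of this format.
        this.unsplittable = true;
    }

    @Override
    public boolean reachedEnd() throws IOException {
        // Placeholder: a real implementation tracks its position
        // in the inherited input stream (this.stream).
        return true;
    }

    @Override
    public String nextRecord(String reuse) throws IOException {
        // Placeholder: read and decode the next record from this.stream.
        return null;
    }
}
```

With `unsplittable` set, `createInputSplits` still produces one split per file, so parallelism across many files is preserved; only intra-file parallelism is given up.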

> On 18. Feb 2018, at 13:28, Niels Basjes <ni...@basjes.nl> wrote:
> 
> Hi,
> 
> In Hadoop MapReduce there is the notion of "splittable" in the
> FileInputFormat. This has the effect that a single input file can be fed
> into multiple separate instances of the mapper that read the data.
> A lot has been documented (e.g. text is splittable per line, gzipped text
> is not splittable) and designed into the various file formats (like Avro
> and Parquet) to allow splittability.
> 
> The goal is that reading and parsing files can be done by multiple
> cpus/systems in parallel.
> 
> How is this handled in Flink?
> Can Flink read a single file in parallel?
> How does Flink administrate/handle the possibilities regarding the various
> file formats?
> 
> 
> The reason I ask is because I want to see if I can port this (now Hadoop
> specific) hobby project of mine to work with Flink:
> https://github.com/nielsbasjes/splittablegzip
> 
> Thanks.
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes
