Hey everyone,

I just noticed that when processing input splits from a
DelimitedInputFormat (specifically, a text file with words in it), the
entire read buffer is filled when the splitLength is 0 (see
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/io/DelimitedInputFormat.java#L577).
I'm using XtreemFS as the underlying file system, which stripes files in
128 KB blocks across storage servers. I have 8 physically separate nodes,
and my input file is 1 MB, so each node stores 128 KB of data. This is
reported accurately to Flink (e.g. split sizes and hostnames). But when
the splitLength becomes 0 at some point during processing (which it
eventually will), the entire file is read in again, which rather defeats
the point of processing a split of length 0.

Is this intended behavior? I've tried multiple hot-fixes, but they all
resulted in the file not being read in its entirety. I would like to
understand the rationale behind this implementation and maybe figure out
a way around it. Thanks in advance,
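For reference, here is how I read the logic at the linked line, boiled
down to a standalone sketch (simplified names, not the exact Flink
source):

```java
// Paraphrase of the buffer-fill logic in question (names simplified,
// not the actual Flink code): while the remaining splitLength is > 0
// the read is capped at the split, but once it reaches 0 a full read
// buffer is requested anyway.
public class FillBufferSketch {

    /** How many bytes the format will try to read next, as I understand it. */
    static int nextReadLength(long splitLength, int bufferSize) {
        if (splitLength > 0) {
            // still inside the split: read at most what is left of it
            return splitLength > bufferSize ? bufferSize : (int) splitLength;
        } else {
            // split exhausted (splitLength == 0): a whole buffer is read
            return bufferSize;
        }
    }

    public static void main(String[] args) {
        // hypothetical numbers matching my setup: 1 MB buffer, 128 KB split
        System.out.println(nextReadLength(128 * 1024, 1024 * 1024));
        System.out.println(nextReadLength(0, 1024 * 1024));
    }
}
```

With a 128 KB split remaining this caps the read at 131072 bytes, but
once splitLength hits 0 it asks for the full 1048576-byte buffer, which
is where the extra reads seem to come from on my setup.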

Robert

-- 
My GPG Key ID: 336E2680