Hi,

Last week I ran into this problem again:
https://issues.apache.org/jira/browse/MAPREDUCE-2094

The default implementation of the isSplitable method in FileInputFormat
(which simply returns true) is so unsafe that just about everyone who
implements a new subclass is likely to get this wrong. When that happens,
all unit tests still succeed, but running the job against 'large' input
files (>>64MiB) that are compressed with a non-splittable codec (often
Gzip) causes the input to be fed into the mappers multiple times, i.e.
you get garbage results without ever seeing any errors.
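
To make the size of the effect concrete, here is a rough back-of-the-envelope
sketch in plain Java (no Hadoop classes). It is a simplified model, not the
actual split logic: the class and method names are made up for illustration,
the split count is a plain ceiling division, and it assumes every mapper ends
up decoding the non-splittable stream from the start.

```java
// Simplified model of the duplication effect; all names are hypothetical.
class SplitDuplicationSketch {

    // Number of splits created for a file, given what the input format claims.
    // Assumption: plain ceiling division, ignoring any framework-specific slop.
    static int countSplits(long fileSize, long splitSize, boolean claimsSplittable) {
        if (!claimsSplittable) {
            return 1; // the whole file goes to a single mapper
        }
        return (int) ((fileSize + splitSize - 1) / splitSize);
    }

    // Assumption: with a non-splittable codec each mapper cannot seek to its
    // own offset, so every mapper processes every record in the file.
    static long recordsSeenByMappers(long recordsInFile, int splits) {
        return recordsInFile * splits;
    }
}
```

Under these assumptions, a 200MiB gzipped file with 64MiB splits and a format
that wrongly claims to be splittable yields 4 splits, so 1000 records in the
file show up as 4000 records in the mappers.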

Over the last few days I was at Berlin Buzzwords, where I talked to someone
about this bug. That resulted in the following proposal, on which I would
like your feedback.

1) This is a change that will break backwards compatibility (a deliberate
   choice).
2) FileInputFormat will get 3 methods (the old isSplitable, with the typo
   of one 't' in the name, will disappear):
     (protected) isSplittableContainer --> true unless compressed with a
                 non-splittable compression.
     (protected) isSplittableContent --> abstract, MUST be implemented by
                 the subclass.
     (public)    isSplittable --> isSplittableContainer &&
                 isSplittableContent
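
A minimal sketch of how the three methods could fit together, written in
plain Java so it stands on its own. This is not a patch: the class names are
hypothetical, the methods take a file name instead of the real Hadoop
arguments, the container check only looks at a ".gz" suffix as a stand-in
for a real codec lookup, and making isSplittable final is my own assumption.

```java
// Hypothetical stand-in for the proposed FileInputFormat base class.
abstract class ProposedFileInputFormat {

    // True unless the file is compressed with a non-splittable compression.
    // Assumption: a ".gz" suffix stands in for a real codec lookup.
    protected boolean isSplittableContainer(String fileName) {
        return !fileName.endsWith(".gz");
    }

    // Abstract on purpose: every subclass MUST decide whether its record
    // format can be cut at arbitrary offsets.
    protected abstract boolean isSplittableContent(String fileName);

    // The only method other classes should call.
    // Assumption: final, so subclasses cannot bypass the container check.
    public final boolean isSplittable(String fileName) {
        return isSplittableContainer(fileName) && isSplittableContent(fileName);
    }
}

// Example subclass: line-oriented records can be split.
class LineRecordFormat extends ProposedFileInputFormat {
    @Override
    protected boolean isSplittableContent(String fileName) {
        return true;
    }
}

// Example subclass: each file is one record and must never be split.
class WholeFileFormat extends ProposedFileInputFormat {
    @Override
    protected boolean isSplittableContent(String fileName) {
        return false;
    }
}
```

With this shape, new LineRecordFormat().isSplittable("access.log.gz") is
false even though the subclass author only thought about the record format,
which is the 'helped' part of the proposal.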

The idea is that other classes use only isSplittable to find out whether a
file is splittable.
The effect I hope to get is that a developer writing their own
FileInputFormat subclass (which I alone have done twice so far) is both
'forced' and 'helped' to get this right.

The reason for me to propose this as an incompatible change is that this
way I hope to eradicate some of the existing bugs in custom implementations
'out there'.

P.S. If you agree to this change then I'm willing to put my back into it
and submit a patch.

-- 
Best regards,

Niels Basjes
