Hi,

I talked to some people and they agreed with me that the situation where
this problem really occurs is when someone builds a FileInputFormat
derivative that also uses a LineRecordReader derivative. This is exactly
the scenario that arises if someone follows the Yahoo Hadoop tutorial.

Instead of changing FileInputFormat (which many of the committers
considered to be a bad idea), I created a very simple patch for the
LineRecordReader that throws an exception (intentionally failing the
entire job) when it receives a split for a compressed file that was not
compressed using a SplittableCompressionCodec and where the split does not
start at the beginning of the file. In short: fail as soon as a
non-splittable file that has been split is detected.

So if you run this against a 1GB gzipped file, the first split of the
file will complete successfully and all other splits will fail without
even reading a single line.
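For reference, the guard described above can be sketched as a
self-contained helper. This is an illustration only: the class name, the
method and the exception type are invented here; the actual patch adds the
equivalent check inside LineRecordReader using Hadoop's codec classes.

```java
// Illustrative sketch of the "fail hard" check; names are invented and do
// not match Hadoop's actual LineRecordReader internals.
public class SplitGuard {

    /**
     * Fail hard when a split cannot possibly be decoded correctly:
     * the file is compressed, the codec is not splittable, and yet the
     * split does not start at byte 0 of the file.
     */
    public static void checkSplit(boolean compressed,
                                  boolean splittableCodec,
                                  long splitStart) {
        if (compressed && !splittableCodec && splitStart != 0) {
            throw new IllegalStateException(
                "Received a mid-file split (start=" + splitStart
                + ") for a non-splittable compressed file; "
                + "the InputFormat must not split such files.");
        }
    }
}
```

With a guard like this, every split of a large gzipped file except the
first one fails before a single line is read, instead of silently
producing garbage records.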

As far as I can tell this is a simple, clean and compatible patch that
does not break anything. Also, the change is limited to the most common
place where this problem occurs.
The only 'big' effect is that people who have been running a broken
implementation will no longer be able to run that broken code iff they
feed it 'large' non-splittable files. Which I think is a good thing.

What do you (the committers) think of this approach?

The patch I submitted a few days ago also includes the JavaDoc
improvements (in FileInputFormat) provided by Gian Merlino:

https://issues.apache.org/jira/browse/MAPREDUCE-2094

Niels Basjes

P.S. I still think that FileInputFormat.isSplitable() should implement a
safe default instead of an optimistic default.
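To make the P.S. concrete, here is a hypothetical sketch of the difference
between the optimistic and the safe default, using the
'fileNameIndicatesSplittableFile' enum-helper idea from the quoted
discussion below. Every name in this sketch is invented for illustration;
Hadoop's real FileInputFormat.isSplitable takes different parameters and
consults the compression codec factory rather than raw file names.

```java
// Hypothetical sketch: all names are invented for illustration.
enum SplitSupport { SPLITTABLE, NON_SPLITTABLE, UNKNOWN }

class SplittableDefaults {

    // Classify a file by name alone, admitting that we may not know.
    // A real implementation would ask the codec factory instead.
    static SplitSupport classify(String fileName) {
        if (fileName.endsWith(".bz2")) {
            return SplitSupport.SPLITTABLE;      // splittable codec
        }
        if (fileName.endsWith(".gz")) {
            return SplitSupport.NON_SPLITTABLE;  // gzip cannot be split
        }
        return SplitSupport.UNKNOWN;
    }

    // Optimistic default: anything not known to be non-splittable gets
    // split, which silently produces garbage for unrecognised formats.
    static boolean isSplitableOptimistic(String fileName) {
        return classify(fileName) != SplitSupport.NON_SPLITTABLE;
    }

    // Safe default: only split when splitting is known to be safe.
    // Costs performance in the unknown case, but never corrupts data.
    static boolean isSplitableSafe(String fileName) {
        return classify(fileName) == SplitSupport.SPLITTABLE;
    }
}
```

The trade-off is visible in the unknown case: the optimistic default
splits and risks garbage, the safe default refuses to split and only
costs performance.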



On Sat, Jun 14, 2014 at 10:33 AM, Niels Basjes <ni...@basjes.nl> wrote:

> I did some digging through the code base and inspected all the situations
> I know of where this goes wrong (including the Yahoo tutorial), and found
> a place that may be a good spot to avoid the effects of this problem
> (instead of solving its cause).
>
> It turns out that all of those use cases use the LineRecordReader to read
> the data. This class (both the mapred and mapreduce versions) has the
> notion of the split that needs to be read, whether the file is compressed
> and whether the compression codec is splittable.
>
> Now if we were to add code there that validates whether the provided
> splits are valid (i.e. whether the developer made this bug or not), then
> we could avoid the garbage-data problem before it is fed into the actual
> mapper. This check would then write error messages (plus the message "did
> you know you have been looking at corrupted data for a long time") that
> will appear in the logs of all the mapper attempts.
>
> At that point we can do one of these two actions in the LineRecordReader:
> - Fail hard with an exception. The job fails and the user immediately goes
> to the developer of the inputformat with a bug report.
> - Avoid the problem: Read the entire file iff the start of the split is 0,
> else read nothing. Many users will see a dramatic change in their results
> and (hopefully) start digging deeper. (Iff a human actually looks at the
> data)
>
> I vote for the "fail hard" because then people are forced to fix the
> problem and correct the historical impact.
>
> Would this be a good / compatible solution?
>
> If so then I think we should have this in both the 2.x and 3.x.
>
> For 3.x I also realized that perhaps isSplittable is something that could
> be delegated to the record reader. Would that make sense, or is this
> something that does not belong there?
> If not, then I would still propose making isSplittable abstract, to fix
> the problem before it is created (in 3.x).
>
> Niels Basjes
> On Jun 13, 2014 11:47 PM, "Chris Douglas" <cdoug...@apache.org> wrote:
>
>> On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes <ni...@basjes.nl> wrote:
>> > Hmmm, people only look at logs when they have a problem. So I don't
>> > think this would be enough.
>>
>> This change to the framework will cause disruptions to users, to aid
>> InputFormat authors' debugging. The latter is a much smaller
>> population and better equipped to handle this complexity.
>>
>> A log statement would print during submission, so it would be visible
>> to users. If a user's job is producing garbage but submission was
>> non-interactive, a log statement would be sufficient to debug the
>> issue. If the naming conflict is common in some contexts, the warning
>> can be disabled using the log configuration.
>>
>> Beyond that, input validation is the responsibility of the InputFormat
>> author.
>>
>> > Perhaps this makes sense:
>> > - For 3.0: Shout at the developer who does it wrong (i.e. make it
>> > abstract and force them to think about this), i.e. create a new
>> > abstract method isSplittable (two t's) in FileInputFormat and remove
>> > isSplitable (one t).
>> >
>> > To avoid needless code duplication (which we already have in the
>> > codebase) create a helper method, something like
>> > 'fileNameIndicatesSplittableFile' (returns enum:
>> > Splittable/NonSplittable/Unknown).
>> >
>> > - For 2.x: Keep the end user safe: avoid "silently producing garbage"
>> > in all situations where the developer already did it wrong (i.e. change
>> > isSplitable ==> return false). This costs performance only in those
>> > situations where the developer actually did it wrong (i.e. they didn't
>> > think this through).
>> >
>> > How about that?
>>
>> -1 on the 2.x change for compatibility reasons.
>>
>> While we can break compatibility in the 3.x line, the tradeoff is
>> still not very compelling, frankly. -C
>>
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
