I think breaking backwards compat is sensible since It's easily caught by the compiler and in this case where the alternative is a Runtime error that can result in terabytes of mucked up output.
> On May 29, 2014, at 6:11 AM, Matt Fellows <matt.fell...@bespokesoftware.com> > wrote: > > As someone who doesn't really contribute, just lurks, I could well be > misinformed or under-informed, but I don't see why we can't deprecate a > method which could cause dangerous side effects? > People can still use the deprecated methods for backwards compatibility, but > are discouraged by compiler warnings, and any changes they write to their > code can start to use the new functionality? > > *Apologies if I'm stepping into a Hadoop holy war here > > >> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <ni...@basjes.nl> wrote: >> My original proposal (from about 3 years ago) was to change the isSplitable >> method to return a safe default ( you can see that in the patch that is >> still attached to that Jira issue). >> For arguments I still do not fully understand this was rejected by Todd and >> Doug. >> >> So that is why my new proposal is to deprecate (remove!) the old method >> with the typo in Hadoop 3.0 and replace it with something correct and less >> error prone. >> Given the fact that this would happen in a major version jump I thought >> that would be the right time to do that. >> >> Niels >> >> >> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran >> <ste...@hortonworks.com>wrote: >> >> > On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote: >> > >> > > Hi, >> > > >> > > Last week I ran into this problem again >> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094 >> > > >> > > What happens here is that the default implementation of the isSplitable >> > > method in FileInputFormat is so unsafe that just about everyone who >> > > implements a new subclass is likely to get this wrong. The effect of >> > > getting this wrong is that all unit tests succeed and running it against >> > > 'large' input files (>>64MiB) that are compressed using a non-splittable >> > > compression (often Gzip) will cause the input to be fed into the mappers >> > > multiple time (i.e. you get garbage results without ever seeing any >> > > errors). >> > > >> > > Last few days I was at Berlin buzzwords talking to someone about this bug >> > > >> > >> > that was me, I recall. >> > >> > >> > > and this resulted in the following proposal which I would like your >> > > feedback on. >> > > >> > > 1) This is a change that will break backwards compatibility (deliberate >> > > choice). >> > > 2) The FileInputFormat will get 3 methods (the old isSplitable with the >> > > typo of one 't' in the name will disappear): >> > > (protected) isSplittableContainer --> true unless compressed with >> > > non-splittable compression. >> > > (protected) isSplittableContent --> abstract, MUST be implemented by >> > > the subclass >> > > (public) isSplittable --> isSplittableContainer && >> > > isSplittableContent >> > > >> > > The idea is that only the isSplittable is used by other classes to know >> > if >> > > this is a splittable file. >> > > The effect I hope to get is that a developer writing their own >> > > fileinputformat (which I alone have done twice so far) is 'forced' and >> > > 'helped' getting this right. >> > > >> > >> > I could see making the attributes more explicit would be good -but stopping >> > everything that exists from working isn't going to fly. >> > >> > what about some subclass, AbstractSplittableFileInputFormat that implements >> > the container properly, requires that content one -and then calculates >> > IsSplitable() from the results? Existing code: no change, new formats can >> > descend from this (and built in ones retrofitted). >> > >> > >> > >> > > The reason for me to propose this as an incompatible change is that this >> > > way I hope to eradicate some of the existing bugs in custom >> > implementations >> > > 'out there'. >> > > >> > > P.S. If you agree to this change then I'm willing to put my back into it >> > > and submit a patch. >> > > >> > > -- >> > > Best regards, >> > > >> > > Niels Basjes >> > > >> > >> > -- >> > CONFIDENTIALITY NOTICE >> > NOTICE: This message is intended for the use of the individual or entity to >> > which it is addressed and may contain information that is confidential, >> > privileged and exempt from disclosure under applicable law. If the reader >> > of this message is not the intended recipient, you are hereby notified that >> > any printing, copying, dissemination, distribution, disclosure or >> > forwarding of this communication is strictly prohibited. If you have >> > received this communication in error, please contact the sender immediately >> > and delete it from your system. Thank You. >> > >> >> >> >> -- >> Best regards / Met vriendelijke groeten, >> >> Niels Basjes > > > > -- > > First Option Software Ltd > Signal House > Jacklyns Lane > Alresford > SO24 9JJ > Tel: +44 (0)1962 738232 > Mob: +44 (0)7710 160458 > Fax: +44 (0)1962 600112 > Web: www.bespokesoftware.com > > ____________________________________________________ > > This is confidential, non-binding and not company endorsed - see full terms > at www.fosolutions.co.uk/emailpolicy.html > First Option Software Ltd Registered No. 06340261 > Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K. > ____________________________________________________ >