On Tue, Jun 10, 2014 at 8:10 PM, Chris Douglas <cdoug...@apache.org> wrote:
> On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes <ni...@basjes.nl> wrote:
> > and if you then give the file the .gz extension this breaks all common
> > sense / conventions about file names.
>
> That the suffix for all compression codecs in every context- and all
> future codecs- should determine whether a file can be split is not an
> assumption we can make safely. Again, that's not an assumption that
> held when people built their current systems, and they would be justly
> annoyed with the project for changing it.

That's not what I meant. What I understood from what was described is that
sometimes people use an existing file extension (like .gz) for a file that
is not a gzipped file.

If a file is splittable or not depends greatly on the actual codec
implementation that is used to read it. Using the default GzipCodec a .gz
file is not splittable, but that can be changed with a different
implementation like for example this
https://github.com/nielsbasjes/splittablegzip

So given a file extension the file 'must' be a file that is the format that
is described by the file name extension.

The flow is roughly as follows
- What is the file extension
- Get the codec class registered to that extension
- Is this a splittable codec ? (Does this class implement the
  splittablecodec interface)

> > I hold "correct data" much higher than performance and scalability; so the
> > performance impact is a concern but it is much less important than the list
> > of bugs we are facing right now.
>
> These are not bugs. NLineInputFormat doesn't support compressed input,
> and why would it? -C
>

I'm not saying it should (in fact, for this one I agree that it shouldn't).
The reality is that it accepts the file, decompresses it and then produces
output that 'looks good' but really is garbage.

I consider "silently producing garbage" one of the worst kinds of problem
to tackle.
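To make the three-step flow above concrete, here is a minimal sketch in plain Java. It deliberately does not use the real Hadoop classes: Codec, SplittableCodec, GzipLikeCodec, Bzip2LikeCodec and SplitCheckSketch are hypothetical stand-ins that only mirror the roles of a compression codec and a "splittable" marker interface. Treating a file with no registered codec as splittable is also an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Hadoop classes.
interface Codec {
    String defaultExtension();
}

// Marker interface: a codec that can be read starting from an arbitrary split.
interface SplittableCodec extends Codec {
}

final class GzipLikeCodec implements Codec {
    public String defaultExtension() { return ".gz"; }
}

final class Bzip2LikeCodec implements SplittableCodec {
    public String defaultExtension() { return ".bz2"; }
}

final class SplitCheckSketch {
    private final Map<String, Codec> codecsByExtension = new HashMap<>();

    void register(Codec codec) {
        codecsByExtension.put(codec.defaultExtension(), codec);
    }

    // Steps 1 and 2: take the file extension and look up the codec registered for it.
    Codec codecFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null; // no extension, so no codec
        }
        return codecsByExtension.get(fileName.substring(dot));
    }

    // Step 3: the file is splittable only if its codec implements the marker interface.
    boolean isSplitable(String fileName) {
        Codec codec = codecFor(fileName);
        if (codec == null) {
            return true; // assumption: no registered codec means uncompressed input
        }
        return codec instanceof SplittableCodec;
    }

    public static void main(String[] args) {
        SplitCheckSketch check = new SplitCheckSketch();
        check.register(new GzipLikeCodec());
        check.register(new Bzip2LikeCodec());
        for (String name : new String[] {"logs.gz", "logs.bz2", "logs.txt"}) {
            System.out.println(name + " splittable: " + check.isSplitable(name));
        }
    }
}
```

The point of the sketch is that the decision hangs on the codec class registered for the extension, not on the extension itself: registering a different, splittable implementation for .gz would change the answer without renaming any file.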
Because many custom file based input formats have stumbled (getting
"silently produced garbage") over the current isSplitable implementation I
really want to avoid any more of this in the future.
That is why I want to change the implementations in this area of Hadoop in
such a way that this "silently producing garbage" effect is taken out.

So the question remains: What is the way this should be changed?
I'm willing to build it and submit a patch.

> > The safest way would be either 2 or 4. Solution 3 would effectively be the
> > same as the current implementation, yet it would catch the problem
> > situations as long as people stick to normal file name conventions.
> > Solution 3 would also allow removing some code duplication in several
> > subclasses.
> >
> > I would go for solution 3.
> >
> > Niels Basjes
>

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes