This is the error I'm getting:

java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
        at org.apache.hadoop.mapred.Child.main(Child.java:211)
I'm trying to drill down on which files it's bonking on, but I believe the data is corrupt from when flume and amazon hated me and died. Maybe I can skip bad files and just log their names somewhere, and we probably should add some correctness tests :-)

-Kim

On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> How badly compressed are they? Problems in the codec, or in the data that
> comes out of the codec?
>
> We've had some lzo corruption problems, and so far have simply been dealing
> with that by doing correctness tests in our log mover pipeline before moving
> into the "data warehouse" area.
>
> Skipping bad files silently seems like asking for trouble (at some point the
> problem quietly grows and you wind up skipping most of your data), so I've
> been avoiding putting something like that in so that when things are badly
> broken, we get some early pain rather than lots of late pain.
>
> D
>
> On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
>
> > Hi,
> >
> > I'm processing gzipped compressed files in a directory, but some files are
> > corrupted and can't be decompressed. Is there a way to skip the bad files
> > with a custom load func?
> >
> > -Kim
> >
>
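A rough, untested sketch of the skip-and-log load func idea, written against the Pig 0.8-era LoadFunc API. The class name SkipBadGzipLoader is made up, and it extends the built-in PigStorage rather than the PigJsonLoader from the stack trace; the actual loader would need the same try/catch around its own getNext():

import java.io.EOFException;
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class SkipBadGzipLoader extends PigStorage {
    private static final Log LOG = LogFactory.getLog(SkipBadGzipLoader.class);

    // Remember the load location so the warning says what was being read.
    // Note this is the path/glob given to LOAD, not the exact bad file.
    private String location;

    @Override
    public void setLocation(String location, Job job) throws IOException {
        this.location = location;
        super.setLocation(location, job);
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            return super.getNext();
        } catch (EOFException e) {
            // Truncated/corrupt gzip stream: log it and report the split as
            // finished so the job keeps going instead of failing the task.
            LOG.warn("Skipping rest of corrupt input under " + location, e);
            return null;
        }
    }
}

Something like A = LOAD 'logs/*.gz' USING SkipBadGzipLoader(); would then keep the job alive past a truncated file, though as Dmitriy says, the skips should at least be logged or counted loudly so the problem doesn't quietly grow.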
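And a minimal illustration of the up-front correctness test Dmitriy describes: drain each .gz file through GZIPInputStream before moving it into the warehouse area, so truncated files are caught in the log-mover pipeline instead of inside a Pig job. The class name and command-line handling here are made up for the example:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipSanityCheck {

    /** Returns true if the whole gzip stream can be read to the end. */
    public static boolean isReadable(String path) {
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
            while (in.read(buf) != -1) {
                // just drain the stream; truncation/corruption throws here
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(path + (isReadable(path) ? " OK" : " CORRUPT"));
        }
    }
}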