This is the error I'm getting:

java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
        at java.io.InputStream.read(InputStream.java:85)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
        at org.apache.hadoop.mapred.Child.main(Child.java:211)

I'm trying to drill down on which files it's choking on, but I believe the
data got corrupted back when Flume and Amazon hated me and died.

Maybe I can skip the bad files and just log their names somewhere (rough
sketch below), and we probably should add some correctness tests :-)
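
Something like this is what I have in mind for the skipping part. It's a
totally untested sketch: the class name is made up, and I'm assuming
PigSplit.getWrappedSplit() is the right way to get at the underlying split
so I can log which file went bad.

import java.io.EOFException;
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/**
 * Sketch of a loader that treats a truncated gzip stream as "end of this
 * split": it logs which split blew up and returns null instead of letting
 * the EOFException kill the task.
 */
public class SkipBadGzipLoader extends LoadFunc {
    private static final Log LOG = LogFactory.getLog(SkipBadGzipLoader.class);
    private static final TupleFactory TUPLES = TupleFactory.getInstance();

    private RecordReader reader;
    private String splitDescription = "unknown split";

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
        // Remember which file this split came from so we can log it on failure.
        // (Assumes getWrappedSplit() hands back the underlying FileSplit.)
        try {
            splitDescription = split.getWrappedSplit().toString();
        } catch (Exception e) {
            // best effort only; keep the default description
        }
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;
            }
            Text line = (Text) reader.getCurrentValue();
            // A real loader would parse JSON here; a one-field tuple keeps the sketch short.
            return TUPLES.newTuple(line.toString());
        } catch (EOFException e) {
            // Truncated/corrupt gzip member: log the file and end this split early.
            LOG.warn("Corrupt input, skipping the rest of " + splitDescription, e);
            return null;
        } catch (InterruptedException e) {
            throw new IOException("Interrupted while reading " + splitDescription, e);
        }
    }
}

For the correctness tests, maybe something as simple as a standalone checker
the log mover could run before shipping files along: it just reads each .gz
to the end and reports anything that blows up. Again, only a sketch.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class GzipCheck {
    public static void main(String[] args) {
        byte[] buf = new byte[64 * 1024];
        for (String path : args) {
            GZIPInputStream in = null;
            try {
                in = new GZIPInputStream(new FileInputStream(path));
                // Drain the stream; a truncated file throws EOFException in here.
                while (in.read(buf) != -1) {
                    // we only care that the bytes decompress cleanly
                }
                System.out.println("OK      " + path);
            } catch (IOException e) {
                System.out.println("CORRUPT " + path + " (" + e.getMessage() + ")");
            } finally {
                if (in != null) {
                    try { in.close(); } catch (IOException ignored) { }
                }
            }
        }
    }
}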

-Kim


On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> How badly compressed are they? Problems in the codec, or in the data that
> comes out of the codec?
>
> We've had some lzo corruption problems, and so far have simply been
> dealing with that by doing correctness tests in our log mover pipeline
> before moving into the "data warehouse" area.
>
> Skipping bad files silently seems like asking for trouble (at some point
> the problem quietly grows and you wind up skipping most of your data), so
> I've been avoiding putting something like that in so that when things are
> badly broken, we get some early pain rather than lots of late pain.
>
> D
>
> On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
>
> > Hi,
> >
> > I'm processing gzip-compressed files in a directory, but some files are
> > corrupted and can't be decompressed.  Is there a way to skip the bad
> > files with a custom load func?
> >
> > -Kim
> >
>
