Yeah, the unexpected EOF is the most common one we get: lzo requires a
footer, and sometimes a filehandle gets closed before the footer is
written (if the network hiccups or something).
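
For what it's worth, you can reproduce that failure mode in isolation with
something like the sketch below (rough and untested; the Hadoop codec calls
are real, but the class name is made up, and for lzo it assumes the
hadoop-lzo codec is registered):

import java.io.EOFException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class TruncationProbe {
  // Returns how many bytes decompress cleanly before the missing
  // footer makes the stream die with EOFException.
  public static long cleanBytes(Path file, Configuration conf) throws Exception {
    FileSystem fs = file.getFileSystem(conf);
    // assumes the file extension maps to a registered codec (.gz, .lzo, ...)
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    long good = 0;
    try (InputStream in = codec.createInputStream(fs.open(file))) {
      byte[] buf = new byte[64 * 1024];
      for (int n; (n = in.read(buf)) != -1; ) {
        good += n;
      }
    } catch (EOFException e) {
      // the footer was never written, so the codec runs off the end mid-block
    }
    return good;
  }
}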

Right now what we do is scan each file before moving it to the DW; if the
scan fails, we extract what's extractable, catch the error, log how much
data was lost (what's left to read), and expose stats about this sort of
thing to monitoring software so we can alert if it gets out of hand.
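
The scan step is roughly the following shape. This is a sketch rather than
our actual pipeline code: the class name and the stderr reporting are
placeholders, and getPos() is only approximate because of read-ahead
buffering, but the Hadoop calls themselves are real.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LogMoverScan {
  // Decompress the whole file before it gets moved. Clean EOF means
  // it's safe to move; EOFException means we salvage what we read and
  // report roughly how much compressed input was left unread.
  public static boolean scan(Path file, Configuration conf) throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    long fileLen = fs.getFileStatus(file).getLen();
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);

    FSDataInputStream raw = fs.open(file);
    InputStream in = null;
    long salvaged = 0;
    try {
      in = codec.createInputStream(raw);
      byte[] buf = new byte[64 * 1024];
      for (int n; (n = in.read(buf)) != -1; ) {
        salvaged += n;  // the real pipeline writes these bytes back out
      }
      return true;  // footer intact, safe to move
    } catch (EOFException e) {
      // raw.getPos() is where the codec got to in the compressed input,
      // so fileLen - pos is roughly what's left to read (i.e. lost)
      long leftToRead = fileLen - raw.getPos();
      System.err.printf("%s: salvaged %d bytes, ~%d compressed bytes lost%n",
          file, salvaged, leftToRead);
      // this is where we'd bump the counters that monitoring alerts on
      return false;
    } finally {
      if (in != null) in.close(); else raw.close();
    }
  }
}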

D

On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <k...@simplegeo.com> wrote:

> This is the error I'm getting:
>
> java.io.EOFException: Unexpected end of input stream
>        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
>        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
>        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
>        at java.io.InputStream.read(InputStream.java:85)
>        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
>        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
>        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>        at org.apache.hadoop.mapred.Child.main(Child.java:211)
>
> I'm trying to drill down on which files it's bonking on, but I believe the
> data is corrupt from when Flume and Amazon hated me and died.
>
> Maybe I can skip bad files and just log their names somewhere, and we
> probably should add some correctness tests :-)
>
> -Kim
>
>
> On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com>
> wrote:
>
> > How are they corrupted? Problems in the codec itself, or in the data
> > that comes out of the codec?
> >
> > We've had some lzo corruption problems, and so far have simply been
> > dealing with that by doing correctness tests in our log mover pipeline
> > before moving into the "data warehouse" area.
> >
> > Skipping bad files silently seems like asking for trouble (at some
> > point the problem quietly grows and you wind up skipping most of your
> > data), so I've been avoiding putting something like that in, so that
> > when things are badly broken we get some early pain rather than lots
> > of late pain.
> >
> > D
> >
> > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
> >
> > > Hi,
> > >
> > > I'm processing gzip-compressed files in a directory, but some of the
> > > files are corrupted and can't be decompressed.  Is there a way to skip
> > > the bad files with a custom load func?
> > >
> > > -Kim
> > >
> >
>
