Do you catch the error when you load with pig, or is that a pre-pig step? If I wanted to catch the error in a pig load, is it possible? Where would that code go?
-Kim

On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Yeah, so the unexpected EOF is the most common one we get (lzo requires a
> footer, and sometimes filehandles are closed before a footer is written, if
> the network hiccups or something).
>
> Right now what we do is scan before moving to the DW, and if not successful,
> extract what's extractable, catch the error, log how much data is lost
> (what's left to read), and expose stats about this sort of thing to
> monitoring software so we can alert if stuff gets out of hand.
>
> D
>
> On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <k...@simplegeo.com> wrote:
>
> > This is the error I'm getting:
> >
> > java.io.EOFException: Unexpected end of input stream
> >     at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> >     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> >     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> >     at java.io.InputStream.read(InputStream.java:85)
> >     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> >     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> >     at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> >     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> >     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:396)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:211)
> >
> > I'm trying to drill down on which files it's bonking on, but I believe the
> > data is corrupt from when flume and amazon hated me and died.
> >
> > Maybe I can skip bad files and just log their names somewhere, and we
> > probably should add some correctness tests :-)
> >
> > -Kim
> >
> > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> >
> > > How badly corrupted are they? Problems in the codec, or in the data that
> > > comes out of the codec?
> > >
> > > We've had some lzo corruption problems, and so far have simply been dealing
> > > with that by doing correctness tests in our log mover pipeline before
> > > moving into the "data warehouse" area.
> > >
> > > Skipping bad files silently seems like asking for trouble (at some point
> > > the problem quietly grows and you wind up skipping most of your data), so
> > > I've been avoiding putting something like that in, so that when things are
> > > badly broken we get some early pain rather than lots of late pain.
> > >
> > > D
> > >
> > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm processing gzip-compressed files in a directory, but some files are
> > > > corrupted and can't be decompressed. Is there a way to skip the bad files
> > > > with a custom load func?
> > > >
> > > > -Kim
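To make the question at the top concrete: the EOFException in the trace above surfaces inside the loader's getNext() (note the PigJsonLoader frame), so that is one place it could be caught. Below is a minimal, untested sketch of a custom LoadFunc that treats an unexpected EOF as "end of split", logs it, and moves on. The class name and the single-field tuple it emits are illustrative only, not SimpleGeo's actual PigJsonLoader.

import java.io.EOFException;
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical line-oriented loader that treats an unexpected EOF in a
// truncated/corrupt compressed file as "end of this split" instead of
// failing the whole task.
public class SkipCorruptLoader extends LoadFunc {
    private static final Log LOG = LogFactory.getLog(SkipCorruptLoader.class);
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    private RecordReader reader;

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, new Path(location));
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                      // normal end of split
            }
            Text line = (Text) reader.getCurrentValue();
            // Real parsing (e.g. JSON -> map) would go here; a one-field
            // tuple keeps the sketch short.
            return tupleFactory.newTuple(line.toString());
        } catch (EOFException e) {
            // Truncated gzip/lzo stream: log it and pretend the split is
            // exhausted so the job keeps going instead of dying here.
            LOG.warn("Skipping remainder of corrupt split", e);
            return null;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Since gzip (and non-indexed lzo) files aren't splittable, each corrupt file maps to a single split, so returning null here effectively drops the unreadable tail of that file. Per Dmitriy's caution about silent skipping, the warning log (ideally plus a counter) keeps the loss visible.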
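For the "scan before moving to the DW" step Dmitriy describes, a pre-promotion check could be as simple as reading each file through its Hadoop codec to EOF and reporting how much decompresses cleanly. This is a rough sketch under those assumptions, not their actual pipeline; the class and method names are made up.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical pre-promotion check: decompress a file end to end and report
// how much of it reads cleanly, so bad files never reach the warehouse
// directory and the amount of lost data can be fed to monitoring.
public class CompressedFileCheck {

    /** Returns -1 if the file decompresses cleanly, otherwise the number of
     *  good decompressed bytes read before the stream was cut short. */
    public static long scan(Path file, Configuration conf) throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        InputStream in = fs.open(file);
        if (codec != null) {
            in = codec.createInputStream(in);    // gzip, lzo, ... picked by extension
        }
        long good = 0;
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                good += n;                       // bytes that decompressed cleanly
            }
            return -1;                           // reached a proper EOF: file is fine
        } catch (EOFException e) {
            // Truncated stream, e.g. a gzip/lzo footer that was never written.
            System.err.println(file + " is corrupt after " + good + " readable bytes");
            return good;
        } finally {
            in.close();
        }
    }
}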