I do it pre-Pig. I think this would have to be handled at the RecordReader level if you wanted to do it in the framework.
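For reference, here is a minimal sketch of one way to swallow the EOFException inside a custom LoadFunc's getNext(), which is where it surfaces in the stack trace below. The class name, the plain-text loading, and the choice to treat the exception as end-of-split are illustrative assumptions, not the loader discussed in this thread:

    import java.io.EOFException;
    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class SkippingTextLoader extends LoadFunc {
        private static final Log LOG = LogFactory.getLog(SkippingTextLoader.class);
        private final TupleFactory tupleFactory = TupleFactory.getInstance();
        private RecordReader reader;

        @Override
        public InputFormat getInputFormat() throws IOException {
            return new TextInputFormat();
        }

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;  // normal end of split
                }
                Text line = (Text) reader.getCurrentValue();
                return tupleFactory.newTuple(line.toString());
            } catch (EOFException e) {
                // Truncated/corrupt compressed file: log it and pretend the split
                // is exhausted so the task finishes instead of failing the job.
                LOG.warn("Skipping rest of corrupt input split", e);
                return null;
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }

If you go this route you probably also want to bump a counter or log the file name when you hit the catch block, so skipped data doesn't go unnoticed (the silent-skipping concern raised further down the thread).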
Hey, want to contribute to the error handling design discussion? :) We haven't
thought about LoadFuncs yet..
http://wiki.apache.org/pig/PigErrorHandlingInScripts

On Tue, Jan 25, 2011 at 4:51 PM, Kim Vogt <k...@simplegeo.com> wrote:
> Do you catch the error when you load with pig, or is that a pre-pig step?
> If I wanted to catch the error in a pig load, is it possible? Where would
> that code go?
>
> -Kim
>
> On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
> > Yeah so the unexpected EOF is the most common one we get (lzo requires a
> > footer, and sometimes filehandles are closed before a footer is written,
> > if the network hiccups or something).
> >
> > Right now what we do is scan before moving to the DW, and if not
> > successful, extract what's extractable, catch the error, log how much
> > data is lost (what's left to read), and expose stats about this sort of
> > thing to monitoring software so we can alert if stuff gets out of hand.
> >
> > D
> >
> > On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <k...@simplegeo.com> wrote:
> >
> > > This is the error I'm getting:
> > >
> > > java.io.EOFException: Unexpected end of input stream
> > >     at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> > >     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> > >     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> > >     at java.io.InputStream.read(InputStream.java:85)
> > >     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > >     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> > >     at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> > >     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> > >     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:211)
> > >
> > > I'm trying to drill down on what files it's bonking on, but I believe
> > > the data is corrupt from when flume and amazon hated me and died.
> > >
> > > Maybe I can skip bad files and just log their names somewhere, and we
> > > probably should add some correctness tests :-)
> > >
> > > -Kim
> > >
> > > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > >
> > > > How badly compressed are they? Problems in the codec, or in the data
> > > > that comes out of the codec?
> > > >
> > > > We've had some lzo corruption problems, and so far have simply been
> > > > dealing with that by doing correctness tests in our log mover
> > > > pipeline before moving into the "data warehouse" area.
> > > >
> > > > Skipping bad files silently seems like asking for trouble (at some
> > > > point the problem quietly grows and you wind up skipping most of
> > > > your data), so I've been avoiding putting something like that in so
> > > > that when things are badly broken, we get some early pain rather
> > > > than lots of late pain.
> > > >
> > > > D
> > > >
> > > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm processing gzipped compressed files in a directory, but some
> > > > > files are corrupted and can't be decompressed. Is there a way to
> > > > > skip the bad files with a custom load func?
> > > > >
> > > > > -Kim