sure :-)

On Tue, Jan 25, 2011 at 5:54 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> I do it pre-pig.
> I think this has to be handled at the RecordReader level if you wanted to
> do it in the framework.
>
> Hey, want to contribute to the error handling design discussion? :) We
> haven't thought about LoadFuncs yet...
>
> http://wiki.apache.org/pig/PigErrorHandlingInScripts
>
>
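
For reference, here is a rough sketch of what that RecordReader-level approach could look like: a wrapper around Hadoop's LineRecordReader that treats an unexpected EOF in the compressed stream as the end of the split instead of a task failure. The class name and the truncated flag are made up for illustration, and an InputFormat would still need to be wired up to return this reader.

import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative only: wraps LineRecordReader and treats a truncated
// compressed stream as the end of the split instead of a task failure.
public class SkipTruncatedLineRecordReader extends RecordReader<LongWritable, Text> {

  private final LineRecordReader delegate = new LineRecordReader();
  private boolean truncated = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    delegate.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (truncated) {
      return false;
    }
    try {
      return delegate.nextKeyValue();
    } catch (EOFException e) {
      // "Unexpected end of input stream": the gzip/lzo footer is missing.
      // Stop reading this split; the caller should log/count the loss.
      truncated = true;
      return false;
    }
  }

  @Override
  public LongWritable getCurrentKey() {
    return delegate.getCurrentKey();
  }

  @Override
  public Text getCurrentValue() {
    return delegate.getCurrentValue();
  }

  @Override
  public float getProgress() throws IOException {
    return delegate.getProgress();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}
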
> On Tue, Jan 25, 2011 at 4:51 PM, Kim Vogt <k...@simplegeo.com> wrote:
>
> > Do you catch the error when you load with pig, or is that a pre-pig step?
> > If I wanted to catch the error in a pig load, is it possible?  Where would
> > that code go?
> >
> > -Kim
> >
> > On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dvrya...@gmail.com>
> > wrote:
> >
> > > Yeah so the unexpected EOF is the most common one we get (lzo requires a
> > > footer, and sometimes filehandles are closed before a footer is written,
> > > if the network hiccups or something).
> > >
> > > Right now what we do is scan before moving to the DW, and if not
> > > successful,
> > > extract what's extractable, catch the error, log how much data is lost
> > > (what's left to read), and expose stats about this sort of thing to
> > > monitoring software so we can alert if stuff gets out of hand.
> > >
> > > D
> > >
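
To make the scan step described above concrete, here is a minimal sketch of a pre-load check, assuming nothing more than Hadoop's FileSystem API: walk a directory, read each gzipped file to the end, and report how many lines were readable before the stream died. The GzipScan name and the tab-separated output are made up; in a real pipeline those numbers would be pushed to whatever monitoring system does the alerting.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipScan {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(new Path(args[0]))) {
      Path file = status.getPath();
      long lines = 0;
      String problem = null;
      try (BufferedReader reader = new BufferedReader(new InputStreamReader(
          new GZIPInputStream(fs.open(file)), StandardCharsets.UTF_8))) {
        while (reader.readLine() != null) {
          lines++;
        }
      } catch (IOException e) {
        // "Unexpected end of input stream" (EOFException) means the file was
        // truncated before its gzip footer was written; anything else is some
        // other form of corruption.
        problem = e.getClass().getSimpleName() + ": " + e.getMessage();
      }
      // A real pipeline would feed these counts into monitoring/alerting
      // instead of just printing them.
      System.out.println(file + "\t" + lines + " readable lines"
          + (problem == null ? "" : "\tBAD: " + problem));
    }
  }
}
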
> > > On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <k...@simplegeo.com> wrote:
> > >
> > > > This is the error I'm getting:
> > > >
> > > > java.io.EOFException: Unexpected end of input stream
> > > >        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> > > >        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> > > >        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> > > >        at java.io.InputStream.read(InputStream.java:85)
> > > >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > > >        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> > > >        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> > > >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> > > >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> > > >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> > > >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> > > >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> > > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> > > >        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> > > >        at java.security.AccessController.doPrivileged(Native Method)
> > > >        at javax.security.auth.Subject.doAs(Subject.java:396)
> > > >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> > > >        at org.apache.hadoop.mapred.Child.main(Child.java:211)
> > > >
> > > > I'm trying to drill down on what files it's bonking on, but I believe
> > > > the data is corrupt from when flume and amazon hated me and died.
> > > >
> > > > Maybe I can skip bad files and just log their names somewhere, and we
> > > > probably should add some correctness tests :-)
> > > >
> > > > -Kim
> > > >
> > > >
> > > > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com>
> > > > wrote:
> > > >
> > > > > How badly compressed are they? Problems in the codec, or in the data
> > > > > that comes out of the codec?
> > > > >
> > > > > We've had some lzo corruption problems, and so far have simply been
> > > > > dealing with that by doing correctness tests in our log mover
> > > > > pipeline before moving into the "data warehouse" area.
> > > > >
> > > > > Skipping bad files silently seems like asking for trouble (at some
> > > > > point the problem quietly grows and you wind up skipping most of
> > > > > your data), so I've been avoiding putting something like that in so
> > > > > that when things are badly broken, we get some early pain rather
> > > > > than lots of late pain.
> > > > >
> > > > > D
> > > > >
> > > > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm processing gzipped compressed files in a directory, but some
> > > > > > files are corrupted and can't be decompressed.  Is there a way to
> > > > > > skip the bad files with a custom load func?
> > > > > >
> > > > > > -Kim
> > > > > >
> > > > >
> > > >
> > >
> >
>
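
Coming back to the original question at the bottom of the thread, here is a minimal sketch of a custom LoadFunc that loads text lines (TextInputFormat decompresses .gz files transparently through the configured codecs) and gives up on the remainder of a split when it hits an unexpected EOF, rather than failing the task. The class and field names are hypothetical, and per the caveat above about silent skipping, the catch block should at least log or count what was dropped.

import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical loader: emits one single-field tuple per line and gives up on
// a split (instead of failing) when the underlying stream is truncated.
public class SkippingTextLoader extends LoadFunc {

  private RecordReader reader;
  private boolean giveUpOnSplit = false;

  @Override
  public InputFormat getInputFormat() throws IOException {
    // TextInputFormat handles .gz input through the configured codecs.
    return new TextInputFormat();
  }

  @Override
  public void setLocation(String location, Job job) throws IOException {
    FileInputFormat.setInputPaths(job, location);
  }

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    if (giveUpOnSplit) {
      return null;
    }
    try {
      if (!reader.nextKeyValue()) {
        return null;                              // normal end of split
      }
      Text line = (Text) reader.getCurrentValue();
      return TupleFactory.getInstance().newTuple(line.toString());
    } catch (EOFException e) {
      // Truncated compressed file: skip the rest of this split. Log or bump
      // a counter here so the data loss doesn't go unnoticed.
      giveUpOnSplit = true;
      return null;
    } catch (InterruptedException e) {
      throw new IOException("interrupted while reading", e);
    }
  }
}
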
