I do it pre-Pig. I think this would have to be handled at the RecordReader level if you wanted to do it in the framework.
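For reference, here is a minimal sketch of one way to swallow the EOFException inside a custom LoadFunc's getNext(), which is where it surfaces in the stack trace below. The class name, the plain-text loading, and the choice to treat the exception as end-of-split are illustrative assumptions, not the loader discussed in this thread:

    import java.io.EOFException;
    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class SkippingTextLoader extends LoadFunc {
        private static final Log LOG = LogFactory.getLog(SkippingTextLoader.class);
        private final TupleFactory tupleFactory = TupleFactory.getInstance();
        private RecordReader reader;

        @Override
        public InputFormat getInputFormat() throws IOException {
            return new TextInputFormat();
        }

        @Override
        public void setLocation(String location, Job job) throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;  // normal end of split
                }
                Text line = (Text) reader.getCurrentValue();
                return tupleFactory.newTuple(line.toString());
            } catch (EOFException e) {
                // Truncated/corrupt compressed file: log it and pretend the split
                // is exhausted so the task finishes instead of failing the job.
                LOG.warn("Skipping rest of corrupt input split", e);
                return null;
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }

If you go this route you probably also want to bump a counter or log the file name when you hit the catch block, so skipped data doesn't go unnoticed (the silent-skipping concern raised further down the thread).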
Hey, want to contribute to the error handling design discussion? :) We haven't
thought about LoadFuncs yet..
http://wiki.apache.org/pig/PigErrorHandlingInScripts

On Tue, Jan 25, 2011 at 4:51 PM, Kim Vogt <k...@simplegeo.com> wrote:
> Do you catch the error when you load with pig, or is that a pre-pig step?
> If I wanted to catch the error in a pig load, is it possible? Where would
> that code go?
>
> -Kim
>
> On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
> > Yeah so the unexpected EOF is the most common one we get (lzo requires a
> > footer, and sometimes filehandles are closed before a footer is written,
> > if the network hiccups or something).
> >
> > Right now what we do is scan before moving to the DW, and if not
> > successful, extract what's extractable, catch the error, log how much
> > data is lost (what's left to read), and expose stats about this sort of
> > thing to monitoring software so we can alert if stuff gets out of hand.
> >
> > D
> >
> > On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <k...@simplegeo.com> wrote:
> >
> > > This is the error I'm getting:
> > >
> > > java.io.EOFException: Unexpected end of input stream
> > >     at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> > >     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> > >     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> > >     at java.io.InputStream.read(InputStream.java:85)
> > >     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > >     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> > >     at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> > >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> > >     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> > >     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:211)
> > >
> > > I'm trying to drill down on what files it's bonking on, but I believe
> > > the data is corrupt from when flume and amazon hated me and died.
> > >
> > > Maybe I can skip bad files and just log their names somewhere, and we
> > > probably should add some correctness tests :-)
> > >
> > > -Kim
> > >
> > > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > >
> > > > How badly compressed are they? Problems in the codec, or in the data
> > > > that comes out of the codec?
> > > >
> > > > We've had some lzo corruption problems, and so far have simply been
> > > > dealing with that by doing correctness tests in our log mover
> > > > pipeline before moving into the "data warehouse" area.
> > > >
> > > > Skipping bad files silently seems like asking for trouble (at some
> > > > point the problem quietly grows and you wind up skipping most of
> > > > your data), so I've been avoiding putting something like that in so
> > > > that when things are badly broken, we get some early pain rather
> > > > than lots of late pain.
> > > >
> > > > D
> > > >
> > > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm processing gzipped compressed files in a directory, but some
> > > > > files are corrupted and can't be decompressed. Is there a way to
> > > > > skip the bad files with a custom load func?
> > > > >
> > > > > -Kim