Do you catch the error when you load with pig, or is that a pre-pig step? If I wanted to catch the error in a pig load, is it possible? Where would that code go?
-Kim

On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> Yeah, so the unexpected EOF is the most common one we get (lzo requires a
> footer, and sometimes filehandles are closed before a footer is written, if
> the network hiccups or something).
>
> Right now what we do is scan before moving to the DW, and if not successful,
> extract what's extractable, catch the error, log how much data is lost
> (what's left to read), and expose stats about this sort of thing to
> monitoring software so we can alert if stuff gets out of hand.
>
> D
>
> On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <k...@simplegeo.com> wrote:
>
> > This is the error I'm getting:
> >
> > java.io.EOFException: Unexpected end of input stream
> >     at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> >     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> >     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> >     at java.io.InputStream.read(InputStream.java:85)
> >     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> >     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> >     at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> >     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> >     at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> >     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:396)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:211)
> >
> > I'm trying to drill down on which files it's bonking on, but I believe the
> > data is corrupt from when flume and amazon hated me and died.
> >
> > Maybe I can skip bad files and just log their names somewhere, and we
> > probably should add some correctness tests :-)
> >
> > -Kim
> >
> > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> >
> > > How badly corrupted are they? Problems in the codec, or in the data that
> > > comes out of the codec?
> > >
> > > We've had some lzo corruption problems, and so far have simply been dealing
> > > with that by doing correctness tests in our log mover pipeline before
> > > moving into the "data warehouse" area.
> > >
> > > Skipping bad files silently seems like asking for trouble (at some point
> > > the problem quietly grows and you wind up skipping most of your data), so
> > > I've been avoiding putting something like that in, so that when things are
> > > badly broken we get some early pain rather than lots of late pain.
> > >
> > > D
> > >
> > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <k...@simplegeo.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm processing gzip-compressed files in a directory, but some files are
> > > > corrupted and can't be decompressed. Is there a way to skip the bad files
> > > > with a custom load func?
> > > >
> > > > -Kim
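To make the question at the top concrete: the EOFException in the trace above surfaces inside the loader's getNext() (note the PigJsonLoader frame), so that is one place it could be caught. Below is a minimal, untested sketch of a custom LoadFunc that treats an unexpected EOF as "end of split", logs it, and moves on. The class name and the single-field tuple it emits are illustrative only, not SimpleGeo's actual PigJsonLoader.

import java.io.EOFException;
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical line-oriented loader that treats an unexpected EOF in a
// truncated/corrupt compressed file as "end of this split" instead of
// failing the whole task.
public class SkipCorruptLoader extends LoadFunc {
    private static final Log LOG = LogFactory.getLog(SkipCorruptLoader.class);
    private final TupleFactory tupleFactory = TupleFactory.getInstance();
    private RecordReader reader;

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, new Path(location));
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                      // normal end of split
            }
            Text line = (Text) reader.getCurrentValue();
            // Real parsing (e.g. JSON -> map) would go here; a one-field
            // tuple keeps the sketch short.
            return tupleFactory.newTuple(line.toString());
        } catch (EOFException e) {
            // Truncated gzip/lzo stream: log it and pretend the split is
            // exhausted so the job keeps going instead of dying here.
            LOG.warn("Skipping remainder of corrupt split", e);
            return null;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}

Since gzip (and non-indexed lzo) files aren't splittable, each corrupt file maps to a single split, so returning null here effectively drops the unreadable tail of that file. Per Dmitriy's caution about silent skipping, the warning log (ideally plus a counter) keeps the loss visible.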
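For the "scan before moving to the DW" step Dmitriy describes, a pre-promotion check could be as simple as reading each file through its Hadoop codec to EOF and reporting how much decompresses cleanly. This is a rough sketch under those assumptions, not their actual pipeline; the class and method names are made up.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical pre-promotion check: decompress a file end to end and report
// how much of it reads cleanly, so bad files never reach the warehouse
// directory and the amount of lost data can be fed to monitoring.
public class CompressedFileCheck {

    /** Returns -1 if the file decompresses cleanly, otherwise the number of
     *  good decompressed bytes read before the stream was cut short. */
    public static long scan(Path file, Configuration conf) throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        InputStream in = fs.open(file);
        if (codec != null) {
            in = codec.createInputStream(in);    // gzip, lzo, ... picked by extension
        }
        long good = 0;
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                good += n;                       // bytes that decompressed cleanly
            }
            return -1;                           // reached a proper EOF: file is fine
        } catch (EOFException e) {
            // Truncated stream, e.g. a gzip/lzo footer that was never written.
            System.err.println(file + " is corrupt after " + good + " readable bytes");
            return good;
        } finally {
            in.close();
        }
    }
}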