SequenceFile.Reader can't read gzip format compressed sequence file which produce by a mapreduce job without native compression library
---------------------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-6817
                 URL: https://issues.apache.org/jira/browse/HADOOP-6817
             Project: Hadoop Common
          Issue Type: Bug
          Components: io
    Affects Versions: 0.20.2
         Environment: Cluster: CentOS 5, jdk1.6.0_20
                      Client: Mac SnowLeopard, jdk1.6.0_20
            Reporter: Wenjun Huang

A Hadoop job outputs a gzip-compressed sequence file (either record-compressed or block-compressed). A client program uses SequenceFile.Reader to read this sequence file. On the client the native Hadoop library is not loaded, and opening the file fails with the following exception:

2090 [main] WARN org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2091 [main] INFO org.apache.hadoop.io.compress.CodecPool  - Got brand-new decompressor
Exception in thread "main" java.io.EOFException
	at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
	at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
	at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
	at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
	at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
	at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
	at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:170)
	at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:180)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
	at com.shiningware.intelligenceonline.taobao.mapreduce.HtmlContentSeqOutputView.main(HtmlContentSeqOutputView.java:28)

I studied the code in the org.apache.hadoop.io.SequenceFile.Reader.init method and found:

      // Initialize... *not* if this we are constructing a temporary Reader
      if (!tempReader) {
        valBuffer = new DataInputBuffer();
        if (decompress) {
          valDecompressor = CodecPool.getDecompressor(codec);
          valInFilter = codec.createInputStream(valBuffer, valDecompressor);
          valIn = new DataInputStream(valInFilter);
        } else {
          valIn = valBuffer;
        }

The problem seems to be caused by "valBuffer = new DataInputBuffer();". GzipCodec.createInputStream creates an instance of GzipInputStream, whose constructor creates an instance of the ResetableGZIPInputStream class. When ResetableGZIPInputStream's constructor calls the constructor of its base class, java.util.zip.GZIPInputStream, that constructor immediately tries to read the gzip header (see readHeader in the stack trace) from the still-empty valBuffer. It gets no content, so it throws an EOFException.
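
The suspected cause can be checked outside Hadoop. The following is a minimal sketch that uses only java.util.zip and no Hadoop classes; the class name EmptyGzipDemo and its helper methods are hypothetical and exist only for illustration. It assumes that an empty ByteArrayInputStream behaves like the freshly created, empty DataInputBuffer in the report, in that the first read reports end of stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class EmptyGzipDemo {

    /** Compresses the given bytes into a complete gzip stream. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(raw);
        }
        return bytes.toByteArray();
    }

    /**
     * Returns true if merely constructing a GZIPInputStream over the data
     * throws EOFException. The constructor reads the gzip header eagerly,
     * so no read() call is needed to trigger the failure.
     */
    static boolean failsWithEof(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return false; // header was read successfully
        } catch (EOFException e) {
            return true;  // stream ended before the header could be read
        }
    }

    public static void main(String[] args) throws IOException {
        // Empty source, standing in for the empty valBuffer.
        System.out.println("empty input fails: " + failsWithEof(new byte[0]));
        // A source that already holds a gzip stream.
        System.out.println("gzip input fails:  " + failsWithEof(gzip("hello".getBytes("UTF-8"))));
    }
}
```

If the sketch behaves as expected, the first case fails in the constructor and the second does not, which is consistent with the analysis above: the pure-Java gzip stream cannot be constructed over a buffer that has no data yet.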