Sebastian Nagel created HADOOP-15543:
----------------------------------------
             Summary: IndexOutOfBoundsException when reading bzip2-compressed SequenceFile
                 Key: HADOOP-15543
                 URL: https://issues.apache.org/jira/browse/HADOOP-15543
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 3.1.0
            Reporter: Sebastian Nagel

When reading a bzip2-compressed SequenceFile, Hadoop jobs fail with:
{noformat}
IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046)
{noformat}
The SequenceFile (669 MB) has been written with the properties
- mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
- mapreduce.output.fileoutputformat.compress.type=BLOCK
using the native bzip2 library on Hadoop CDH 5.14.2 (Ubuntu 16.04, libbz2-1.0 1.0.6-8).

The error was seen on two development systems (local mode, no native bzip2 lib configured/installed) and, so far, is reproducible with Hadoop 3.1.0 and CDH 5.14.2. The following Hadoop releases are not affected: 2.7.4, 3.0.2, CDH 5.14.0; the SequenceFile is read successfully when these Hadoop packages are used.

If required I can share the SequenceFile. It's a Nutch CrawlDb (contains [CrawlDatum|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java] objects).

Full stack trace as seen with 3.1.0:
{noformat}
2018-06-15 10:34:43,198 INFO mapreduce.Job - map 93% reduce 0%
2018-06-15 10:34:43,532 WARN mapred.LocalJobRunner - job_local543410164_0001
java.lang.Exception: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
Caused by: java.lang.IndexOutOfBoundsException: offs(477658) + len(477659) > dest.length(678046).
	at org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:398)
	at org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:496)
	at java.io.DataInputStream.readFully(DataInputStream.java:195)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at org.apache.hadoop.io.WritableUtils.readString(WritableUtils.java:125)
	at org.apache.hadoop.io.WritableUtils.readStringArray(WritableUtils.java:169)
	at org.apache.nutch.protocol.ProtocolStatus.readFields(ProtocolStatus.java:177)
	at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:188)
	at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:332)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
	at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
	at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2358)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:78)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:568)
	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
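A note on the reported values: the exception message shows len(477659) = offs(477658) + 1, i.e. the length passed to read() equals the offset plus one, which may hint at a caller confusing an end offset with a length. The check itself is a standard read-contract bounds validation. A minimal self-contained sketch reproducing the arithmetic of the reported failure (class and method names here are illustrative, not the actual CBZip2InputStream source):

```java
// Illustrative bounds check of the form seen in the report:
// a read(dest, offs, len) must satisfy offs + len <= dest.length.
public class BoundsCheckDemo {

    // Hypothetical helper mirroring the failing validation; returns 0
    // as a stand-in for the number of bytes a real read would produce.
    static int checkedRead(byte[] dest, int offs, int len) {
        if (offs < 0 || len < 0 || offs + len > dest.length) {
            throw new IndexOutOfBoundsException(
                "offs(" + offs + ") + len(" + len
                + ") > dest.length(" + dest.length + ").");
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] dest = new byte[678046]; // buffer size from the report

        // In bounds: 477658 + 200388 == 678046, so the check passes.
        checkedRead(dest, 477658, 200388);

        // The reported values: 477658 + 477659 = 955317 > 678046 -> throws.
        try {
            checkedRead(dest, 477658, 477659);
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running this prints the same message shape as the report, confirming the failure is purely the offs + len vs. buffer-length arithmetic rather than anything bzip2-specific at that point in the code.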