Hi all,
We are running the Yahoo! distribution of Hadoop (based on Hadoop
0.20.0-2787265) on a 10-node cluster running openSUSE Linux. For our
experiments we have HDFS configured with a block size of 5GB. We are facing
the following problems when we try to read data beyond 1GB from a 5GB input
split.

*1) Problem in DFSClient*
When reading the text data, the map task got stuck after reading roughly the
first 1GB of the split: it could not read any more data via
IOUtils.readFully(), and after 10 minutes the task timed out. We then tried to
manually skip the first 2GB of the 5GB split and start reading from the 2GB
offset. The skip itself succeeded, but reading the first line after skipping
2GB threw the following exception:

java.lang.IndexOutOfBoundsException
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:151)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1118)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1666)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1716)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
        at edu.unisb.cs.mapreduce.binary.load.LoaderRecordReader.<init>(LoaderRecordReader.java:101)
        at edu.unisb.cs.mapreduce.binary.load.LoaderInputFormat.getRecordReader(LoaderInputFormat.java:45)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:337)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:306)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

This was particularly strange. We looked at line 151 of FSInputChecker, where
the exception is thrown, and it looks like this:

public synchronized int read(byte[] b, int off, int len) throws IOException
{
    if ((off | len | (off + len) | (b.length - (off + len))) < 0) {
      throw new IndexOutOfBoundsException();
    }
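
For reference, that bitwise-OR expression is just a compact way of checking
that none of off, len, off + len, or b.length - (off + len) is negative; a
spelled-out equivalent (our reading of the check, not actual Hadoop code)
would be:

    // The OR of the four int values is negative exactly when at least one
    // of them has its sign bit set, i.e. is negative.
    if (off < 0 || len < 0 || off + len < 0 || b.length - (off + len) < 0) {
      throw new IndexOutOfBoundsException();
    }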

It looks like the parameter "off" or "len" became negative. We traced back up
the stack and found the following interesting code at lines 1715 and 1716 of
DFSClient:

            int realLen = Math.min(len, (int) (blockEnd - pos + 1));
            int result = readBuffer(buf, off, realLen);

The first line casts (blockEnd - pos + 1) to int, where blockEnd is the
absolute end position of the block (split) in the file and pos is the current
position, both of type long. With blocks larger than 2GB this cast can
overflow and make realLen negative, which then causes the
IndexOutOfBoundsException.

We fixed this problem by changing DFSClient.java:1715 to

int realLen = (int) Math.min(len, (blockEnd - pos + 1));
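
To illustrate the overflow with numbers matching our setup (a 5GB block and a
2GB skip; the values below are only an example, not taken from the code):

long blockEnd = 5L * 1024 * 1024 * 1024 - 1;  // last byte of a 5GB block
long pos = 2L * 1024 * 1024 * 1024;           // current position: 2GB offset
int len = 65536;                              // requested read length

// Original line 1715: the cast happens before Math.min, so the remaining
// 3GB (which is between 2GB and 4GB) wraps around to -1073741824.
int realLenOld = Math.min(len, (int) (blockEnd - pos + 1));  // negative

// Fixed line: take the min over longs first, then cast; the result is at
// most len, which always fits in an int.
int realLenNew = (int) Math.min(len, (blockEnd - pos + 1));  // 65536

This also matches what we observe: the cast goes negative exactly when between
2GB and 4GB of the block remain, which is the region we hit after skipping 2GB.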

*2) Problem with IOUtils.skipFully()*

Fixing the long-to-int cast revealed a new problem: we are still not able to
read more than 1GB of data. When we used IOUtils.skipFully() to skip 1GB and
then tried to read from that position, we got the following exception:

java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1218)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1117)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1665)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1720)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
        at edu.unisb.cs.mapreduce.binary.load.LoaderRecordReader.<init>(LoaderRecordReader.java:107)
        at edu.unisb.cs.mapreduce.binary.load.LoaderInputFormat.getRecordReader(LoaderInputFormat.java:45)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
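
For context, here is a rough sketch of what our record reader does when it
opens a split; class and variable names are illustrative, not the actual
LoaderRecordReader code. Problem 2 hits the IOUtils.skipFully() call, and the
workaround described in problem 3 below replaces it with a seek() just before
each read:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;

// Simplified sketch of the split-positioning logic (illustrative only).
public class SplitPositioningSketch {
  public static FSDataInputStream openAtSplitStart(FileSplit split, JobConf job)
      throws IOException {
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(job);
    FSDataInputStream in = fs.open(path);

    // Variant from problem 2: skip forward from position 0. The skip itself
    // succeeds, but subsequent reads fail for us beyond ~1GB.
    IOUtils.skipFully(in, split.getStart());

    // Workaround from problem 3: seek directly to the offset instead.
    // in.seek(split.getStart());

    // Records are then read with IOUtils.readFully(), e.g. a fixed-size header.
    byte[] header = new byte[8];  // header size is a placeholder
    IOUtils.readFully(in, header, 0, header.length);
    return in;
  }
}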

*3) java.net.SocketTimeoutException*

Since there seemed to be a problem with IOUtils.skipFully(), we switched to
seeking just before reading each record. We no longer get the EOFException,
but we now get the following exception after reading approximately
1810517942 bytes (around 1.7GB):

java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/134.96.223.140:48255 remote=/134.96.223.140:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.readChunk(DFSClient.java:1218)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1117)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1665)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1720)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
        at edu.unisb.cs.mapreduce.binary.load.LoaderRecordReader.next(LoaderRecordReader.java:163)
        at edu.unisb.cs.mapreduce.binary.load.LoaderRecordReader.next(LoaderRecordReader.java:1)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Steps to reproduce:

   1. Configure HDFS with a 5GB block size (see the sketch below this list)
   2. Load more than 5GB of text data into HDFS
   3. Run Grep or wordcount on the data
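
For step 1, here is a minimal sketch of how the 5GB block size can be set,
assuming the standard 0.20 dfs.block.size property; the file path, buffer
size, and replication below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FiveGBBlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide default, normally set as dfs.block.size in hdfs-site.xml.
    conf.setLong("dfs.block.size", 5L * 1024 * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    // Alternatively, pass the block size explicitly when creating a file.
    FSDataOutputStream out = fs.create(
        new Path("/experiments/input.txt"),  // placeholder path
        true,                                // overwrite
        64 * 1024,                           // buffer size (placeholder)
        (short) 3,                           // replication (placeholder)
        5L * 1024 * 1024 * 1024);            // 5GB block size
    out.close();
  }
}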

We have the following questions:

   1. Does anybody know the maximum block size and split size supported by
   HDFS?
   2. Did any of you guys try something similar?
   3. Did any of you guys run into similar problems? If so, could you please
   tell us how you resolved them?

Thank you,
Vinay Setty
