Re: Interesting Hadoop/FUSE-DFS access patterns

Brian Bockelman Thu, 16 Apr 2009 04:54:12 -0700

Hey Tom,

Yup, that's one of the things I've been looking at - however, it doesn't appear to be the likely culprit as to why data access is fairly random. The time the operation took does not seem to be a factor of the number of bytes read, at least in the smaller range.


Brian

On Apr 16, 2009, at 5:17 AM, Tom White wrote:

Not sure if this will affect your findings, but when you read from a
FSDataInputStream you should see how many bytes were actually read by
inspecting the return value and re-read if it was fewer than you want.
See Hadoop's IOUtils readFully() method.
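
The re-read loop Tom describes can be sketched roughly as follows. This is a minimal illustration on a plain java.io.InputStream, not Hadoop's actual IOUtils code; the class name ReadFullySketch and the shortReads() helper (a stream that deliberately returns at most 3 bytes per call) are made up for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullySketch {
    /** Keeps calling read() until len bytes have arrived or the stream ends. */
    public static void readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int done = 0;
        while (done < len) {
            // read() may legally return fewer bytes than requested
            int n = in.read(buf, off + done, len - done);
            if (n < 0) {
                throw new EOFException("stream ended after " + done + " of " + len + " bytes");
            }
            done += n;
        }
    }

    /** Hypothetical stream that hands back at most 3 bytes per read() call. */
    public static InputStream shortReads(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 3));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;

        // A single read() call comes back short on this stream.
        int n = shortReads(data).read(new byte[10], 0, 10);
        System.out.println("single read() returned " + n + " bytes");

        // The loop keeps going until the buffer is full.
        byte[] all = new byte[10];
        readFully(shortReads(data), all, 0, all.length);
        System.out.println("readFully filled " + all.length + " bytes");
    }
}
```

In a timing test, a short read that goes unnoticed would make larger read sizes look cheaper than they are, which is why the return value matters here.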

Tom
On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman <[email protected]> wrote:
Hey Todd,
Been playing more this morning after thinking about it for the night -- I think the culprit is not the network, but actually the cache. Here's the output of your script adjusted to do the same calls as I was doing (you had left out the random I/O part).

[br...@red tmp]$ java hdfs_tester
Mean value for reads of size 0: 0.0447
Mean value for reads of size 16384: 10.4872
Mean value for reads of size 32768: 10.82925
Mean value for reads of size 49152: 6.2417
Mean value for reads of size 65536: 7.0511003
Mean value for reads of size 81920: 9.411599
Mean value for reads of size 98304: 9.378799
Mean value for reads of size 114688: 8.99065
Mean value for reads of size 131072: 5.1378503
Mean value for reads of size 147456: 6.1324
Mean value for reads of size 163840: 17.1187
Mean value for reads of size 180224: 6.5492
Mean value for reads of size 196608: 8.45695
Mean value for reads of size 212992: 7.4292
Mean value for reads of size 229376: 10.7843
Mean value for reads of size 245760: 9.29095
Mean value for reads of size 262144: 6.57865

Copy of the script below.

So, without the FUSE layer, we don't see much (if any) patterns here. The overhead of randomly skipping through the file is higher than the overhead of reading out the data.
Upon further inspection, the biggest factor affecting the FUSE layer is actually the Linux VFS caching -- if you notice, the bandwidth in the given graph for larger read sizes is *higher* than 1Gbps, which is the limit of the network on that particular node. If I go in the opposite direction - starting with the largest reads first, then going down to the smallest reads - the graph entirely smooths out for the small values - everything is read from the filesystem cache in the client RAM.  Graph attached.
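
The "faster than the wire" check above is simple arithmetic: bits read divided by elapsed time, compared against the link speed. A rough sketch follows; the class and method names are hypothetical, and the sample numbers in main() are made up for illustration rather than taken from the graph.

```java
public class ThroughputCheck {
    /** Observed throughput in gigabits per second for one timed read. */
    public static double gbps(long bytes, long elapsedNanos) {
        // bits per nanosecond is numerically the same as gigabits per second
        return (bytes * 8.0) / elapsedNanos;
    }

    /** True when a read finished faster than the link alone could have delivered it. */
    public static boolean fasterThanLink(long bytes, long elapsedNanos, double linkGbps) {
        return gbps(bytes, elapsedNanos) > linkGbps;
    }

    public static void main(String[] args) {
        double linkGbps = 1.0;           // link speed stated in the message
        long bytes = 1024 * 1024;        // illustrative 1MB read (made-up sample)
        long elapsedNanos = 2_000_000L;  // illustrative 2 ms (made-up sample)
        System.out.println("throughput (Gbps): " + gbps(bytes, elapsedNanos));
        System.out.println("faster than link:  " + fasterThanLink(bytes, elapsedNanos, linkGbps));
    }
}
```

Any read that comes out above the link speed cannot have been served entirely over the network during the timed call, which points to the client-side cache.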
So, on the upside, mounting through FUSE gives us the opportunity to speed up reads for very complex, non-sequential patterns - for free, thanks to the hardworking Linux kernel. On the downside, it's incredibly difficult to come up with simple cases to demonstrate performance for an application -- the cache performance and size depend on how much activity there's on the client, the previous file system activity that the application did, and the amount of concurrent activity on the server. I can give you results for performance, but it's not going to be the performance you see in real life. (Gee, if only file systems were easy...)

Ok, sorry for the list noise -- it seems I'm going to have to think more about this problem before I can come up with something coherent.

Brian





import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.conf.Configuration;
import java.net.URI;
import java.util.Random;

public class hdfs_tester {
  public static void main(String[] args) throws Exception {
    URI uri = new URI("hdfs://hadoop-name:9000/");
    FileSystem fs = FileSystem.get(uri, new Configuration());
    Path path = new Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
    FSDataInputStream dis = fs.open(path);
    Random rand = new Random();
    FileStatus status = fs.getFileStatus(path);
    long file_len = status.getLen();
    int iters = 20;
    // Step the read size from 0 up to (but not including) 1MB in 16KB increments.
    for (int size = 0; size < 1024*1024; size += 4*4096) {
      long csum = 0;  // total nanoseconds spent in read() for this size
      for (int i = 0; i < iters; i++) {
        // Random 8-byte-aligned offset that leaves room for the whole read.
        // Computed as a long so the multiplication cannot overflow.
        long pos = (long) rand.nextInt((int)((file_len-size-1)/8)) * 8;
        byte buf[] = new byte[size];
        long st = System.nanoTime();
        // Positioned read; it may return fewer than size bytes, and the
        // return value is not checked here.
        dis.read(pos, buf, 0, size);
        long et = System.nanoTime();
        csum += et-st;
        //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos) + "\t" + String.valueOf(et - st));
      }
      float csum2 = csum; csum2 /= iters;
      // Mean time per read, converted from nanoseconds to milliseconds.
      System.out.println("Mean value for reads of size " + size + ": " + (csum2/1000/1000));
    }
    dis.close();
    fs.close();
  }
}


On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:
On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <[email protected]> wrote:
Hey Brian,

This is really interesting stuff. I'm curious - have you tried these same experiments using the Java API? I'm wondering whether this is FUSE-specific or inherent to all HDFS reads. I'll try to reproduce this over here as well.

This smells sort of Nagle-related to me... if you get a chance, you may want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024, and see if the magic number jumps up to 256KB. If so, I think it should be a pretty easy bugfix.

Oops - spoke too fast there... looks like TCP_WINDOW_SIZE isn't actually used for any socket configuration, so I don't think that will make a difference... still think networking might be the culprit, though.
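
For readers unfamiliar with the knobs being discussed, here is roughly what "socket configuration" looks like with plain java.net. This is only an illustration: it is not how DFSClient sets up its connections (as noted above, TCP_WINDOW_SIZE is not applied to any socket there), and the class and method names are made up.

```java
import java.io.IOException;
import java.net.Socket;

public class SocketTuningSketch {
    /** Creates an unconnected socket with Nagle's algorithm off and a buffer size hint. */
    public static Socket newTunedSocket(int receiveBufferBytes) throws IOException {
        Socket s = new Socket();                     // unconnected; options go on before connect()
        s.setTcpNoDelay(true);                       // TCP_NODELAY disables Nagle's algorithm
        s.setReceiveBufferSize(receiveBufferBytes);  // a hint only; the OS may choose another size
        return s;
    }

    public static void main(String[] args) throws IOException {
        try (Socket s = newTunedSocket(256 * 1024)) {
            System.out.println("TCP_NODELAY: " + s.getTcpNoDelay());
            System.out.println("SO_RCVBUF:   " + s.getReceiveBufferSize());
        }
    }
}
```

The receive buffer value printed may differ from the one requested, since the operating system treats it as a suggestion.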

-Todd
On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman <[email protected]> wrote:
Ok, here's something perhaps even more strange. I removed the "seek" part out of my timings, so I was only timing the "read" instead of the "seek + read" as in the first case. I also turned the read-ahead down to 1-byte (aka, off).

The jump *always* occurs at 128KB, exactly.

I'm a bit befuddled. I know we say that HDFS is optimized for large, sequential reads, not random reads - but it seems that it's one bug-fix away from being a good general-purpose system. Heck if I can find what's causing the issues though...

Brian





On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:

Hey all,
I was doing some research on I/O patterns of our applications, and I noticed the attached pattern. In case the mail server strips out attachments, I also uploaded it:

http://t2.unl.edu/store/Hadoop_64KB_ra.png
http://t2.unl.edu/store/Hadoop_1024KB_ra.png
This was taken using the FUSE mounts of Hadoop; the first one was with a 64KB read-ahead and the second with a 1MB read-ahead. This was taken from a 2GB file and randomly 'seek'ed in the file. This was performed 20 times for each read size, advancing in 4KB increments. Each blue dot is the read time of one experiment; the red dot is the median read time for the read size. The graphs show the absolute read time.
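
The graphs summarize each read size by its median, while the Java test script elsewhere in this thread prints a mean. With only 20 samples per size, a single slow read can pull the mean well away from the median. A small sketch of both statistics follows; the class name and the sample values are made up for illustration and are not measurements from this thread.

```java
import java.util.Arrays;

public class ReadTimeStats {
    /** Arithmetic mean of the samples. */
    public static double mean(long[] samples) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        double sum = 0;
        for (long s : samples) sum += s;
        return sum / samples.length;
    }

    /** Median of the samples; averages the two middle values for even counts. */
    public static double median(long[] samples) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        long[] sorted = samples.clone();  // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) return sorted[mid];
        return (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up read times in milliseconds, with one slow outlier.
        long[] readTimesMs = {5, 6, 5, 7, 6, 200};
        System.out.println("mean:   " + mean(readTimesMs));
        System.out.println("median: " + median(readTimesMs));
    }
}
```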
There's very interesting behavior - it seems that there is a change in behavior around reads of size of 800KB. The time for the reads goes down significantly when you read *larger* files. I thought this was just an artifact of the 64KB read-ahead I set in FUSE, so I upped the read-ahead significantly, to 1MB. In this case, the difference between the small read sizes and large read sizes is *very* pronounced. If it was an artifact from FUSE, I'd expect the place where the change occurred would be a function of the readahead-size.
Anyone out there who knows the code have any ideas? What could I be doing wrong?

Brian

<Hadoop_64KB_ra.png>

<Hadoop_1024KB_ra.png>
