Oh I agree, caching is wonderful when you plan to reuse the data in the
near term.

Solaris has an interesting feature: if the application writes enough
contiguous data within a short time window (tunable in later Nevada
builds), Solaris bypasses the buffer cache for those writes.

For reasons I have never had time to look into, there is a significant
impact on overall system responsiveness when there is heavy cache store
activity going on, and there are access patterns that work in the general
case but fail in others. Take the tar example from earlier: my theory is
that the blocks written to the tar file take priority over the read-ahead,
so the next files to be read into the tar archive are not pre-cached.
Flushing the cache for the tar file allows the read-aheads to proceed.
The other nice thing that happens is that the dirty pool tends not to grow
to the point where the periodic sync operations pause the system.

We had an interesting problem with Solaris under VMware some years back,
where we were running IMAP servers as part of JES to test a middleware
mail application. The IMAP writes would accumulate in the buffer cache,
performance would be wonderful, and the middleware performance was great;
then the must-flush-now threshold would be crossed, and it would take 2
minutes to flush all of the accumulated writes out, with the middleware
app blocked waiting for that to finish. In the end, as a quick hack, we
ran *while true; do sync; sleep 30; done*, which prevented the stalls
because it kept each flush short. The flushes totally fill the disk
queues and will cause starvation for other apps.

I believe this is part of the block report stall problem in 4584.

On Tue, Apr 14, 2009 at 4:52 AM, Brian Bockelman <[email protected]> wrote:

> Hey Jason,
>
> Thanks, I'll keep this on hand as I do more tests.  I now have a C, Java,
> and python version of my testing program ;)
>
> However, I particularly *like* the fact that there's caching going on -
> it'll help out our application immensely, I think.  I'll be looking at the
> performance both with and without the cache.
>
> Brian
>
>
> On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:
>
>  The following very simple program will tell the VM to drop the pages being
>> cached for a file. I tend to spin this in a for loop when making large tar
>> files, or otherwise working with large files, and the system performance
>> really smooths out.
>> Since it uses open(path), it will churn through the inode cache and
>> directories.
>> Something like this might actually speed up HDFS significantly on busy
>> clusters, if run over the blocks on the datanodes.
>>
>>
>> #define _XOPEN_SOURCE 600
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>>
>> /** Simple program to dump buffered data for specific files from the
>> buffer
>> cache. Copyright Jason Venner 2009, License GPL*/
>>
>> int main( int argc, char** argv )
>> {
>>  int failCount = 0;
>>  int i;
>>  for( i = 1; i < argc; i++ ) {
>>   char* file = argv[i];
>>   int fd = open( file, O_RDONLY|O_LARGEFILE );
>>   if (fd == -1) {
>>     perror( file );
>>     failCount++;
>>     continue;
>>   }
>>   /* posix_fadvise returns the error number directly rather than
>>      setting errno */
>>   int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
>>   if (rc != 0) {
>>     fprintf( stderr, "Failed to flush cache for %s: %s\n", file,
>> strerror( rc ) );
>>     failCount++;
>>   }
>>   close(fd);
>>  }
>>  exit( failCount );
>> }
>>
>>
>> On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey <[email protected]> wrote:
>>
>>
>>> On 4/12/09 9:41 PM, "Brian Bockelman" <[email protected]> wrote:
>>>
>>>  Ok, here's something perhaps even more strange.  I removed the "seek"
>>>> part out of my timings, so I was only timing the "read" instead of the
>>>> "seek + read" as in the first case.  I also turned the read-ahead down
>>>> to 1-byte (aka, off).
>>>>
>>>> The jump *always* occurs at 128KB, exactly.
>>>>
>>>
>>> Some random ideas:
>>>
>>> I have no idea how FUSE interops with the Linux block layer, but 128K
>>> happens to be the default 'readahead' value for block devices, which may
>>> just be a coincidence.
>>>
>>> For a disk 'sda', you check and set the value (in 512 byte blocks) with:
>>>
>>> /sbin/blockdev --getra /dev/sda
>>> /sbin/blockdev --setra [num blocks] /dev/sda
>>>
>>>
>>> I know from my file system tests that the OS readahead is not activated
>>> until a series of sequential reads goes through the block device, so
>>> truly random access is not affected by it.  I've set it to 128MB and
>>> random iops does not change on an ext3 or xfs file system.  If this
>>> applies to FUSE too, there may be reasons this behavior differs.
>>> Furthermore, even if it did apply, one would not expect randomly reading
>>> 4k to be slower than randomly reading up to the readahead size itself.
>>>
>>> I also have no idea how much of the OS device queue and block device
>>> scheduler is involved with FUSE.  If those are involved, then there's a
>>> bunch of stuff to tinker with there as well.
>>>
>>> Lastly, an FYI in case you don't already know: if the OS is caching
>>> pages, there is a way on Linux to flush them and evict the cache.
>>> See /proc/sys/vm/drop_caches .
>>>
>>>
>>>
>>>
>>>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>>>> sequential reads, not random reads - but it seems that it's one
>>>> bug-fix away from being a good general-purpose system.  Heck if I can
>>>> find what's causing the issues though...
>>>>
>>>> Brian
>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> Alpha Chapters of my book on Hadoop are available
>> http://www.apress.com/book/view/9781430219422
>>
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
