On Feb 7, 2011, at 17:08, Yi Zhang wrote:

> On Mon, Feb 7, 2011 at 10:26 AM, Roch <roch.bourbonn...@oracle.com> wrote:
>> 
>> On Feb 7, 2011, at 06:25, Richard Elling wrote:
>> 
>>> On Feb 5, 2011, at 8:10 AM, Yi Zhang wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> I'm trying to achieve the same effect of UFS directio on ZFS and here
>>>> is what I did:
>>> 
>>> Solaris UFS directio has three functions:
>>>       1. improved async code path
>>>       2. multiple concurrent writers
>>>       3. no buffering
>>> 
>>> Of the three, #1 and #2 were designed into ZFS from day 1, so there is
>>> nothing to set or change to take advantage of those features.
>>> 
>>>> 
>>>> 1. Set the primarycache of zfs to metadata and secondarycache to none,
>>>> recordsize to 8K (to match the unit size of writes)
>>>> 2. Run my test program (code below) with different options and measure
>>>> the running time.
>>>> a) open the file without O_DSYNC flag: 0.11s.
>>>> This doesn't seem like directio is in effect, because I tried on UFS
>>>> and time was 2s. So I went on with more experiments with the O_DSYNC
>>>> flag set. I know that directio and O_DSYNC are two different things,
>>>> but I thought the flag would force synchronous writes and achieve what
>>>> directio does (and more).
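For reference, step 1 above corresponds to settings like these (a sketch,
assuming a hypothetical dataset named tank/fs):

  zfs set recordsize=8k tank/fs
  zfs set primarycache=metadata tank/fs
  zfs set secondarycache=none tank/fs

Note that recordsize only affects files written after it is set.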
>>> 
>>> Directio and O_DSYNC are two different features.
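To make the distinction concrete, here is a minimal sketch of the two knobs
(Solaris headers assumed; the file paths are placeholders):

  #include <fcntl.h>   /* open(2), O_DSYNC; directio(3C) on Solaris */
  #include <unistd.h>  /* close(2) */

  int main(void)
  {
      /* directio(3C): an advisory "don't buffer" hint, honored by UFS only */
      int ufs_fd = open("/ufs/file", O_RDWR);
      directio(ufs_fd, DIRECTIO_ON);

      /* O_DSYNC: a durability guarantee; each write returns only after the
         data is on stable storage (works on both UFS and ZFS) */
      int zfs_fd = open("/zfs/file", O_RDWR | O_DSYNC);

      close(ufs_fd);
      close(zfs_fd);
      return 0;
  }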
>>> 
>>>> b) open the file with O_DSYNC flag: 147.26s
>>> 
>>> ouch
>> 
>> how big a file ?
>> Does the result hold if you don't truncate?

OK, if it had been a 2TB file, I could have seen an opening for an
explanation. Not for 80M, though. So it's baffling... unless!

It's not just the open that takes 147s; it's the whole run, 10000 writes.
10000 sync writes without an SSD would take about 147 seconds at 68 IOPS
(10000 / 68 ≈ 147s), which matches what you saw.
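That is exactly the case a separate SSD log device is meant for; a
hypothetical example (pool and device names made up):

  # put the ZIL on an SSD so sync writes stop waiting on the spinning disk
  zpool add tank log c0t5d0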

Without the O_DSYNC flag, all writes go to memory, so it's expected to take
0.11s to transfer 80000K at 750MB/sec (memcopy speed): 80MB / 750MB/s ≈ 0.11s.

O_DSYNC + zfs_nocacheflush is in between. Every write transfers data to an
unstable cache but then does not flush it.
At some point the cache may overflow, and some writes then see high latency
while data drains from the disk cache to the disk platter.
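For reference, that tunable goes in /etc/system (a sketch; it takes effect at
the next boot, and is only safe when the write caches are non-volatile):

  * tell ZFS not to send cache-flush requests to the devices
  set zfs:zfs_nocacheflush = 1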

So those results are in line with what everybody has been seeing before.

Note that to compare with UFS, since UFS does not issue a cache flush after
every sync write the way ZFS (correctly) does, you have to compare UFS with
the write cache disabled to ZFS (with or without write cache).

Also, after deleting a ZFS pool the disk write cache is left enabled, so a UFS
filesystem on that disk will appear inordinately fast unless you turn the
write cache off via format(1M): "format -e", then "cache", "write_cache",
"disable".

-r


>> 
>> -r
>> 
> The file is 8K*10000, about 80M. I removed the O_TRUNC flag and the
> results stayed the same...
> 
>>> 
>>>> c) same as b) but also enabled zfs_nocacheflush: 5.87s
>>> 
>>> Is your pool created from a single HDD?
>>> 
>>>> My questions are:
>>>> 1. With my primarycache and secondarycache settings, the FS shouldn't
>>>> buffer reads and writes anymore. Wouldn't that be equivalent to
>>>> O_DSYNC? Why are a) and b) so different?
>>> 
>>> No. O_DSYNC deals with when the I/O is committed to media.
>>> 
>>>> 2. My understanding is that zfs_nocacheflush essentially removes the
>>>> sync command sent to the device, which cancels the O_DSYNC flag. Why
>>>> are b) and c) so different?
>>> 
>>> No. Disabling the cache flush means that the volatile write buffer in the
>>> disk is not flushed. In other words, disabling the cache flush is in direct
>>> conflict with the semantics of O_DSYNC.
>>> 
>>>> 3. Does ZIL have anything to do with these results?
>>> 
>>> Yes. The ZIL is used for meeting the O_DSYNC requirements.  This has
>>> nothing to do with buffering. More details are on the ZFS Best Practices 
>>> Guide.
>>> -- richard
>>> 
>>>> 
>>>> Thanks in advance for any suggestion/insight!
>>>> Yi
>>>> 
>>>> 
>>>> #include <fcntl.h>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include <sys/time.h>
>>>> 
>>>> int main(int argc, char **argv)
>>>> {
>>>>  struct timeval tim;
>>>>  gettimeofday(&tim, NULL);
>>>>  double t1 = tim.tv_sec + tim.tv_usec/1000000.0;
>>>>  char a[8192];
>>>>  /* case a): no O_DSYNC; swap in the commented line for case b) */
>>>>  int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0660);
>>>>  //int fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC|O_DSYNC, 0660);
>>>>  if (argv[2][0] == '1')
>>>>      directio(fd, DIRECTIO_ON); /* UFS directio hint; not implemented by ZFS */
>>>>  int i;
>>>>  for (i = 0; i < 10000; ++i)    /* 10000 x 8K = ~80M written */
>>>>      pwrite(fd, a, sizeof(a), (off_t)i * 8192);
>>>>  close(fd);
>>>>  gettimeofday(&tim, NULL);
>>>>  double t2 = tim.tv_sec + tim.tv_usec/1000000.0;
>>>>  printf("%f\n", t2 - t1);
>>>>  return 0;
>>>> }
>>> 
>> 
>> 

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
