ZFS Group,

        My two cents..

        Currently, in my experience, it is a waste of time to try to
        guarantee the "exact" location of disk blocks with any FS.

        A simple example is bad blocks: the drive remaps them, and a
        neighboring block will suffice.

        Second, current disk controllers have translation logic, so
        outside of the firmware you can't be sure where the disk
        block actually is. Yes, I have written code in this area before.
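
        To make the point concrete, here is a toy sketch in C; the
        table and all names are invented, not any real firmware
        interface:

        #include <stdint.h>
        #include <stdio.h>

        typedef struct remap {
                uint64_t logical;       /* LBA the host asked for */
                uint64_t physical;      /* where the data really is */
        } remap_t;

        /* A grown-defect list: any LBA in the table has been silently
         * moved by the firmware; the host never sees the physical
         * address. */
        static uint64_t
        translate(const remap_t *tbl, int n, uint64_t lba)
        {
                for (int i = 0; i < n; i++)
                        if (tbl[i].logical == lba)
                                return (tbl[i].physical);
                return (lba);           /* not remapped */
        }

        int
        main(void)
        {
                remap_t tbl[] = { { 1000, 999999 } };
                printf("LBA %llu -> physical %llu\n", 1000ULL,
                    (unsigned long long)translate(tbl, 1, 1000));
                return (0);
        }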

        Third, some FSs do a read-modify-write where the write is
        NOT, NOT, NOT overwriting the original location of the read.

        Why? For a couple of reasons. One is that the original read
        may have come from a fragment. Another is FS consistency:
        if the write becomes a partial write in some circumstances
        (ex: a crash), the second file block location preserves the
        original contents and allows the FS to recover. No overwrite.
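
        A minimal sketch of that no-overwrite write path in C, using
        a toy in-memory "disk" (all names here are invented for
        illustration):

        #include <stdint.h>
        #include <string.h>
        #include <stdio.h>

        #define BLOCK_SIZE      512
        #define NBLOCKS         64

        static char     disk[NBLOCKS][BLOCK_SIZE]; /* toy disk */
        static uint64_t blkptr[8];      /* file block -> disk block */
        static uint64_t next_free = 8;  /* naive allocator cursor */

        /* Read-modify-write, COW style: the modified block lands in a
         * NEW disk location and the pointer flips last.  The original
         * block is untouched until then, so a partial write (crash)
         * still leaves the old contents recoverable. */
        static void
        cow_write(int fblk, size_t off, const void *buf, size_t len)
        {
                char tmp[BLOCK_SIZE];

                memcpy(tmp, disk[blkptr[fblk]], BLOCK_SIZE); /* read */
                memcpy(tmp + off, buf, len);            /* modify */

                uint64_t newblk = next_free++;          /* new location */
                memcpy(disk[newblk], tmp, BLOCK_SIZE);  /* write copy */
                blkptr[fblk] = newblk;          /* commit: no overwrite */
        }

        int
        main(void)
        {
                cow_write(0, 0, "hello", 5);
                printf("file block 0 now lives at disk block %llu\n",
                    (unsigned long long)blkptr[0]);
                return (0);
        }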

        Another reason is that sometimes we are filling a hole
        within a FS object, in the window from a base addr to a new
        offset. The ability to concatenate lets us reduce the number
        of future seeks and small reads / writes, at the cost of a
        slightly longer transfer time for the larger theoretical
        disk block.
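
        For example, the concatenation step might look like this
        sketch (assuming pending writes arrive sorted by offset; the
        names are illustrative):

        #include <stdint.h>
        #include <stdio.h>

        typedef struct extent {
                uint64_t off;
                uint64_t len;
        } extent_t;

        /* If b begins at or before the end of a, fold b into a and
         * return 1; the caller then issues one larger write instead
         * of two seeks plus two small writes. */
        static int
        coalesce(extent_t *a, const extent_t *b)
        {
                if (b->off <= a->off + a->len) {
                        uint64_t end = b->off + b->len;
                        if (end > a->off + a->len)
                                a->len = end - a->off;
                        return (1);
                }
                return (0);     /* not adjacent: keep separate I/Os */
        }

        int
        main(void)
        {
                extent_t a = { 0, 4096 }, b = { 4096, 4096 };
                if (coalesce(&a, &b))
                        printf("one write: off=%llu len=%llu\n",
                            (unsigned long long)a.off,
                            (unsigned long long)a.len);
                return (0);
        }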

        Thus, the tradeoff is that we accept wasting some FS space,
        we may not fully optimize the location of the disk block,
        and we take a larger latency on each single large-block read
        or write. But in exchange: we seek less, the per-byte
        overhead is lower, we can order our writes so that we again
        seek less, our writes can be delayed (assuming that we might
        write multiple times and then commit on close) to minimize
        the number of actual write operations, and we can prioritize
        our reads over our writes to decrease read latency, etc.
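
        A sketch of the delayed-write (commit-on-close) piece, with
        a toy one-block dirty cache (names invented):

        #include <string.h>
        #include <stdio.h>

        #define BLOCK_SIZE      512

        static char dirty_buf[BLOCK_SIZE]; /* one cached file block */
        static int  dirty;                 /* needs flushing? */
        static int  physical_writes;       /* count of real disk I/Os */

        /* Logical writes only touch memory; no disk I/O yet. */
        static void
        delayed_write(size_t off, const void *data, size_t len)
        {
                memcpy(dirty_buf + off, data, len);
                dirty = 1;
        }

        /* Commit on close: many logical writes, one physical write. */
        static void
        file_close(void)
        {
                if (dirty) {
                        physical_writes++;      /* the one real write */
                        dirty = 0;
                }
        }

        int
        main(void)
        {
                delayed_write(0, "a", 1);   /* three logical writes */
                delayed_write(1, "b", 1);
                delayed_write(2, "c", 1);
                file_close();
                printf("physical writes: %d\n", physical_writes); /* 1 */
                return (0);
        }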

        Bottom line is that performance may suffer if we do a lot
        of random small read-modify-writes within FS objects that
        use a very large disk block. Since the actual CHANGE to the
        file is small, each small write outside of a delayed-write
        window will consume at least 1 disk block. Moreover, some
        writes are to FS objects that are write-through, and thus
        each small write will consume a new disk block.
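
        A quick worked example of that cost (128 KB is the ZFS
        default recordsize; the other numbers are made up):

        #include <stdio.h>

        int
        main(void)
        {
                const long update = 2 * 1024;   /* 2 KB change */
                const long block = 128 * 1024;  /* 128 KB disk block */
                const long n = 1000;            /* random small writes */

                long logical = n * update;      /* bytes changed */
                long physical = n * block;      /* bytes written */

                printf("logical:  %ld KB\n", logical / 1024);
                printf("physical: %ld KB (%ldx amplification)\n",
                    physical / 1024, physical / logical);
                return (0);
        }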

        Mitchell Erblich
        -----------------

        

Roch - PAE wrote:
> 
> Jeff Davis writes:
>  > > On February 26, 2007 9:05:21 AM -0800 Jeff Davis wrote:
>  > > But you have to be aware that logically sequential reads do
>  > > not necessarily translate into physically sequential reads
>  > > with zfs.  zfs
>  >
>  > I understand that the COW design can fragment files. I'm still
>  > trying to understand how that would affect a database. It seems
>  > like that may be bad for performance on single disks due to the
>  > seeking, but I would expect that to be less significant when
>  > you have many spindles. I've read the following blogs regarding
>  > the topic, but didn't find a lot of details:
>  >
>  > http://blogs.sun.com/bonwick/entry/zfs_block_allocation
>  > http://blogs.sun.com/realneel/entry/zfs_and_databases
>  >
>  >
> 
> Here is my take on this:
> 
> DB updates (writes) are mostly governed by the synchronous
> write code path, which for ZFS means the ZIL performance.
> It's already quite good in that it aggregates multiple
> updates into few I/Os.  Some further improvements are in the
> works.  COW, in general, greatly simplifies the write code path.
> 
> DB reads in transaction workloads are mostly random.  If
> the DB is not cacheable, the performance will be that of a
> head seek no matter what FS is used (since we can't guess in
> advance where to seek, the COW nature neither helps nor
> hinders performance).
> 
> DB reads in decision-support workloads can benefit from good
> prefetching (since here we actually know where the next
> seeks will be).
> 
> -r
> 
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
