ZFS Group, My two cents..
Currently, in my experience, it is a waste of time to try to guarantee the "exact" location of disk blocks with any FS. A simple exception is bad blocks, where a neighboring block will suffice. Second, current disk controllers contain remapping logic, so outside of the firmware you cannot be sure where a disk block actually is (yes, I have written code in this area before). Third, some FSs do a read-modify-write where the write does NOT overwrite the original location of the read. Why? For a couple of reasons. One is that the original read may have come from a fragment. Another is FS consistency: if the write turns into a partial write in some circumstances (e.g. a crash), the second file block location preserves FS consistency and the ability to recover the original contents. No overwrite. Another reason is that sometimes we are filling a hole within an FS object's window, from a base address out to a new offset; the ability to concatenate lets us reduce the number of future seeks and small reads/writes, at the cost of a slightly longer transfer time for the larger theoretical disk block.

Thus, the tradeoff is that we accept wasting some FS space, we may not fully optimize the location of the disk block, and a single large block has higher read and write latency, but... we seek less, the per-byte overhead is lower, we can order our writes so that we again seek less, our writes can be delayed (assuming that we might write multiple times and then commit on close) to minimize the number of actual write operations, we can prioritize our reads over our writes to decrease read latency, etc.

Bottom line: performance may suffer if we do a lot of random small read-modify-writes within FS objects that use a very large disk block. Since the actual change to the file is small, each small write outside of a delayed-write window will consume at least one disk block. However, some writes are to FS objects that are write-through, and thus each small write will consume a new disk block. (Two toy sketches at the end of this message illustrate the no-overwrite update and the delayed-write coalescing described here.)

Mitchell Erblich
-----------------

Roch - PAE wrote:
>
> Jeff Davis writes:
>  > On February 26, 2007 9:05:21 AM -0800 Jeff Davis
>  > > But you have to be aware that logically sequential
>  > > reads do not
>  > > necessarily translate into physically sequential
>  > > reads with zfs. zfs
>  >
>  > I understand that the COW design can fragment files. I'm still trying to
>  > understand how that would affect a database. It seems like that may be bad
>  > for performance on single disks due to the seeking, but I would expect that
>  > to be less significant when you have many spindles. I've read the following
>  > blogs regarding the topic, but didn't find a lot of details:
>  >
>  > http://blogs.sun.com/bonwick/entry/zfs_block_allocation
>  > http://blogs.sun.com/realneel/entry/zfs_and_databases
>  >
>
> Here is my take on this:
>
> DB updates (writes) are mostly governed by the synchronous
> write code path, which for ZFS means ZIL performance.
> It's already quite good in that it aggregates multiple
> updates into few I/Os. Some further improvements are in the
> works. COW, in general, greatly simplifies the write code path.
>
> DB reads in transactional workloads are mostly random. If
> the DB is not cacheable, the performance will be that of a
> head seek no matter what FS is used (since we can't guess in
> advance where to seek, the COW nature neither helps nor hinders
> performance).
>
> DB reads in decision-support workloads can benefit from good
> prefetching (since here we actually know where the next
> seeks will be).
> -r
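P.S. The "no overwrite" read-modify-write above is easier to see in code
than in prose, so here is a minimal toy sketch. It is plain C written just
for this mail -- not ZFS source, and every name in it (cow_update,
alloc_block, the 8x16 "disk") is made up. The point it shows: the modified
data lands in a freshly allocated block, and only afterwards does the block
pointer flip, so a crash mid-update still leaves the original contents
recoverable.

/*
 * Toy model of a no-overwrite (copy-on-write style) read-modify-write.
 * Illustrative only -- block count, block size and the "allocator"
 * are invented for this sketch.
 */
#include <stdio.h>
#include <string.h>

#define NBLOCKS   8
#define BLOCKSIZE 16

static char disk[NBLOCKS][BLOCKSIZE];   /* pretend platter */
static int  used[NBLOCKS];              /* trivial allocation bitmap */

static int alloc_block(void)            /* grab any free block */
{
        for (int i = 0; i < NBLOCKS; i++)
                if (!used[i]) { used[i] = 1; return i; }
        return -1;
}

/* Read-modify-write of the block behind *blkptr, never in place. */
static int cow_update(int *blkptr, int off, const char *buf, int len)
{
        if (off < 0 || len < 0 || off + len > BLOCKSIZE)
                return -1;
        int newblk = alloc_block();
        if (newblk < 0)
                return -1;

        memcpy(disk[newblk], disk[*blkptr], BLOCKSIZE);  /* read   */
        memcpy(disk[newblk] + off, buf, len);            /* modify */
                                    /* the "write" went to newblk  */
        int oldblk = *blkptr;
        *blkptr = newblk;           /* commit: flip the pointer    */
        used[oldblk] = 0;           /* only now free the old copy  */
        return 0;
}

int main(void)
{
        int fileblk = alloc_block();
        memcpy(disk[fileblk], "AAAAAAAAAAAAAAA", BLOCKSIZE);

        printf("before: block %d holds %s\n", fileblk, disk[fileblk]);
        cow_update(&fileblk, 4, "BBBB", 4);
        printf("after:  block %d holds %s\n", fileblk, disk[fileblk]);
        return 0;
}

Note that the crash-recovery property falls out of the ordering alone:
until the pointer flip, any reader (or recovery code) still sees the old
block untouched.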
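P.P.S. Same caveats (illustrative C only, invented structures, not ZFS
code) for the delayed-write window mentioned above and the aggregation
Roch describes for the ZIL: several small writes are absorbed in memory
and pushed out as one larger I/O at commit, trading a longer single
transfer for far fewer seeks and physical writes.

/*
 * Toy sketch of delayed / coalesced writes: small application writes
 * only widen an in-memory dirty range; one physical write covers the
 * whole range at commit (e.g. on close).
 */
#include <stdio.h>

struct delayed {
        int dirty;          /* anything buffered?        */
        int lo, hi;         /* byte range dirtied so far */
        int ios_issued;     /* simulated physical writes */
};

/* Small write: just widen the dirty range, no I/O yet. */
static void dwrite(struct delayed *d, int off, int len)
{
        if (!d->dirty || off < d->lo)
                d->lo = off;
        if (!d->dirty || off + len > d->hi)
                d->hi = off + len;
        d->dirty = 1;
}

/* Commit: one physical write covers everything dirtied. */
static void dcommit(struct delayed *d)
{
        if (!d->dirty)
                return;
        printf("one I/O for bytes [%d,%d)\n", d->lo, d->hi);
        d->ios_issued++;
        d->dirty = 0;
}

int main(void)
{
        struct delayed d = { 0 };

        for (int i = 0; i < 10; i++)    /* ten small nearby updates...  */
                dwrite(&d, i * 512, 100);
        dcommit(&d);                    /* ...become a single write     */

        printf("physical writes: %d (a write-through object would take 10)\n",
            d.ios_issued);
        return 0;
}

The coalesced range also drags along the unmodified gaps between the
small writes -- exactly the "waste some space, transfer a bit more, but
seek much less" tradeoff argued above.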