Tim Cook writes:

> On Sun, Dec 27, 2009 at 6:43 PM, Bob Friesenhahn
> <bfrie...@simple.dallas.tx.us> wrote:
>
> > On Sun, 27 Dec 2009, Tim Cook wrote:
> >
> > > That is ONLY true when there's significant free space available/a
> > > fresh pool.  Once those files have been deleted and the blocks put
> > > back into the free pool, they're no longer "sequential" on disk,
> > > they're all over the disk.  So it makes a VERY big difference.  I'm
> > > not sure why you'd be shocked someone would bring this up.
> >
> > While I don't know what zfs actually does, I do know that it performs
> > large disk allocations (e.g. 1MB) and then parcels 128K zfs blocks from
> > those allocations.  If the zfs designers are wise, then they will use
> > knowledge of sequential access to ensure that all of the 128K blocks
> > from a metaslab allocation are pre-assigned for use by that file, and
> > they will try to choose metaslabs which are followed by free metaslabs,
> > or close to other free metaslabs.  This approach would tend to limit
> > the sequential-access damage caused by COW and free block fragmentation
> > on a "dirty" disk.
>
> How is that going to prevent blocks being spread all over the disk when
> you've got files several GB in size being written concurrently and
> deleted at random?  And then throw in a mix of small files as well, kiss
> that goodbye.
Big files being deleted create big chunks of space for reuse.  That is a
great way to clean up the layout.  Within a metaslab, ZFS uses cursors to
bunch small objects closer together (there's a little sketch of the idea
at the end of this message):

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c#501

> > This sort of planning is not terribly different than detecting
> > sequential read I/O and scheduling data reads in advance of application
> > requirements.  If you can intelligently pre-fetch data blocks, then you
> > can certainly intelligently pre-allocate data blocks.
>
> Pre-allocating data blocks is also not going to cure head seek and the
> latency it induces on slow 7200/5400RPM drives.
>
> > Today I did an interesting (to me) test where I ran two copies of
> > iozone at once on huge (up to 64GB) files.  The results were somewhat
> > amazing to me.  The cause of the amazement was that I noticed that the
> > reported data rates from iozone did not drop very much (e.g. a
> > single-process write rate of 359MB/second dropped to 298MB/second with
> > two processes).  This clearly showed that zfs is doing quite a lot of
> > smart things when writing files and that it is optimized for
> > several/many writers rather than just one.
>
> On a new, empty pool, or a pool that's been filled completely and emptied
> several times?  It's not amazing to me on a new pool.  I would be
> surprised to see you accomplish this feat repeatedly after filling and
> emptying the drives.  It's a drawback of every implementation of
> copy-on-write I've ever seen.  By its very nature, I have no idea how you
> would avoid it.

If you empty the drives you're back to all free space:

http://blogs.sun.com/bonwick/entry/space_maps

If you leave yourself a nice cushion of free space, and if your profile of
object sizes does not radically change over time, I think people should be
fine when it comes to free space fragmentation issues.  That said, slab
and block selection is still on our radar for improvements.

-r
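P.S.  For anyone who doesn't feel like reading metaslab.c, here is the
stripped-down sketch of the cursor idea promised above.  It is a toy of my
own, not the real allocator (the real code works over space maps, and every
name and constant below is made up): each power-of-two size class keeps its
own cursor, so a request resumes searching where the last request of that
size left off.  That avoids re-searching the front of the metaslab and
tends to keep same-sized blocks next to each other.

/*
 * Toy sketch of the "cursor per size class" idea -- NOT the real
 * metaslab.c code.  Space is modeled as a flat bitmap of equal-sized
 * units; every name and constant here is invented for the example.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define MS_UNITS    1024    /* units of space in this toy metaslab */
#define MS_CLASSES  11      /* power-of-two size classes: 1 .. 1024 units */

static uint8_t  ms_used[MS_UNITS];      /* 1 = allocated, 0 = free */
static uint32_t ms_cursor[MS_CLASSES];  /* per-size-class search cursor */

/* Index of the highest set bit; callers pass size >= 1. */
static int
highbit(uint32_t n)
{
    int b = -1;

    while (n != 0) {
        n >>= 1;
        b++;
    }
    return (b);
}

/*
 * First-fit search that starts at the cursor belonging to this request's
 * size class and wraps around once.  Returns the starting unit, or -1 if
 * no run of 'size' free units exists.
 */
static int
ms_alloc(uint32_t size)
{
    int sc = highbit(size);
    uint32_t start = ms_cursor[sc];
    int pass;

    for (pass = 0; pass < 2; pass++) {
        uint32_t off = (pass == 0) ? start : 0;
        uint32_t limit = (pass == 0) ? MS_UNITS : start;

        while (off + size <= limit) {
            uint32_t run = 0;

            while (run < size && ms_used[off + run] == 0)
                run++;
            if (run == size) {
                memset(&ms_used[off], 1, size);   /* claim the run */
                ms_cursor[sc] = off + size;       /* resume here next time */
                return ((int)off);
            }
            off += run + 1;     /* skip past the allocated unit we hit */
        }
    }
    return (-1);
}

int
main(void)
{
    /* Interleave small and large requests. */
    printf("small at %d\n", ms_alloc(2));
    printf("large at %d\n", ms_alloc(256));
    printf("small at %d\n", ms_alloc(2));
    printf("small at %d\n", ms_alloc(2));
    printf("large at %d\n", ms_alloc(256));
    return (0);
}

The sketch is plain first-fit; the real selection logic has more heuristics
than this, which is part of why slab and block selection is still on the
radar for improvements.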
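P.P.S.  On the "empty the drives" point, here is an equally small
illustration, again my own sketch loosely following the blog entry above
rather than the actual on-disk format.  Think of a space map as an
append-only log of ALLOC/FREE extent records: replay the log, and once
every allocation has had a matching free you are left with a single free
extent -- all free space, nothing fragmented left behind.

/*
 * Toy replay of a space-map-style log -- NOT the real on-disk format.
 * A space map is conceptually just an append-only log of ALLOC/FREE
 * extent records; all names below are invented for the example.
 */
#include <stdio.h>

#define POOL_UNITS  64

enum sm_type { SM_ALLOC, SM_FREE };

struct sm_entry {
    enum sm_type type;
    int          offset;
    int          size;
};

/* Count maximal runs of free units -- a rough measure of fragmentation. */
static int
count_free_extents(const char *used, int n)
{
    int i, extents = 0;

    for (i = 0; i < n; i++)
        if (!used[i] && (i == 0 || used[i - 1]))
            extents++;
    return (extents);
}

int
main(void)
{
    char used[POOL_UNITS] = { 0 };
    /* Scattered allocations, then everything gets freed again. */
    struct sm_entry log[] = {
        { SM_ALLOC,  0,  8 }, { SM_ALLOC, 20,  4 }, { SM_ALLOC, 40, 16 },
        { SM_FREE,  20,  4 }, { SM_FREE,   0,  8 }, { SM_FREE,  40, 16 },
    };
    int i, j, nlog = (int)(sizeof (log) / sizeof (log[0]));

    for (i = 0; i < nlog; i++) {
        for (j = 0; j < log[i].size; j++)
            used[log[i].offset + j] = (log[i].type == SM_ALLOC);
        printf("after entry %d: %d free extent(s)\n",
            i, count_free_extents(used, POOL_UNITS));
    }
    /* The final line prints 1 free extent: the emptied pool is one big hole. */
    return (0);
}

The real format and the way the log gets condensed are more involved, of
course; the point is only the end state: once the pool is emptied, there is
no fragmentation left to carry over.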