On Sep 15, 2007, at 12:55 PM, Victor Latushkin wrote:

> I'm proposing a new project for the ZFS community - Block Selection
> Policy and Space Map Enhancements.
+1.  I wonder if some of this could look into a dynamic policy.  For
example, a policy that switches when the pool becomes "too full".

eric

> The space map [1] is a very efficient data structure for keeping
> track of free space in the metaslabs, but there is at least one area
> for improvement - the space map block selection algorithm, which
> could be better (see [2]).
>
> I propose a community project to address the following:
>
>  * define requirements for a block selection policy for various
>    workloads
>
>  * improve the current space map / metaslab implementation to allow
>    for efficient implementation of multiple block selection policies
>
>  * develop a collection of block selection policies optimized for
>    various workloads and requirements
>
> Some background, motivation and requirements can be found in the
> writeup below.
>
> With the best regards,
> Victor
>
>
> Background:
> ===========
> The current block selection algorithm, as implemented in
> metaslab_ff_alloc(), has caused some pain recently (see [3-9,10] for
> examples), and as a result bug 6495013 "Loops and recursion in
> metaslab_ff_alloc can kill performance, even on a pool with lots of
> free data" [11] was filed.  Investigation of this bug identified a
> race condition in the metaslab selection code (see
> metaslab_group_alloc() for details), and fixing it indeed provided
> some relief, but did not solve the problem completely (see [4,10]
> for examples).
>
>
> Current Limitations:
> ====================
> The synopsis of bug 6495013 suggests that it is the loops and
> recursion in metaslab_ff_alloc() that may kill performance.  Indeed,
> the loops and recursion in metaslab_ff_alloc() make it an
> O(N log N) algorithm in the worst case, where N is the number of
> segments in the space map.  Free space fragmentation and the
> alignment requirements of metaslab_ff_alloc() can make this surface
> even on a pool with lots of free space.
>
> For example, imagine a pool consisting of a single 256k metaslab
> with two allocated 512-byte blocks - one at the beginning, the other
> at the end.  The space map for this metaslab will contain only one
> free segment, [512, 261632).  An attempt to allocate a 128k block
> from this space map will fail, even though we definitely have 255k
> of contiguous free space: the only 128k-aligned offset inside the
> free segment is 128k itself, and a 128k block placed there would run
> past the end of the segment.
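> To make the failure mode concrete, here is a rough, user-level
> sketch of cursor-based, power-of-2-aligned first fit in the spirit
> of metaslab_ff_alloc().  It is purely illustrative - the real code
> walks an in-memory AVL tree of space segments and keeps
> per-alignment cursors hung off sm_ppd, so details differ - but it
> reproduces the behaviour on the example above:
>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   typedef struct seg {
>           uint64_t start;         /* inclusive */
>           uint64_t end;           /* exclusive */
>   } seg_t;
>
>   /*
>    * Scan the sorted free segments from *cursor, rounding each
>    * candidate offset up to 'align'; one wrap-around retry.
>    * Returns UINT64_MAX on failure.
>    */
>   static uint64_t
>   ff_alloc(seg_t *segs, int nsegs, uint64_t *cursor,
>       uint64_t size, uint64_t align)
>   {
>           for (int pass = 0; pass < 2; pass++) {
>                   for (int i = 0; i < nsegs; i++) {
>                           uint64_t off = segs[i].start > *cursor ?
>                               segs[i].start : *cursor;
>                           off = (off + align - 1) & ~(align - 1);
>                           if (off + size <= segs[i].end) {
>                                   *cursor = off + size;
>                                   return (off);
>                           }
>                   }
>                   *cursor = 0;    /* wrap around and retry once */
>           }
>           return (UINT64_MAX);
>   }
>
>   int
>   main(void)
>   {
>           /* the example metaslab: 512 bytes allocated at each end */
>           seg_t free_space[] = { { 512, 261632 } };
>           uint64_t cursor = 0;
>
>           if (ff_alloc(free_space, 1, &cursor, 131072, 131072) ==
>               UINT64_MAX)
>                   printf("128k alloc fails despite 255k free\n");
>
>           /* halve the request, as the gang code does */
>           cursor = 0;
>           printf("64k alloc lands at %llu\n", (unsigned long long)
>               ff_alloc(free_space, 1, &cursor, 65536, 65536));
>           return (0);
>   }
>
> Running this prints the 128k failure and then places the first 64k
> chunk at offset 64k - exactly the GM1 placement in the first gang
> layout below.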
> Gang blocks come to the rescue here (see [13]).  We start trying to
> allocate smaller block sizes - 64k, 32k and so on - until we have
> allocated enough smaller blocks to satisfy the 128k allocation.  In
> this case, depending on the positions of the 64k-aligned and
> 512b-aligned cursors (see the use of sm_ppd in metaslab_ff_alloc()),
> we may get several variants of smaller block locations (GH is the
> gang header, GM1 and GM2 are gang members, A indicates allocated
> blocks, and F free space in the space map):
>
>            GH          GM1        GM2
>   A: [0-512)[512-1k)  [64k-128k)[128k-192k)  [255.5k-256k)
>   F:          [1k-64k)            [192k-255.5k)
>
> In this case we effectively allocate a 128k block that is not
> aligned on a 128k boundary, with the additional overhead of a gang
> header.
>
>            GH          GM2        GM1
>   A: [0-512)[512-1k)  [64k-128k)[128k-192k)  [255.5k-256k)
>   F:          [1k-64k)            [192k-255.5k)
>
> In this case the two halves of our 128k are swapped.
>
> In the case where the 512b-aligned cursor points to offset 64k and
> we allocate the GH starting at that offset, we'll have to allocate
> three gang members - one 64k and two 32k - fragmenting free space
> further.
>
> Another option would be to walk through the space map and take as
> much of each free space segment as needed to satisfy the
> allocation.  With our example it would look like this:
>
>            GH      GM1
>   A: [0-512)[512-1k)[1k-129k)  [255.5k-256k)
>   F:                          [129k-255.5k)
>
> But do we really need the overhead of a gang header in this case?
> It looks like we can definitely do better, and it is a bit early to
> "stop looking and start ganging" (see [12]).  It may be better to
> look smarter instead.
>
> The potential number of free space segments in a space map grows
> with the size of the space map.  Since the number of metaslabs per
> vdev is somewhat fixed (see [14]), larger vdevs have larger space
> maps and a larger potential number of free space segments in them,
> and thus higher chances of worst-case behaviour.  This may be
> mitigated by increasing the number of metaslabs per vdev, but that
> would bring other difficulties.
>
>
> Requirement:
> ============
> Alternative implementations of the block allocation functions
> and/or the space map data structure could make the allocation
> procedure a better space map citizen, with an O(log N) bound, and
> could provide additional information such as the size of the
> largest free space segment in the space map.  This may translate
> into more compact space maps, improvements in metaslab selection,
> more timely and efficient defragmentation, and other benefits for
> the performance and predictability of ZFS.
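> As a sketch of the kind of auxiliary structure meant here (again
> purely illustrative - a sorted array stands in for what would
> really be a balanced tree, and none of the names below come from
> actual ZFS code), consider indexing the free segments of the gang
> layouts above by size:
>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   typedef struct seg {
>           uint64_t start;
>           uint64_t size;
>   } seg_t;
>
>   /* smallest free segment with size >= want, O(log N) */
>   static int
>   bysize_lower_bound(const seg_t *s, int n, uint64_t want)
>   {
>           int lo = 0, hi = n;
>           while (lo < hi) {
>                   int mid = (lo + hi) / 2;
>                   if (s[mid].size < want)
>                           lo = mid + 1;
>                   else
>                           hi = mid;
>           }
>           return (lo);            /* == n if nothing fits */
>   }
>
>   int
>   main(void)
>   {
>           /* free segments after ganging, ordered by size:
>            * [1k,64k) is 63k, [192k,255.5k) is 63.5k */
>           seg_t bysize[] = {
>                   { 1024, 64512 },
>                   { 196608, 65024 },
>           };
>           int n = 2;
>
>           /* O(1): the largest free segment is the last entry */
>           printf("largest free segment: %llu bytes\n",
>               (unsigned long long)bysize[n - 1].size);
>
>           /* O(log N): will a 64k request have to gang? */
>           if (bysize_lower_bound(bysize, n, 65536) == n)
>                   printf("64k request must gang - known up front\n");
>           return (0);
>   }
>
> With such an index the allocator knows the largest free segment in
> O(1), so it can decide immediately whether a request must gang,
> instead of discovering that by exhaustive search.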
>
> Motivation:
> ===========
> Better block selection policies will provide better and/or more
> predictable performance with empty, full and fragmented storage
> pools, will help to use current ZFS features like snapshots more
> effectively, and will also make ZFS friendlier to different kinds
> of storage, such as flash cards, thin-provisioned storage, etc.
>
>
> Links:
> ======
>
> [1] http://blogs.sun.com/bonwick/entry/space_maps
>
> [2] http://blogs.sun.com/bonwick/entry/zfs_block_allocation
>
> [3] nfsd threads hang in ZFS
>     http://www.opensolaris.org/jive/thread.jspa?messageID=43565
>
> [4] Snapshot impact on performance
>     http://www.opensolaris.org/jive/thread.jspa?messageID=64379
>
> [5] NFS/ZFS performance problems - txg_wait_open() deadlocks?
>     http://www.opensolaris.org/jive/thread.jspa?messageID=92885
>
> [6] zpool export consumes whole CPU and takes more than 30 minutes
>     to complete
>     http://www.opensolaris.org/jive/thread.jspa?messageID=93159
>
> [7] Best Practises => Keep Pool Below 80%?
>     http://www.opensolaris.org/jive/thread.jspa?messageID=93210
>
> [8] ZFS performance problems - solved
>     http://www.opensolaris.org/jive/thread.jspa?messageID=102762
>
> [9] 120473-05
>     http://www.opensolaris.org/jive/thread.jspa?messageID=109078
>
> [10] ZFS performance and memory consumption
>      http://www.opensolaris.org/jive/thread.jspa?messageID=135184
>
> [10] ZFS pool fragmentation
>      http://www.opensolaris.org/jive/thread.jspa?messageID=136349
>
> [11] 6495013 Loops and recursion in metaslab_ff_alloc can kill
>      performance, even on a pool with lots of free data
>      http://bugs.opensolaris.org/view_bug.do?bug_id=6495013
>
> [12] 6596237 Stop looking and start ganging
>      http://bugs.opensolaris.org/view_bug.do?bug_id=6596237
>
> [13] zio_write_allocate_gang_members()
>      http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#1248
>
> [14] vdev_ms_shift calculation in vdev_init()
>      http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev.c#1113

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss