On Sep 15, 2007, at 12:55 PM, Victor Latushkin wrote:

> I'm proposing a new project for the ZFS community - Block Selection
> Policy and Space Map Enhancements.

+1.

I wonder if some of this could look into a dynamic policy.  For  
example, a policy that switches when the pool becomes "too full".
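
Just to make that concrete, a quick sketch of what I have in mind --
none of these names exist in ZFS today (metaslab_bf_alloc() stands in
for "some denser-packing policy"), and the 20% free threshold is only
a placeholder:

#include <sys/types.h>

typedef struct space_map space_map_t;		/* opaque for this sketch */
typedef uint64_t (*blkalloc_func_t)(space_map_t *, uint64_t);

/* metaslab_ff_alloc() is the existing first-fit allocator; the other is hypothetical */
extern uint64_t metaslab_ff_alloc(space_map_t *, uint64_t);
extern uint64_t metaslab_bf_alloc(space_map_t *, uint64_t);

static uint64_t
metaslab_dynamic_alloc(space_map_t *sm, uint64_t size,
    uint64_t free_space, uint64_t total_space)
{
	/* "too full" == less than 20% free; the threshold is made up */
	blkalloc_func_t alloc = (free_space * 5 < total_space) ?
	    metaslab_bf_alloc : metaslab_ff_alloc;

	return (alloc(sm, size));
}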

eric

>
> The space map [1] is a very efficient data structure for keeping track
> of free space in the metaslabs, but there is at least one area for
> improvement - the block selection algorithm, which could be better
> (see [2]).
>
> I propose a community project to address the following:
>
> * define requirements for a block selection policy for various
> workloads
>
> * improve the current space map / metaslab implementation to allow for
> efficient implementation of multiple block selection policies (a rough
> interface sketch follows this list)
>
> * develop a collection of block selection policies optimized for
> various workloads and requirements
>
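> As an illustration of the second item - purely a sketch, not an
> interface that exists in this form - each policy could be packaged as
> a small ops vector that the metaslab layer calls through:
>
> #include <sys/types.h>
>
> typedef struct space_map space_map_t;	/* opaque for this sketch */
>
> /*
>  * Sketch of a pluggable block-selection interface; names are
>  * illustrative only.  A policy supplies an allocation routine plus
>  * optional load/unload hooks for any private state it keeps
>  * (the role sm_ppd plays for the first-fit cursors today).
>  */
> typedef struct blk_policy_ops {
> 	const char	*bp_name;
> 	/* return an offset for a 'size'-byte block, or -1ULL on failure */
> 	uint64_t	(*bp_alloc)(space_map_t *sm, uint64_t size);
> 	void		(*bp_load)(space_map_t *sm);	/* optional */
> 	void		(*bp_unload)(space_map_t *sm);	/* optional */
> } blk_policy_ops_t;
>
> /* the existing first-fit allocator expressed through that interface */
> extern uint64_t metaslab_ff_alloc(space_map_t *, uint64_t);
>
> static const blk_policy_ops_t first_fit_ops = {
> 	.bp_name	= "first-fit",
> 	.bp_alloc	= metaslab_ff_alloc,
> 	.bp_load	= NULL,
> 	.bp_unload	= NULL,
> };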
>
> You may find some background, motivation and requirements in the
> writeup below.
>
> With the best regards,
> Victor
>
>
> Background:
> ===========
> The current block selection algorithm as implemented in
> metaslab_ff_alloc() has caused some pain recently (see [3-9,10] for
> examples), and as a result bug 6495013 "Loops and recursion in
> metaslab_ff_alloc can kill performance, even on a pool with lots of
> free data" [11] was filed.  Investigation of this bug identified a
> race condition in the metaslab selection code (see
> metaslab_group_alloc() for details); fixing it indeed provided some
> relief but did not solve the problem completely (see [4,10] for
> example).
>
>
> Current Limitations:
> ====================
> The synopsis of bug 6495013 suggests that it is the loops and
> recursion in metaslab_ff_alloc() which may kill performance.  Indeed,
> the loops and recursion in metaslab_ff_alloc() make it an O(N log N)
> algorithm in the worst case, where N is the number of segments in the
> space map.  Free space fragmentation and the alignment requirements of
> metaslab_ff_alloc() can make this surface even on a pool with lots of
> free space.
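>
> To make the worst case concrete, here is a heavily simplified sketch
> of a cursor-based first-fit over an offset-sorted AVL tree of free
> segments.  It is not the actual metaslab_ff_alloc(), only an
> illustration of where the linear scan and the retry come from (it
> assumes the Solaris <sys/avl.h> and <sys/sysmacros.h> helpers; the
> tree's compare function on fs_start is omitted):
>
> #include <sys/types.h>
> #include <sys/avl.h>
> #include <sys/sysmacros.h>	/* P2ROUNDUP() */
>
> typedef struct free_seg {
> 	avl_node_t	fs_node;
> 	uint64_t	fs_start;	/* inclusive */
> 	uint64_t	fs_end;		/* exclusive */
> } free_seg_t;
>
> /*
>  * Illustrative only.  *cursor remembers where the previous allocation
>  * of this alignment ended (the role sm_ppd plays in the real code).
>  */
> static uint64_t
> ff_alloc_sketch(avl_tree_t *t, uint64_t *cursor, uint64_t size,
>     uint64_t align)
> {
> 	free_seg_t search, *fs;
> 	avl_index_t where;
>
> 	search.fs_start = *cursor;
> 	search.fs_end = *cursor + size;
> 	fs = avl_find(t, &search, &where);
> 	if (fs == NULL)
> 		fs = avl_nearest(t, where, AVL_AFTER);
>
> 	/* linear walk: in the worst case this visits every segment */
> 	for (; fs != NULL; fs = AVL_NEXT(t, fs)) {
> 		uint64_t off = P2ROUNDUP(fs->fs_start, align);
>
> 		if (off + size <= fs->fs_end) {
> 			*cursor = off + size;
> 			return (off);
> 		}
> 	}
>
> 	/* nothing after the cursor fit: wrap around once and rescan */
> 	if (*cursor != 0) {
> 		*cursor = 0;
> 		return (ff_alloc_sketch(t, cursor, size, align));
> 	}
> 	return ((uint64_t)-1);	/* caller falls back to smaller sizes */
> }
>
> A caller that responds to such a failure by retrying with smaller and
> smaller block sizes multiplies these passes over the tree, which is
> where the O(N log N) worst case comes from.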
>
> For example, imagine a pool consisting of only one 256k metaslab with
> two allocated 512-byte blocks - one at the beginning, the other at the
> end.  The space map for this metaslab will contain only one free
> segment, [512, 261632).  An attempt to allocate a 128k block from this
> space map will fail, even though we definitely have 255k of contiguous
> free space in this case.
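>
> The failure is purely an alignment artifact: assuming a 128k block
> must start on a 128k boundary (as the gang-free layouts below
> suggest), the only candidate placements inside the 256k metaslab are
> [0, 128k) and [128k, 256k), and neither lies entirely within the free
> segment [512, 261632).  A trivial userland check of that arithmetic:
>
> #include <stdio.h>
> #include <stdint.h>
>
> int
> main(void)
> {
> 	uint64_t fs_start = 512, fs_end = 261632;	/* the free segment */
> 	uint64_t size = 131072, align = 131072;		/* 128k, 128k-aligned */
>
> 	for (uint64_t off = 0; off + size <= 262144; off += align) {
> 		int fits = (off >= fs_start && off + size <= fs_end);
> 		printf("offset %7llu %s\n", (unsigned long long)off,
> 		    fits ? "fits" : "does not fit");
> 	}
> 	/* offset 0 is allocated; offset 128k ends at 256k > 261632 */
> 	return (0);
> }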
>
> Gang blocks come to the rescue here (see [13]).  We start trying to
> allocate smaller block sizes - 64k, 32k and so on - until we have
> allocated enough smaller blocks to satisfy the 128k allocation.  In
> this case, depending on the position of the 64k-aligned and
> 512b-aligned cursors (see the use of sm_ppd in metaslab_ff_alloc()),
> we may get multiple variants of smaller block placement (GH is the
> gang header, GM1 and GM2 are gang members, A indicates allocated
> blocks, and F free space in the space map):
>
>              GH             GM1       GM2
> A:[0-512)[512-1k)      [64k-128k)[128k-192k)           [255.5k-256k)
> F:              [1k-64k)                   [192k-255.5k)
>
> In this case we effectively allocate a 128k block that is not aligned
> on a 128k boundary, with the additional overhead of a gang header.
>
>              GH             GM2       GM1
> A:[0-512)[512-1k)      [64k-128k)[128k-192k)           [255.5k-256k)
> F:              [1k-64k)                   [192k-255.5k)
>
> In this case the two halves of our 128k block are swapped.
>
> In the case where the 512b-aligned cursor points to offset 64k and we
> allocate the GH starting at that offset, we'll have to allocate 3 gang
> members - one 64k and two 32k - fragmenting free space further.
>
> Another option would be to walk through the space map and take as much
> of each free segment as needed to satisfy the allocation.  With our
> example it would look like this:
>
>              GH      GM1
> A:[0-512)[512-1k)[1k-129k)           [255.5k-256k)
> F:                       [129k-255.5k)
>
> But do we really need the overhead of a gang header in this case?  It
> looks like we can definitely do better, and it is a bit early to "stop
> looking and start ganging" (see [12]).  It may be better to look
> smarter instead.
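>
> One "smarter" fallback, sketched here on top of the illustrative
> ff_alloc_sketch() above (so again not real ZFS code): before paying
> for a gang header, rescan for any single segment that can hold the
> whole block at sector alignment rather than natural alignment.
>
> #define	SECTOR_ALIGN	512ULL
>
> static uint64_t
> alloc_before_ganging(avl_tree_t *t, uint64_t *cursor, uint64_t size)
> {
> 	/* first try the usual naturally-aligned first fit */
> 	uint64_t off = ff_alloc_sketch(t, cursor, size, size & -size);
>
> 	if (off != (uint64_t)-1)
> 		return (off);
>
> 	/* relax the alignment before resorting to gang blocks */
> 	return (ff_alloc_sketch(t, cursor, size, SECTOR_ALIGN));
> }
>
> In the example above this places the whole 128k block at offset 512
> with no gang header at all; the price is the loss of natural
> alignment, which is exactly the kind of trade-off a block selection
> policy should be able to make explicitly.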
>
> The potential number of free segments in a space map grows with the
> size of the space map.  Since the number of metaslabs per vdev is
> somewhat fixed (see [14]), larger vdevs have larger metaslabs, hence
> larger space maps with a larger potential number of free segments in
> them, and thus higher chances of worst-case behaviour.  This could be
> mitigated by increasing the number of metaslabs per vdev, but that
> would bring other difficulties.
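>
> Back-of-the-envelope numbers (assuming roughly 200 metaslabs per
> top-level vdev and a 512-byte minimum allocation unit; the real target
> comes from the vdev_ms_shift calculation in [14], so treat 200 as
> illustrative only):
>
> #include <stdio.h>
> #include <stdint.h>
>
> int
> main(void)
> {
> 	uint64_t vdev_size = 1ULL << 40;		/* 1 TB vdev */
> 	uint64_t metaslabs = 200;			/* assumed */
> 	uint64_t ms_size = vdev_size / metaslabs;	/* ~5 GB metaslab */
> 	/* worst case: 512b allocated / 512b free alternating */
> 	uint64_t max_segs = ms_size / 1024;
>
> 	printf("metaslab ~%llu MB, up to ~%llu free segments\n",
> 	    (unsigned long long)(ms_size >> 20),
> 	    (unsigned long long)max_segs);
> 	return (0);
> }
>
> Millions of potential segments per space map is exactly the regime
> where a linear first-fit scan hurts.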
>
> Requirement:
> ============
> Alternative implementations of the block allocation functions and/or
> of the space map data structure could make the allocation procedure a
> better space map citizen, with an O(log N) bound, and could provide
> additional information such as the size of the largest free segment in
> the space map.  This may translate into more compact space maps,
> improvements in metaslab selection, more timely and efficient
> defragmentation, and other benefits for the performance and
> predictability of ZFS.
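>
> One possible shape for such an implementation - again only a sketch,
> reusing free_seg_t from the first-fit sketch above (in practice the
> segment struct would carry a second avl_node_t for this index) - is a
> second AVL tree over the same segments, sorted by segment length, so
> that a best-fit lookup is O(log N) and the largest free segment is
> simply the last node:
>
> /* assumes a by-size tree whose compare key is (fs_end - fs_start) */
> static uint64_t
> bf_alloc_sketch(avl_tree_t *by_size, uint64_t size)
> {
> 	free_seg_t search, *fs;
> 	avl_index_t where;
>
> 	search.fs_start = 0;
> 	search.fs_end = size;		/* i.e. search length == size */
> 	fs = avl_find(by_size, &search, &where);
> 	if (fs == NULL)
> 		fs = avl_nearest(by_size, where, AVL_AFTER);
> 	if (fs == NULL)
> 		return ((uint64_t)-1);	/* no single segment is big enough */
>
> 	return (fs->fs_start);		/* caller trims [start, start+size) */
> }
>
> static uint64_t
> largest_free_segment(avl_tree_t *by_size)
> {
> 	free_seg_t *fs = avl_last(by_size);
>
> 	return (fs == NULL ? 0 : fs->fs_end - fs->fs_start);
> }
>
> Keeping the second tree in sync costs an extra O(log N) insert and
> remove on every allocation and free, which is the trade-off for the
> cheap best-fit and largest-segment queries.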
>
>
> Motivation:
> ===========
> Better block selection policies will provide better and/or more
> predictable performance with empty, full, or fragmented storage pools,
> will help use current ZFS features such as snapshots more effectively,
> and will also make ZFS friendlier to different storage devices such as
> flash cards, thin-provisioned storage, etc.
>
> Links:
> ======
>
> [1] http://blogs.sun.com/bonwick/entry/space_maps
>
> [2] http://blogs.sun.com/bonwick/entry/zfs_block_allocation
>
> [3] nfsd threads hang in ZFS
> http://www.opensolaris.org/jive/thread.jspa?messageID=43565
>
> [4] Snapshot impact on performance
> http://www.opensolaris.org/jive/thread.jspa?messageID=64379
>
> [5] NFS/ZFS performance problems - txg_wait_open() deadlocks?
> http://www.opensolaris.org/jive/thread.jspa?messageID=92885
>
> [6] zpool export consumes whole CPU and takes more than 30 minutes to
> complete
> http://www.opensolaris.org/jive/thread.jspa?messageID=93159
>
> [7] Best Practises => Keep Pool Below 80%?
> http://www.opensolaris.org/jive/thread.jspa?messageID=93210
>
> [8] ZFS performance problems - solved
> http://www.opensolaris.org/jive/thread.jspa?messageID=102762
>
> [9] 120473-05
> http://www.opensolaris.org/jive/thread.jspa?messageID=109078
>
> [10] ZFS performance and memory consumption
> http://www.opensolaris.org/jive/thread.jspa?messageID=135184
>
> [10] ZFS pool fragmentation
> http://www.opensolaris.org/jive/thread.jspa?messageID=136349
>
> [11] 6495013 Loops and recursion in metaslab_ff_alloc can kill
> performance, even on a pool with lots of free data
> http://bugs.opensolaris.org/view_bug.do?bug_id=6495013
>
> [12] 6596237 Stop looking and start ganging
> http://bugs.opensolaris.org/view_bug.do?bug_id=6596237
>
> [13] zio_write_allocate_gang_members()
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#1248
>
> [14] vdev_ms_shift calculation in vdev_init()
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev.c#1113
>
>
