I'm proposing a new project for the ZFS community - Block Selection 
Policy and Space Map Enhancements.

The space map [1] is a very efficient data structure for keeping track 
of free space in metaslabs, but there is at least one area for 
improvement - the block selection algorithm, which could be better (see [2]).
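For readers new to this code: the in-core side of a space map is an AVL 
tree of free segments. A simplified sketch of the structures (trimmed 
down from space_map.h; several fields omitted):

    #include <sys/types.h>
    #include <sys/avl.h>

    typedef struct space_seg {
            avl_node_t      ss_node;   /* AVL linkage, ordered by ss_start */
            uint64_t        ss_start;  /* start of free region */
            uint64_t        ss_end;    /* end of free region (exclusive) */
    } space_seg_t;

    typedef struct space_map {
            avl_tree_t      sm_root;   /* free segments, by offset */
            uint64_t        sm_space;  /* sum of all free segments */
            uint64_t        sm_start;  /* start of this map's range */
            uint64_t        sm_size;   /* size of this map's range */
            void            *sm_ppd;   /* allocator-private data (cursors) */
    } space_map_t;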

I propose a community project to address the following:

* define requirements for block selection policies for various workloads

* improve the current space map / metaslab implementation to allow for 
efficient implementation of multiple block selection policies

* develop a collection of block selection policies optimized for 
various workloads and requirements (one possible shape of the pluggable 
interface is sketched after this list)
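
One way to let multiple policies coexist is a pluggable allocator 
interface. A hypothetical sketch - these names are illustrative only 
and do not exist in today's code:

    /*
     * Hypothetical pluggable block-selection policy, to illustrate
     * the proposal; none of these names exist in ZFS today.
     */
    typedef struct metaslab_ops {
            const char      *msop_name;     /* e.g. "first-fit" */
            uint64_t        (*msop_alloc)(space_map_t *sm, uint64_t size);
            void            (*msop_free)(space_map_t *sm, uint64_t start,
                                uint64_t size);
            uint64_t        (*msop_maxsize)(space_map_t *sm); /* largest free */
    } metaslab_ops_t;

    extern metaslab_ops_t metaslab_ff_ops;  /* today's first-fit policy */
    extern metaslab_ops_t metaslab_bf_ops;  /* e.g. a best-fit policy */

A pool (or even a dataset) could then pick the policy matching its 
workload.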


You may find some background, motivation, and requirements in the 
writeup below.

With best regards,
Victor


Background:
===========
The current block selection algorithm, as implemented in 
metaslab_ff_alloc(), has caused some pain recently (see [3-11] for 
examples), and as a result bug 6495013 "Loops and recursion in 
metaslab_ff_alloc can kill performance, even on a pool with lots of 
free data" [12] was filed. Investigation of this bug identified a race 
condition in the metaslab selection code (see metaslab_group_alloc() 
for details); fixing it indeed provided some relief, but did not solve 
the problem completely (see [4,10,11] for examples).


Current Limitations:
====================
The synopsis of bug 6495013 suggests that it is the loops and recursion 
in metaslab_ff_alloc() which may kill performance. Indeed, the loops 
and recursion in metaslab_ff_alloc() make it an O(N log N) algorithm in 
the worst case, where N is the number of segments in the space map. 
Free space fragmentation combined with the alignment requirements of 
metaslab_ff_alloc() can make this surface even on a pool with lots of 
free space.
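
To see where the loops and recursion come from, here is a condensed 
sketch of the metaslab_ff_alloc() logic, simplified from the onnv 
source (locking and assertions omitted; the structures are those 
sketched above):

    static uint64_t
    metaslab_ff_alloc(space_map_t *sm, uint64_t size)
    {
            avl_tree_t *t = &sm->sm_root;
            uint64_t align = size & -size;
            uint64_t *cursor = (uint64_t *)sm->sm_ppd + highbit(align) - 1;
            space_seg_t *ss, ssearch;
            avl_index_t where;

            ssearch.ss_start = *cursor;
            ssearch.ss_end = *cursor + size;

            ss = avl_find(t, &ssearch, &where);
            if (ss == NULL)
                    ss = avl_nearest(t, where, AVL_AFTER);

            /* Loop: may visit every segment after the cursor. */
            while (ss != NULL) {
                    uint64_t offset = P2ROUNDUP(ss->ss_start, align);

                    if (offset + size <= ss->ss_end) {
                            *cursor = offset + size;
                            return (offset);
                    }
                    ss = AVL_NEXT(t, ss);
            }

            /* Recursion: reset the cursor and rescan from the start. */
            if (*cursor == 0)
                    return (-1ULL);

            *cursor = 0;
            return (metaslab_ff_alloc(sm, size));
    }

With N segments in the map, the while loop can visit O(N) segments, the 
cursor-reset recursion can make it scan the map a second time, and each 
tree step costs up to O(log N) - hence the O(N log N) worst case.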

For example, let's imagine a pool consisting of only one 256k metaslab 
with two allocated 512-byte blocks - one at the very beginning, the 
other at the very end. The space map for this metaslab will contain 
only one free segment, [512,261632). An attempt to allocate a 128k 
block from this space map will fail - the only 128k-aligned offsets, 0 
and 128k, both overlap an allocated 512-byte block - even though we 
definitely have 255.5k of contiguous free space.
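
Tracing the sketch above with size = align = 128k makes the failure 
concrete:

    /*
     * Free segment: [512, 261632), i.e. 261632 == 256k - 512.
     *
     * P2ROUNDUP(512, 128k) == 128k, so the only aligned candidate
     * inside the segment is offset 131072 (128k):
     *
     *         offset + size == 131072 + 131072 == 262144
     *         ss_end        == 261632
     *
     * 262144 > 261632, so the fit check fails; AVL_NEXT() finds no
     * further segment, and (after at most one cursor-reset rescan)
     * the allocation returns -1ULL.
     */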

Gang blocks come to the rescue here (see [14]). We start trying to 
allocate smaller block sizes - 64k, 32k and so on - until we have 
allocated enough smaller blocks to satisfy the 128k allocation. In this 
case, depending on the positions of the 64k-aligned and 512-byte-aligned 
cursors (see the use of sm_ppd in metaslab_ff_alloc()), we may get 
multiple variants of smaller block placement (GH is the gang header, 
GM1 and GM2 are gang members, A indicates allocated blocks, and F free 
space in the space map):

             GH             GM1       GM2
A:[0-512)[512-1k)      [64k-128k)[128k-192k)           [255.5k-256k)
F:              [1k-64k)                   [192k-255.5k)

In this case we effectively allocate a 128k block that is not aligned 
on a 128k boundary, with the additional overhead of a gang header.

             GH             GM2       GM1
A:[0-512)[512-1k)      [64k-128k)[128k-192k)           [255.5k-256k)
F:              [1k-64k)                   [192k-255.5k)

In this case the two halves of our 128k block are swapped.

In the case where the 512-byte-aligned cursor points to offset 64k and 
we allocate the GH starting at that offset, we'll have to allocate 
three gang members - one 64k and two 32k - fragmenting free space even 
further.
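
The mechanics behind these layouts, roughly: keep retrying with 
power-of-2 fractions until the request is covered. A conceptual sketch 
only - the real logic lives in zio_write_allocate_gang_members() [14], 
and the 512-byte gang header allocation is omitted (space_map_remove() 
is the existing helper that deletes a range from the map):

    /*
     * Conceptual sketch of ganging (not the actual zio code): when
     * the full-size allocation fails, retry with power-of-2
     * fractions until the request is covered.
     */
    static int
    gang_alloc(space_map_t *sm, uint64_t size, uint64_t *offsets, int *count)
    {
            uint64_t remaining = size;
            uint64_t trysize = size >> 1;   /* first retry at half size */
            int n = 0;

            while (remaining > 0) {
                    uint64_t off;

                    while (trysize > remaining)
                            trysize >>= 1;
                    if (trysize < SPA_MINBLOCKSIZE)
                            return (-1);    /* cannot gang any finer */

                    off = metaslab_ff_alloc(sm, trysize);
                    if (off == -1ULL) {
                            trysize >>= 1;  /* halve and try again */
                            continue;
                    }
                    space_map_remove(sm, off, trysize); /* claim the range */
                    offsets[n++] = off;
                    remaining -= trysize;
            }
            *count = n;
            return (0);
    }

For the 128k request above this yields two 64k members when the 
64k-aligned cursor cooperates, and one 64k plus two 32k members when it 
doesn't - exactly the extra fragmentation described.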

Another option would be to walk through the space map, taking as much 
of each free space segment as needed to satisfy the allocation. With 
our example it would look like this:

             GH      GM1
A:[0-512)[512-1k)[1k-129k)           [255.5k-256k)
F:                       [129k-255.5k)

But do we really need the overhead of a gang header in this case? It 
looks like we can definitely do better, and it is a bit early to "stop 
looking and start ganging" (see [13]). It may be better to look smarter 
instead.

The potential number of free space segments in a space map grows with 
the size of the space map. Since the number of metaslabs per vdev is 
roughly fixed (see [15]), larger vdevs have larger space maps and a 
larger potential number of free space segments in them, and thus higher 
chances of worst-case behaviour. This could be mitigated by increasing 
the number of metaslabs per vdev, but that would bring other 
difficulties.
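
For reference, the idea behind the vdev_ms_shift calculation [15], as a 
simplified sketch (the ~200 metaslab target and the exact clamping are 
approximate):

    /*
     * Simplified sketch of the vdev_ms_shift calculation [15]:
     * aim for roughly 200 metaslabs per vdev, but never let a
     * metaslab get smaller than the maximum block size.  Larger
     * vdevs thus get larger metaslabs, not more of them.
     */
    vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
    vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
    vd->vdev_ms_count = vd->vdev_asize >> vd->vdev_ms_shift;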

Requirements:
=============
Alternative implementations of the block allocation functions and/or 
the space map data structure could make the allocation procedure a 
better space map citizen, with an O(log N) bound, and could provide 
additional information such as the size of the largest free segment in 
the space map. This may translate into more compact space maps, 
improvements in metaslab selection, more timely and efficient 
defragmentation, and other benefits for the performance and 
predictability of ZFS.
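
As one illustration, a second AVL tree over the same segments, ordered 
by (size, start), would give both properties at once. A sketch - the 
size-ordered tree (sm_size_root below) is hypothetical, not existing 
code:

    static uint64_t
    metaslab_bf_alloc(space_map_t *sm, uint64_t size)
    {
            /* sm_size_root: hypothetical tree ordered by (size, start) */
            avl_tree_t *t = sm->sm_size_root;
            space_seg_t ssearch, *ss;
            avl_index_t where;

            ssearch.ss_start = 0;
            ssearch.ss_end = size;  /* find smallest segment >= size */

            ss = avl_find(t, &ssearch, &where);
            if (ss == NULL)
                    ss = avl_nearest(t, where, AVL_AFTER);
            if (ss == NULL)
                    return (-1ULL); /* nothing big enough - gang now */

            /* Single O(log N) lookup; caller removes the chosen range. */
            return (ss->ss_start);
    }

    static uint64_t
    metaslab_maxsize(space_map_t *sm)
    {
            /* largest free segment == last node of the size-ordered tree */
            space_seg_t *ss = avl_last(sm->sm_size_root);

            return (ss == NULL ? 0 : ss->ss_end - ss->ss_start);
    }

metaslab_maxsize() is the piece that would let metaslab selection skip 
metaslabs that cannot possibly satisfy an allocation, instead of 
discovering that the hard way inside the allocator.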


Motivation:
===========
Better block selection policies will provide better and/or more 
predictable performance with empty, full, or fragmented storage pools, 
will help use current ZFS features such as snapshots more effectively, 
and will also make ZFS friendlier to different kinds of storage devices 
- flash cards, thin-provisioned storage, etc.

Links:
======

[1] Jeff Bonwick, "Space Maps"
http://blogs.sun.com/bonwick/entry/space_maps

[2] Jeff Bonwick, "ZFS Block Allocation"
http://blogs.sun.com/bonwick/entry/zfs_block_allocation

[3] nfsd threads hang in ZFS
http://www.opensolaris.org/jive/thread.jspa?messageID=43565

[4] Snapshot impact on performance
http://www.opensolaris.org/jive/thread.jspa?messageID=64379

[5] NFS/ZFS performance problems - txg_wait_open() deadlocks?
http://www.opensolaris.org/jive/thread.jspa?messageID=92885

[6] zpool export consumes whole CPU and takes more than 30 minutes to 
complete
http://www.opensolaris.org/jive/thread.jspa?messageID=93159

[7] Best Practises => Keep Pool Below 80%?
http://www.opensolaris.org/jive/thread.jspa?messageID=93210

[8] ZFS performance problems - solved
http://www.opensolaris.org/jive/thread.jspa?messageID=102762

[9] 120473-05
http://www.opensolaris.org/jive/thread.jspa?messageID=109078

[10] ZFS performance and memory consumption
http://www.opensolaris.org/jive/thread.jspa?messageID=135184

[11] ZFS pool fragmentation
http://www.opensolaris.org/jive/thread.jspa?messageID=136349

[12] 6495013 Loops and recursion in metaslab_ff_alloc can kill 
performance, even on a pool with lots of free data
http://bugs.opensolaris.org/view_bug.do?bug_id=6495013

[13] 6596237 Stop looking and start ganging
http://bugs.opensolaris.org/view_bug.do?bug_id=6596237

[14] zio_write_allocate_gang_members()
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#1248

[15] vdev_ms_shift calculation in vdev_init()
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev.c#1113

