now this is getting interesting :-)...

On Dec 30, 2009, at 12:13 PM, Mike Gerdts wrote:

On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling
<richard.ell...@gmail.com> wrote:
On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote:

Devzero,

Unfortunately that was my assumption as well. I don't have source-level
knowledge of ZFS, but based on what I know there wouldn't be an easy way
to do it. I'm not even sure it's only a technical question rather than a
design question, which would make it even less feasible.

It is not hard: ZFS knows the current free list, so walking that list
and telling the storage about the freed blocks is straightforward.
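
To make that concrete, here is a minimal sketch of what "walk the free
list and tell the storage" amounts to.  This is not ZFS code; the extent
list and the notify_storage_freed() stand-in for a SCSI UNMAP / ATA TRIM
are invented for illustration.

#include <stdio.h>
#include <stdint.h>

typedef struct free_extent {
    uint64_t            fe_offset;   /* byte offset on the vdev */
    uint64_t            fe_length;   /* length of the free run */
    struct free_extent *fe_next;
} free_extent_t;

/* Stand-in for issuing a SCSI UNMAP / ATA TRIM for one free extent. */
static void
notify_storage_freed(uint64_t offset, uint64_t length)
{
    printf("UNMAP offset=%llu length=%llu\n",
        (unsigned long long)offset, (unsigned long long)length);
}

/* Walk the free list and report every free extent to the backing store. */
static void
walk_free_list(const free_extent_t *head)
{
    const free_extent_t *fe;

    for (fe = head; fe != NULL; fe = fe->fe_next)
        notify_storage_freed(fe->fe_offset, fe->fe_length);
}

int
main(void)
{
    free_extent_t e2 = { 88080384, 44040192, NULL };   /* 42 MB run */
    free_extent_t e1 = { 0,        44040192, &e2 };    /* 42 MB run */

    walk_free_list(&e1);
    return (0);
}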

What is hard is figuring out whether this would actually improve life.
The reason I say this is that people like to use snapshots and clones on
ZFS. If you keep snapshots, then you aren't freeing blocks, so the free
list doesn't grow. This is a very different use case from UFS, as an
example.

It seems as though the oft-mentioned block rewrite capabilities needed
for pool shrinking and for changing things like compression, encryption,
and deduplication would also show a benefit here.  That is, blocks would
be rewritten in such a way as to minimize the number of chunks of
storage that are allocated.  The current HDS chunk size is 42 MB.
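
To put rough, made-up numbers on it: 100 GB of live data fits in about
2,440 of those 42 MB chunks if it is packed densely, but if the same
data is scattered so that each allocated chunk is only a quarter full,
roughly 9,750 chunks (~400 GB of thin-provisioned space) stay pinned.
A rewrite pass that compacts the live blocks would let the other ~7,300
chunks (~300 GB) be handed back to the array.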

Good observation, Mike. ZFS divides a leaf vdev into approximately 200
metaslabs. Space is allocated in a metaslab, and at some point another
metaslab will be chosen.  The assumption is that the outer tracks of a
disk have higher bandwidth than the inner tracks, so allocations are
biased towards lower-numbered metaslabs.  Let's ignore, for the moment,
that SSDs, and to some degree RAID arrays, don't exhibit this behavior.
OK, so here's how it works, in a nutshell.

Space is allocated in the same metaslab until it fills or becomes
"fragmented", and then the next metaslab is used.  You can see this
in my "Spacemaps from Space" blog,
http://blogs.sun.com/relling/entry/space_maps_from_space
where, in the lower-numbered tracks (towards the bottom), you can see
occasional small blank areas.  Note to self: a better picture would be
useful :-)
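
For a back-of-the-envelope feel, a toy model of the division into ~200
metaslabs and the bias towards lower-numbered ones might look like the
following.  The weighting function and the 1 TB vdev size are invented
for illustration; this is not the actual ZFS allocator code.

#include <stdint.h>
#include <stdio.h>

#define METASLABS_PER_VDEV 200

/*
 * Toy weighting: prefer metaslabs with more free space, scaled down
 * slightly as the metaslab index (distance from the outer tracks) grows.
 */
static uint64_t
metaslab_weight(uint64_t free_space, int ms_id)
{
    return ((free_space * (2 * METASLABS_PER_VDEV - ms_id)) /
        (2 * METASLABS_PER_VDEV));
}

int
main(void)
{
    uint64_t vdev_size = 1ULL << 40;                /* 1 TB leaf vdev */
    uint64_t ms_size = vdev_size / METASLABS_PER_VDEV;

    printf("metaslab size: %llu bytes\n", (unsigned long long)ms_size);
    printf("weight of empty metaslab 0:   %llu\n",
        (unsigned long long)metaslab_weight(ms_size, 0));
    printf("weight of empty metaslab 199: %llu\n",
        (unsigned long long)metaslab_weight(ms_size, 199));
    return (0);
}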

Note: copies are intentionally spread to other, distant metaslabs for
diversity.

Inside the metaslab, space is allocated on a first-fit basis until the
space is mostly consumed, at which point the algorithm changes to
best-fit.

The algorithm for these two decisions was changed in b129, in an
effort to improve performance.
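
In case it helps to see the difference, here is a toy illustration of
first-fit versus best-fit over the same set of free extents.  The extent
list is invented; this has nothing to do with the actual space map code.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t start;
    uint64_t size;
} extent_t;

/* First-fit: take the first free extent that is large enough. */
static int
first_fit(const extent_t *fl, int n, uint64_t want)
{
    int i;

    for (i = 0; i < n; i++)
        if (fl[i].size >= want)
            return (i);
    return (-1);
}

/* Best-fit: take the smallest free extent that still fits. */
static int
best_fit(const extent_t *fl, int n, uint64_t want)
{
    int i, best = -1;

    for (i = 0; i < n; i++) {
        if (fl[i].size < want)
            continue;
        if (best == -1 || fl[i].size < fl[best].size)
            best = i;
    }
    return (best);
}

int
main(void)
{
    extent_t fl[] = {
        { 0,       1048576 },   /* 1 MB hole */
        { 2097152, 131072  },   /* 128 KB hole */
        { 4194304, 262144  },   /* 256 KB hole */
    };
    uint64_t want = 131072;     /* 128 KB allocation */

    printf("first-fit picks extent %d\n", first_fit(fl, 3, want));
    printf("best-fit picks extent  %d\n", best_fit(fl, 3, want));
    return (0);
}

First-fit is cheap and keeps the allocation cursor streaming forward;
best-fit packs the remaining holes more tightly at the cost of more
searching, which is one reason to hold it in reserve until the metaslab
is mostly full.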

So, the questions that arise are:
Should the allocator be made aware of the chunk size of virtual
storage vdevs?  [hint: there is evidence of the intention to permit
different allocators in the source, but I dunno if there is an intent
to expose those through an interface.]

If the allocator can change, what sorts of policies should be
implemented?  Examples include:
        + should the allocator stick with best-fit and encourage more
           gangs when the vdev is virtual?
        + should the allocator be aware of an SSD's page size?  Is
           said page size available to an OS?
        + should the metaslab boundaries align with virtual storage
           or SSD page boundaries?  (a rough sketch of this idea is at
           the end of this message)

And, perhaps most important, how can this be done automatically
so that system administrators don't have to be rocket scientists
to make a good choice?
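
On the alignment question above: one possible policy is as simple as
rounding the allocation cursor up to the backing store's chunk or page
size.  A rough sketch follows; the 42 MB figure comes from the HDS
discussion above, and the SSD page size is just a placeholder, since as
noted it may not even be discoverable from the OS.

#include <stdint.h>
#include <stdio.h>

/* Round an offset up to the next multiple of the device's chunk size. */
static uint64_t
align_up(uint64_t off, uint64_t chunk)
{
    return (((off + chunk - 1) / chunk) * chunk);
}

int
main(void)
{
    uint64_t hds_chunk = 42ULL * 1024 * 1024;   /* 42 MB array chunk */
    uint64_t ssd_page  = 4096;                  /* hypothetical SSD page */
    uint64_t cursor    = 100ULL * 1024 * 1024 + 12345;

    printf("next chunk-aligned offset: %llu\n",
        (unsigned long long)align_up(cursor, hds_chunk));
    printf("next page-aligned offset:  %llu\n",
        (unsigned long long)align_up(cursor, ssd_page));
    return (0);
}

Whether anything like this belongs in the allocator, and whether the
right sizes can be discovered from the device without administrator
input, is exactly the "don't make them rocket scientists" problem.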

 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
