Ragnar Sundblad wrote:
On 2 jan 2010, at 13.10, Erik Trimble wrote
Joerg Schilling wrote:
the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD's internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previously-written-to blocks are actually no longer in use.
See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more
about "smart" vs "dumb" SSD controllers.
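To make the free-list bookkeeping concrete, here is a toy Python model (not real firmware; the class and method names are made up for illustration) of the point above: without TRIM the SSD still counts a deleted block as in use, and only a TRIM notification from the OS lets it move the block back to its free list.

```python
# Toy model of an SSD's block accounting. Without TRIM, the drive has no
# way to learn that block 3's contents were deleted by the filesystem.
class ToySSD:
    def __init__(self, num_blocks):
        self.free = set(range(num_blocks))   # blocks the SSD believes are unused
        self.in_use = set()

    def write(self, block):
        self.free.discard(block)
        self.in_use.add(block)

    def trim(self, blocks):
        # The OS tells the drive these blocks no longer hold live data.
        for b in blocks:
            self.in_use.discard(b)
            self.free.add(b)

ssd = ToySSD(8)
ssd.write(3)
ssd.write(5)
ssd.trim([3])            # file deleted; OS notifies the SSD
print(sorted(ssd.free))  # block 3 is reusable again, block 5 is not
```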
From ZFS's standpoint, the optimal configuration would be for the SSD to inform
ZFS of its PAGE size, and for ZFS to use this as the fundamental BLOCK size
for that device (i.e. all writes are in integer multiples of the SSD page
size). Reads could be in smaller sections, though. Which would be
interesting: ZFS would write in page-size increments, and read in block-size
amounts.
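The write side of that scheme amounts to rounding every write up to a whole number of device pages. A minimal sketch, assuming a hypothetical 4 KiB page size reported by the drive:

```python
PAGE = 4096  # hypothetical page size the SSD would report to the filesystem

def write_size(nbytes, page=PAGE):
    # Round a write up to an integer multiple of the SSD page size.
    pages = -(-nbytes // page)   # ceiling division
    return pages * page

print(write_size(1))       # 4096
print(write_size(4096))    # 4096
print(write_size(4097))    # 8192
```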
Well, this could be useful if updates are larger than the block size, for
example 512 K, as it is then possible to erase and rewrite without having to
copy around other data from the page. If updates are smaller, ZFS will have to
reclaim erased space by itself, which, if I am not mistaken, it cannot do today
(but probably will in some future; I guess BP rewrite is what is needed).
Sure, it does that today. What do you think happens on a standard COW
action? Let's be clear here: I'm talking about exactly the same thing
that currently happens when you modify a ZFS "block" that spans multiple
vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC,
the modifications made, then it is written back to storage, likely in
another LBA. The original ZFS block location ON THE VDEV is now
available for re-use (i.e. the vdev adds it to its free block list).
This is one of the things that leads to ZFS's fragmentation issues
(note, we're talking about block fragmentation on the vdev, not ZFS
block fragmentation), and something we're looking to BP rewrite for, to
enable defragging to be implemented.
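The COW relocation described above can be sketched as a toy allocator (names and structure are illustrative only, not ZFS internals): rewriting a block places it at a fresh LBA and returns the old LBA to the free list, which is exactly how the free space stops being contiguous.

```python
# Toy copy-on-write allocator: a rewrite never updates in place; it
# allocates a new LBA and frees the old one, fragmenting free space.
class ToyVdev:
    def __init__(self, size):
        self.free = list(range(size))  # free LBAs, lowest first
        self.live = {}                 # logical block -> current LBA

    def cow_write(self, logical):
        new_lba = self.free.pop(0)
        old_lba = self.live.get(logical)
        self.live[logical] = new_lba
        if old_lba is not None:
            self.free.append(old_lba)  # old location back on the free list
        return new_lba

v = ToyVdev(4)
v.cow_write("A")   # A lands at LBA 0
v.cow_write("B")   # B lands at LBA 1
v.cow_write("A")   # rewrite: A moves to LBA 2, LBA 0 is freed
print(v.live)      # {'A': 2, 'B': 1}
print(v.free)      # [3, 0] -- the free space is no longer contiguous
```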
In fact, I would argue that the biggest advantage of removing any
advanced intelligence from the SSD controller is with small
modifications to existing files. By using the L2ARC (and other
features, like compression, encryption, and dedup), ZFS can composite
the needed changes with an existing cached copy of the ZFS block(s) to
be changed, then issue a full new block write to the SSD. This
eliminates the need for the SSD to do the dreaded Read-Modify-Write
cycle, and instead do just a Write. In this scenario, the ZFS block is
likely larger than the SSD Page size, so more data will need to be
written; however, given the highly parallel nature of SSDs, writing
several SSD pages simultaneously is easy (and fast); let's remember
that a ZFS block is a maximum of only 8x the size of an SSD page, and
writing 8 pages is only slightly more work than writing 1 page. This
larger write is all a single IOP, where a R-M-W essentially requires 3
IOPS. If you want the SSD controller to do the work, then it ALWAYS has
to read the to-be-modified page from NAND, do the mod itself, then issue
the write - and, remember, ZFS likely has already issued a full
ZFS-block write (due to the COW nature of ZFS, there is no concept of
"just change this 1 bit and leave everything else on disk where it is"),
so you likely don't save on the number of pages that need to be written
in any case.
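The composite-then-write path argued for above can be sketched in a few lines (a toy model, not ZFS code): the small modification is spliced into the cached copy of the block in host RAM, and the SSD then sees a single full-block write instead of an internal read-modify-write.

```python
# Sketch of host-side compositing: patch the cached block in RAM, then
# issue one full-block write. The drive never has to read-modify-write.
def composite_write(cached_block: bytes, offset: int, patch: bytes) -> bytes:
    # All modification happens in host memory (the ARC/L2ARC copy).
    return cached_block[:offset] + patch + cached_block[offset + len(patch):]

block = b"A" * 16                       # stand-in for a cached ZFS block
updated = composite_write(block, 4, b"zz")
print(updated)                          # b'AAAAzzAAAAAAAAAA'
# Device-visible cost: 1 write, versus read + modify + write in the controller.
```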
I am still not entirely convinced that it would be better to let the file
system take care of that instead of a flash controller; there could be quite a
lot of reading and writing going on for space reclamation (depending on the
workload, of course).
/ragge
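Ragnar's concern about reclamation traffic is essentially a write-amplification question: to reclaim an erase block that still holds live pages, those pages must first be copied elsewhere, so each reclaimed block costs extra writes. A back-of-the-envelope sketch (the numbers are illustrative, not measured):

```python
# Rough write-amplification arithmetic for space reclamation: live pages
# in a victim erase block must be relocated before the block can be erased.
def write_amplification(pages_per_block, live_fraction):
    live = pages_per_block * live_fraction    # pages that must be copied out
    freed = pages_per_block - live            # capacity actually reclaimed
    # Amplification = (new data + relocated data) / new data
    return (freed + live) / freed

print(write_amplification(128, 0.0))  # 1.0 -- victim block already empty
print(write_amplification(128, 0.5))  # 2.0 -- half the pages must move first
```

Whether this bookkeeping runs in the filesystem or in the controller, the copying itself does not go away; the question is only who has the better information to minimize it.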
The point here is that regardless of the workload, there's a R-M-W cycle
that has to happen, whether that occurs at the ZFS level or at the SSD
level. My argument is that the OS has a far better view of the whole
data picture, and access to much higher performing caches (i.e.
RAM/registers) than the SSD, so not only can the OS make far better
decisions about the data and how (and how much of) it should be stored,
but it's almost certainly able to do so far faster than any little
SSD controller can.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss