Ragnar Sundblad wrote:
On 2 Jan 2010, at 13.10, Erik Trimble wrote:
Joerg Schilling wrote:
The TRIM command is the mechanism by which an OS notifies the SSD which blocks have been deleted, so the SSD's internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary because only the OS has the information to determine which previously-written-to blocks are actually no longer in use.
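As a rough illustration (a toy model I'm making up for this email, not real firmware: actual controllers track flash pages and erase blocks, not a flat set of LBAs), TRIM just moves blocks from the in-use set back to the free list on the OS's say-so:

```python
# Toy model of SSD block accounting (hypothetical, heavily simplified).
class ToySSD:
    def __init__(self, num_blocks):
        self.free = set(range(num_blocks))  # blocks available for new writes
        self.in_use = set()                 # blocks holding (what the SSD
                                            # thinks is) live data

    def write(self, lba):
        # A write consumes a free block and marks it live.
        self.free.discard(lba)
        self.in_use.add(lba)

    def trim(self, lbas):
        # TRIM: the OS says these blocks no longer hold live data, so the
        # SSD can return them to the free list without guessing.
        for lba in lbas:
            if lba in self.in_use:
                self.in_use.remove(lba)
                self.free.add(lba)

ssd = ToySSD(8)
ssd.write(3)
ssd.write(5)
ssd.trim([3])
print(sorted(ssd.in_use))  # [5]
print(len(ssd.free))       # 7
```

Without the trim() call, block 3 would sit on the in-use set forever from the SSD's point of view, even though the filesystem deleted the file long ago.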

See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more 
about "smart" vs "dumb" SSD controllers.

From ZFS's standpoint, the optimal configuration would be for the SSD to inform 
ZFS of its PAGE size, and ZFS would use this as the fundamental BLOCK size 
for that device (i.e. all writes would be integer multiples of the SSD page 
size).  Reads could be in smaller sections, though.  Which would be 
interesting:  ZFS would write in Page Size increments, and read in Block Size 
amounts.
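The alignment part of that is trivial arithmetic; sketched out (the 4 KB page size is just an assumed example, since the whole point is that the SSD would report the real value):

```python
# Round a write up to an integer multiple of the SSD page size, as ZFS
# could do if the device reported that size (page_size here is assumed).
def aligned_write_size(nbytes, page_size=4096):
    # Ceiling division via negated floor division.
    return -(-nbytes // page_size) * page_size

print(aligned_write_size(1))     # 4096
print(aligned_write_size(4096))  # 4096
print(aligned_write_size(5000))  # 8192
```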

Well, this could be useful if updates are larger than the block size, for 
example 512 K, as it is then possible to erase and rewrite without having to 
copy around other data from the page. If updates are smaller, zfs will have to 
reclaim erased space by itself, which if I am not mistaken it can not do today 
(but probably will in some future, I guess the BP Rewrite is what is needed).
Sure, it does that today. What do you think happens on a standard COW action? Let's be clear here: I'm talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications are made, and then it is written back to storage, likely at another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to its Free Block List). This is one of the things that leads to ZFS's fragmentation issues (note, we're talking about block fragmentation on the vdev, not ZFS block fragmentation), and one of the things we're looking to BP rewrite for, to enable defragging to be implemented.
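In toy form (again an invented sketch, nothing like the real allocator), the COW recycling above looks like this: a modification never lands in place, it allocates a fresh location and puts the old one back on the free list.

```python
# Toy copy-on-write allocator (hypothetical, heavily simplified): modifying
# a "block" allocates a new location and frees the old one.
class ToyVdev:
    def __init__(self, nblocks):
        self.free_list = list(range(nblocks))
        self.data = {}  # location -> contents

    def cow_write(self, old_loc, new_contents):
        new_loc = self.free_list.pop(0)    # allocate a fresh location
        self.data[new_loc] = new_contents  # full-block write, never in place
        if old_loc is not None:
            del self.data[old_loc]
            self.free_list.append(old_loc) # old location becomes reusable
        return new_loc

v = ToyVdev(4)
loc = v.cow_write(None, "v1")   # initial write lands at location 0
loc2 = v.cow_write(loc, "v2")   # "modify" lands at 1, frees 0
print(loc, loc2)                # 0 1
print(v.free_list)              # [2, 3, 0]
```

That trailing 0 on the free list, out of order behind 2 and 3, is exactly the kind of recycling that produces the on-vdev fragmentation mentioned above.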

In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller shows up with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD Page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast). Let's remember that a ZFS block is at most only 8x the size of an SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where an R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the modification itself, then issue the write - and, remember, ZFS has likely already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don't save on the number of pages that need to be written in any case.
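The back-of-the-envelope numbers behind that (illustrative figures only, not measurements; the 16 KB page size is an assumption chosen to make the 8x ratio come out):

```python
# Illustrative arithmetic for the host-side-write vs. device-side-R-M-W
# comparison above. All sizes are assumed example values.
zfs_block = 128 * 1024   # 128 KB ZFS block (the common max recordsize)
ssd_page = 16 * 1024     # assumed 16 KB flash page

pages_per_block = zfs_block // ssd_page
print(pages_per_block)   # 8 pages, written in parallel

iops_host_side = 1       # ZFS composites in RAM/L2ARC, issues one write
iops_device_side = 3     # controller must Read, Modify, then Write
print(iops_host_side, iops_device_side)  # 1 3
```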


I am still not entirely convinced that it would be better to let the file 
system take care of that instead of a flash controller, there could be quite a 
lot of reading and writing going on for space reclamation (depending on the 
work load, of course).

/ragge
The point here is that regardless of the workload, there's an R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher-performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it can almost certainly do so far faster than any little SSD controller can.
--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
