Ragnar Sundblad wrote:
On 2 jan 2010, at 13.10, Erik Trimble wrote
Joerg Schilling wrote:
the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD's internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previously-written-to blocks are actually no longer in use.
See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more
about "smart" vs "dumb" SSD controllers.
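To make the free-list bookkeeping concrete, here is a toy Python model (not real firmware; the class and method names are made up for illustration) of the point above: without TRIM the SSD still counts a deleted block as in use, and only a TRIM notification from the OS lets it move the block back to its free list.

```python
# Toy model of an SSD's block accounting. Without TRIM, the drive has no
# way to learn that block 3's contents were deleted by the filesystem.
class ToySSD:
    def __init__(self, num_blocks):
        self.free = set(range(num_blocks))   # blocks the SSD believes are unused
        self.in_use = set()

    def write(self, block):
        self.free.discard(block)
        self.in_use.add(block)

    def trim(self, blocks):
        # The OS tells the drive these blocks no longer hold live data.
        for b in blocks:
            self.in_use.discard(b)
            self.free.add(b)

ssd = ToySSD(8)
ssd.write(3)
ssd.write(5)
ssd.trim([3])            # file deleted; OS notifies the SSD
print(sorted(ssd.free))  # block 3 is reusable again, block 5 is not
```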
From ZFS's standpoint, the optimal configuration would be for the SSD to inform
ZFS of its PAGE size, and for ZFS to use this as the fundamental BLOCK size
for that device (i.e. all writes are in integer multiples of the SSD page
size). Reads could be in smaller sections, though. Which would be
interesting: ZFS would write in page-size increments, and read in block-size
amounts.
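The write side of that scheme amounts to rounding every write up to a whole number of device pages. A minimal sketch, assuming a hypothetical 4 KiB page size reported by the drive:

```python
PAGE = 4096  # hypothetical page size the SSD would report to the filesystem

def write_size(nbytes, page=PAGE):
    # Round a write up to an integer multiple of the SSD page size.
    pages = -(-nbytes // page)   # ceiling division
    return pages * page

print(write_size(1))       # 4096
print(write_size(4096))    # 4096
print(write_size(4097))    # 8192
```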
Well, this could be useful if updates are larger than the block size, for
example 512 K, as it is then possible to erase and rewrite without having to
copy around other data from the page. If updates are smaller, ZFS will have to
reclaim erased space by itself, which, if I am not mistaken, it cannot do today
(but probably will in some future; I guess BP rewrite is what is needed).
Sure, it does that today. What do you think happens on a standard COW
action? Let's be clear here: I'm talking about exactly the same thing
that currently happens when you modify a ZFS "block" that spans multiple
vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC,
the modifications made, then it is written back to storage, likely in
another LBA. The original ZFS block location ON THE VDEV is now
available for re-use (i.e. the vdev adds it to its free block list).
This is one of the things that leads to ZFS's fragmentation issues
(note, we're talking about block fragmentation on the vdev, not ZFS
block fragmentation), and something we're looking to BP rewrite for, to
enable defragging to be implemented.
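The COW relocation described above can be sketched as a toy allocator (names and structure are illustrative only, not ZFS internals): rewriting a block places it at a fresh LBA and returns the old LBA to the free list, which is exactly how the free space stops being contiguous.

```python
# Toy copy-on-write allocator: a rewrite never updates in place; it
# allocates a new LBA and frees the old one, fragmenting free space.
class ToyVdev:
    def __init__(self, size):
        self.free = list(range(size))  # free LBAs, lowest first
        self.live = {}                 # logical block -> current LBA

    def cow_write(self, logical):
        new_lba = self.free.pop(0)
        old_lba = self.live.get(logical)
        self.live[logical] = new_lba
        if old_lba is not None:
            self.free.append(old_lba)  # old location back on the free list
        return new_lba

v = ToyVdev(4)
v.cow_write("A")   # A lands at LBA 0
v.cow_write("B")   # B lands at LBA 1
v.cow_write("A")   # rewrite: A moves to LBA 2, LBA 0 is freed
print(v.live)      # {'A': 2, 'B': 1}
print(v.free)      # [3, 0] -- the free space is no longer contiguous
```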
In fact, I would argue that the biggest advantage of removing any
advanced intelligence from the SSD controller is with small
modifications to existing files. By using the L2ARC (and other
features, like compression, encryption, and dedup), ZFS can composite
the needed changes with an existing cached copy of the ZFS block(s) to
be changed, then issue a full new block write to the SSD. This
eliminates the need for the SSD to do the dreaded Read-Modify-Write
cycle, and instead do just a Write. In this scenario, the ZFS block is
likely larger than the SSD Page size, so more data will need to be
written; however, given the highly parallel nature of SSDs, writing
several SSD pages simultaneously is easy (and fast); let's remember
that a ZFS block is a maximum of only 8x the size of an SSD page, and
writing 8 pages is only slightly more work than writing 1 page. This
larger write is all a single IOP, where a R-M-W essentially requires 3
IOPS. If you want the SSD controller to do the work, then it ALWAYS has
to read the to-be-modified page from NAND, do the mod itself, then issue
the write - and, remember, ZFS likely has already issued a full
ZFS-block write (due to the COW nature of ZFS, there is no concept of
"just change this 1 bit and leave everything else on disk where it is"),
so you likely don't save on the number of pages that need to be written
in any case.
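The composite-then-write path argued for above can be sketched in a few lines (a toy model, not ZFS code): the small modification is spliced into the cached copy of the block in host RAM, and the SSD then sees a single full-block write instead of an internal read-modify-write.

```python
# Sketch of host-side compositing: patch the cached block in RAM, then
# issue one full-block write. The drive never has to read-modify-write.
def composite_write(cached_block: bytes, offset: int, patch: bytes) -> bytes:
    # All modification happens in host memory (the ARC/L2ARC copy).
    return cached_block[:offset] + patch + cached_block[offset + len(patch):]

block = b"A" * 16                       # stand-in for a cached ZFS block
updated = composite_write(block, 4, b"zz")
print(updated)                          # b'AAAAzzAAAAAAAAAA'
# Device-visible cost: 1 write, versus read + modify + write in the controller.
```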
I am still not entirely convinced that it would be better to let the file
system take care of that instead of a flash controller; there could be quite a
lot of reading and writing going on for space reclamation (depending on the
workload, of course).
/ragge
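Ragnar's concern about reclamation traffic is essentially a write-amplification question: to reclaim an erase block that still holds live pages, those pages must first be copied elsewhere, so each reclaimed block costs extra writes. A back-of-the-envelope sketch (the numbers are illustrative, not measured):

```python
# Rough write-amplification arithmetic for space reclamation: live pages
# in a victim erase block must be relocated before the block can be erased.
def write_amplification(pages_per_block, live_fraction):
    live = pages_per_block * live_fraction    # pages that must be copied out
    freed = pages_per_block - live            # capacity actually reclaimed
    # Amplification = (new data + relocated data) / new data
    return (freed + live) / freed

print(write_amplification(128, 0.0))  # 1.0 -- victim block already empty
print(write_amplification(128, 0.5))  # 2.0 -- half the pages must move first
```

Whether this bookkeeping runs in the filesystem or in the controller, the copying itself does not go away; the question is only who has the better information to minimize it.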
The point here is that regardless of the workload, there's a R-M-W cycle
that has to happen, whether that occurs at the ZFS level or at the SSD
level. My argument is that the OS has a far better view of the whole
data picture, and access to much higher performing caches (i.e.
RAM/registers) than the SSD, so not only can the OS make far better
decisions about the data and how (and how much of) it should be stored,
but it's almost certainly able to do so far faster than any little
SSD controller can.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss