Ragnar Sundblad wrote:
On 2 jan 2010, at 22.49, Erik Trimble wrote:

Ragnar Sundblad wrote:
On 2 jan 2010, at 13.10, Erik Trimble wrote:
Joerg Schilling wrote:
   The TRIM command is how an OS notifies the SSD which blocks have been 
deleted/erased, so the SSD's internal free list can be updated (that is, it 
allows formerly-in-use blocks to be moved to the free list).  This is 
necessary since only the OS has the information to determine which 
previously-written-to blocks are actually no longer in use.
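
To spell out what TRIM accomplishes conceptually, here is a toy sketch in C (purely illustrative - not the real ATA/SCSI command interface, and not any real controller's data structures): the OS names LBA ranges that no longer hold live data, and the drive puts them back on its internal free list.

#include <stdint.h>
#include <stdbool.h>

#define DEV_BLOCKS 1024                   /* toy device: 1024 logical blocks */

static bool on_free_list[DEV_BLOCKS];     /* the SSD's internal free list, as a bitmap */

/* Device-side effect of a TRIM-style notification for an LBA range. */
void trim_range(uint32_t start_lba, uint32_t count)
{
    for (uint32_t lba = start_lba; lba < start_lba + count && lba < DEV_BLOCKS; lba++)
        on_free_list[lba] = true;         /* formerly-in-use block becomes reusable */
}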

See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more 
about "smart" vs "dumb" SSD controllers.

From ZFS's standpoint, the optimal configuration would be for the SSD to inform 
ZFS of its PAGE size, and ZFS would use this as the fundamental BLOCK size 
for that device (i.e. all writes are in integer multiples of the SSD page 
size).  Reads could be in smaller sections, though.  Which would be 
interesting:  ZFS would write in Page Size increments, and read in Block Size 
amounts.
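
To make that concrete, a minimal sketch (illustrative names only, not real ZFS or device interfaces) of what "write in page-size increments, read in block-size amounts" would look like:

#include <stdint.h>

static uint32_t round_up(uint32_t len, uint32_t unit)
{
    return ((len + unit - 1) / unit) * unit;
}

/* Writes: always an integer multiple of the device-reported page size. */
uint32_t write_granule(uint32_t len, uint32_t ssd_page_size)
{
    return round_up(len, ssd_page_size);   /* e.g. 100 KB -> 128 KB with 128 KB pages */
}

/* Reads: may still be issued in smaller, block-sized sections. */
uint32_t read_granule(uint32_t len, uint32_t ssd_block_size)
{
    return round_up(len, ssd_block_size);  /* e.g. 100 KB stays 100 KB with 4 KB blocks */
}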
Well, this could be useful if updates are larger than the block size, for 
example 512 K, as it is then possible to erase and rewrite without having to 
copy around other data from the page. If updates are smaller, zfs will have to 
reclaim erased space by itself, which, if I am not mistaken, it cannot do today 
(but probably will at some point in the future; I guess BP Rewrite is what is needed).
Sure, it does that today. What do you think happens on a standard COW action?  Let's be 
clear here:  I'm talking about exactly the same thing that currently happens when you 
modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ).   The entire 
ZFS block is read from disk/L2ARC, the modifications made, then it is written back to 
storage, likely at another LBA. The original ZFS block location ON THE VDEV is now 
available for re-use (i.e. the vdev adds it to its Free Block List).   This is one of 
the things that leads to ZFS's fragmentation issues (note, we're talking about block 
fragmentation on the vdev, not ZFS block fragmentation), and something we're looking 
to BP rewrite for, to enable defragging to be implemented.
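
As a toy model of that COW step (illustrative only - this is not ZFS code, and the names are made up): the modified block is written to a freshly allocated vdev offset, and the old extent goes back on the vdev's free list, which is exactly where the free-space fragmentation comes from.

#include <stdint.h>
#include <string.h>

struct extent { uint64_t offset; uint64_t len; };

struct vdev {
    uint8_t      *media;          /* pretend backing store */
    struct extent free_list[64];  /* the vdev's Free Block List */
    int           nfree;
    uint64_t      next_alloc;     /* bump allocator, good enough for a sketch */
};

/* Rewrite one ZFS block: allocate a new location, write the whole block
 * there, and return the old location to the free list. */
uint64_t cow_rewrite(struct vdev *vd, struct extent old, const void *block)
{
    uint64_t new_off = vd->next_alloc;        /* "likely another LBA" */
    vd->next_alloc += old.len;
    memcpy(vd->media + new_off, block, old.len);
    if (vd->nfree < 64)
        vd->free_list[vd->nfree++] = old;     /* old extent is free again */
    return new_off;                           /* the block pointer now points here */
}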

What I am talking about is to be able to reuse the free space
you get in the previously written data when you write modified
data to new places on the disk, or just remove a file for that
matter. To be able to reclaim that space with flash, you have
to erase large pages (for example 512 KB), but before you erase,
you will also have to save away all still valid data in that
page and rewrite that to a free page. What I am saying is that
I am not sure that this would be best done in the file system,
since it could be quite a bit of data to shuffle around, and
there could possibly be hardware specific optimizations that
could be done here that zfs wouldn't know about. A good flash
controller could probably do it much better. (And a bad one
worse, of course.)
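
To illustrate the kind of shuffling I mean, a toy sketch (sizes and layout purely illustrative, not any particular controller): before a 512 KB erase page can be reclaimed, every still-valid 4 KB block in it has to be copied out to a free page first.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLOCK_SIZE      4096
#define BLOCKS_PER_PAGE 128            /* 128 * 4 KB = 512 KB erase page */

struct erase_page {
    uint8_t data[BLOCKS_PER_PAGE][BLOCK_SIZE];
    bool    valid[BLOCKS_PER_PAGE];    /* which blocks still hold live data */
};

/* Copy live blocks from 'victim' into 'dest' (assumed to have room starting
 * at index 'dest_used'), then erase 'victim'.  Returns how many blocks had
 * to be shuffled around just to reclaim the page. */
int reclaim_page(struct erase_page *victim, struct erase_page *dest, int dest_used)
{
    int moved = 0;
    for (int i = 0; i < BLOCKS_PER_PAGE; i++) {
        if (victim->valid[i]) {
            memcpy(dest->data[dest_used + moved], victim->data[i], BLOCK_SIZE);
            dest->valid[dest_used + moved] = true;
            moved++;
        }
    }
    memset(victim, 0, sizeof(*victim));   /* the actual erase */
    return moved;
}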
You certainly DO get to reuse the free space again. Here's what happens nowadays in an SSD:

Let's say I have 4k blocks, grouped into a 128k page. That is, the SSD's fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page.

So, I write a bit of data 100k in size. This occupies the first 25 blocks of one page. The remaining 7 blocks are still on the SSD's Free List (i.e. list of free space).

Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1-byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: a 1-byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into its local cache on the controller, apply the modification and the append, then write out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never-written-to) space on the drive, this 28-block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller).

For filesystems like ZFS, this is a whole lot of extra work being done that doesn't need to happen (and it chews up valuable IOPS and time). For, when ZFS does a write, it doesn't merely twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block & page sizes, which would then allow ZFS to better align its own variable block size to optimally coincide with the SSD's implementation.
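
The arithmetic of that example, spelled out (toy numbers from the description above, not from any particular drive):

#include <stdio.h>

int main(void)
{
    const int block_kb = 4, page_kb = 128;
    const int blocks_per_page = page_kb / block_kb;         /* 32 */

    int file_kb    = 100;
    int used       = (file_kb + block_kb - 1) / block_kb;   /* 25 blocks occupied */
    int still_free = blocks_per_page - used;                /* 7 blocks left in the page */

    int new_kb     = file_kb + 10;                          /* 1-byte edit plus 10 KB append */
    int rewritten  = (new_kb + block_kb - 1) / block_kb;    /* 28 blocks to write back out */

    printf("blocks/page=%d used=%d free=%d rewritten=%d\n",
           blocks_per_page, used, still_free, rewritten);
    /* The SSD-side R-M-W: read 25 blocks into controller cache, modify,
     * write 28 blocks to a fresh page, put the old 25 on the free list. */
    return 0;
}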


And as far as I know, zfs cannot do that today - it cannot
move around already-written data, not for defragmentation, not
for adding or removing disks to stripes/raidzs, not for
deduping/duping and so on, and I have understood that
BP Rewrite could solve a lot of this.
ZFS's propensity to fragmentation doesn't mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device's Free List. Now, in an SSD's case, this isn't a worry. Due to the completely even performance characteristics of NAND, it doesn't make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD. Access time is identical, and so is read time. SSDs don't care about this kind of fragmentation.

What SSDs have to worry about is sub-page fragmentation. Which brings us back to the whole R-M-W mess.


Still, it could certainly be useful if zfs could try to use a
blocksize that matches the SSD erase page size - this could
avoid having to copy and compact data before erasing, which
could speed up writes in a typical flash SSD disk.

In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let's remember that a ZFS block is at most 8x the size of an SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, whereas an R-M-W essentially requires 3 IOPS. If you want the SSD controller 
to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the modification 
itself, then issue the write - and, remember, ZFS has likely already issued a full 
ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change 
this 1 bit and leave everything else on disk where it is"), so you likely don't save 
on the number of pages that need to be written in any case.
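
As a back-of-the-envelope comparison (with made-up but representative sizes: a 128 KB ZFS block over 16 KB SSD pages, so at most 8 pages per block):

#include <stdio.h>

int main(void)
{
    const int zfs_block_kb = 128, ssd_page_kb = 16;
    const int pages_per_block = zfs_block_kb / ssd_page_kb;  /* at most 8 */

    /* Host-composited write: ZFS builds the whole new block in RAM/L2ARC and
     * issues a single write op; the drive programs the pages in parallel. */
    int host_side_ops = 1;

    /* Drive-side R-M-W: read the old data, modify in the controller's cache,
     * write the result - roughly three operations instead of one. */
    int drive_side_ops = 3;

    printf("pages per ZFS block: %d, host write: %d op, drive R-M-W: %d ops\n",
           pages_per_block, host_side_ops, drive_side_ops);
    return 0;
}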

I don't think many SSDs do R-M-W, but rather just append blocks
to free pages (pretty much as zfs works, if you will). They also
have to do some space reclamation (copying/compacting blocks and
erasing pages) in the background, of course.

MLC-based SSDs all do R-M-W. Now, they might not do Read-Modify-Erase-Write right away, but they'll do R-M-W on ANY write that modifies existing data (unless you are extremely lucky and your data exactly fills an existing page); the difference is that the final W goes to previously-unused NAND page(s). However, when the SSD runs out of never-used space, it starts having to add the E step on future writes.

So far as I know, no SSD does space reclamation in the manner you refer to. That is, the SSD controller isn't going to be moving data around on its own, with the exception of wear-leveling. TRIM is there so that the SSD can add things to its internal Free List more efficiently, but an SSD isn't going to say (on its own): "Ooh, page 1004 has only 5 of 10 blocks used, so why don't we merge it with page 20054, which has only 3 of 10 blocks used."
I am still not entirely convinced that it would be better to let the file 
system take care of that instead of a flash controller; there could be quite a 
lot of reading and writing going on for space reclamation (depending on the 
workload, of course).
/ragge
The point here is that regardless of the workload, there's a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher-performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it can almost certainly do so far faster than any little SSD controller can.

Well, inside the flash system you could possibly be in a much
better position to shuffle data around for space reclamation -
that is, copying and compacting data and erasing flash pages.
If the device has a good design, that is! If the SSD controller
is some small slow sad thing it might be better to shuffle it up
and down to the host and do it in the CPU, but I am not sure
about that either, since it typically is the very same slow
controller that does the host communication.
It's actually far more likely that a dumb SSD controller can handle high levels of pure data transfer faster than a smart SSD controller can actually manipulate that same data quickly. SSD controllers, by their very nature, need to be as small and cheap as possible, which means they have extremely limited computation ability. For a given compute-level controller, one which is only "dumb" has to worry about four things: wear leveling, bad block remapping, LBA->physical block mapping, and actual I/O transfer (i.e. managing data flow from the host to the NAND chips). A smart controller also has to worry about page alignment, page modification and rewriting, potentially RAID-like checksumming/parity, page/block fragmentation, and other things. So, if the compute amount is fixed, a dumb controller is going to be able to handle a /whole/ lot more I/O transfer than a smart controller. Which means, for the same level of I/O transfer, a dumb controller costs less than a smart controller.
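
A sketch of the "dumb" controller's bookkeeping listed above (entirely illustrative - no real firmware looks like this): wear counts, a bad-block map, and an LBA->physical map, with alignment, R-M-W, parity, and compaction all left to the host.

#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS 1024

struct dumb_ctrl {
    uint32_t lba_to_phys[NUM_BLOCKS];   /* LBA -> physical block map */
    uint32_t erase_count[NUM_BLOCKS];   /* wear-leveling bookkeeping */
    bool     bad[NUM_BLOCKS];           /* bad-block remapping input */
};

/* Pick the least-worn good physical block as the target for the next write
 * of 'lba'.  Assumes at least one good block exists. */
uint32_t remap_for_write(struct dumb_ctrl *c, uint32_t lba)
{
    uint32_t best = 0;
    while (c->bad[best])                /* skip over leading bad blocks */
        best++;
    for (uint32_t p = best + 1; p < NUM_BLOCKS; p++)
        if (!c->bad[p] && c->erase_count[p] < c->erase_count[best])
            best = p;
    c->lba_to_phys[lba] = best;
    c->erase_count[best]++;
    return best;                        /* everything else is just moving data */
}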


I certainly agree that there seems to be some redundancy when
the flash SSD controller does logging-file-system kind of work
under zfs, which does pretty much the same thing by itself, and it
could possibly be better to cut one of them (and not zfs).
I am still not convinced that it wouldn't be better to do this
in a good controller instead, just for speed and to take advantage
of new hardware that does this smarter than the devices of today.

Do you know how the F5100 works, for example?

/ragge
The point I'm making here is that the filesystem/OS can make all the same decisions that a good SSD controller can make, faster (as it has most of the data in local RAM or registers already), and with a global system viewpoint that the SSD simply can't have. Most importantly, it's essentially free for the OS to do so - it has the spare cycles and bandwidth. Putting this intelligence on the SSD costs money that is essentially wasted, not to mention being less efficient overall.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

