Ragnar Sundblad wrote:
On 2 jan 2010, at 22.49, Erik Trimble wrote:

Ragnar Sundblad wrote:
On 2 jan 2010, at 13.10, Erik Trimble wrote:
Joerg Schilling wrote:
   The TRIM command is how an OS notifies the SSD which blocks have been 
deleted/erased, so the SSD's internal free list can be updated (that is, it 
allows formerly-in-use blocks to be moved to the free list).  This is 
necessary since only the OS has the information to determine which 
previously-written-to blocks are actually no longer in use.
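
To spell out what TRIM accomplishes conceptually, here is a toy sketch in C (purely illustrative - not the real ATA/SCSI command interface, and not any real controller's data structures): the OS names LBA ranges that no longer hold live data, and the drive puts them back on its internal free list.

#include <stdint.h>
#include <stdbool.h>

#define DEV_BLOCKS 1024                   /* toy device: 1024 logical blocks */

static bool on_free_list[DEV_BLOCKS];     /* the SSD's internal free list, as a bitmap */

/* Device-side effect of a TRIM-style notification for an LBA range. */
void trim_range(uint32_t start_lba, uint32_t count)
{
    for (uint32_t lba = start_lba; lba < start_lba + count && lba < DEV_BLOCKS; lba++)
        on_free_list[lba] = true;         /* formerly-in-use block becomes reusable */
}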

See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more 
about "smart" vs "dumb" SSD controllers.

From ZFS's standpoint, the optimal configuration would be for the SSD to inform 
ZFS of its PAGE size, and ZFS would use this as the fundamental BLOCK size 
for that device (i.e. all writes are in integer multiples of the SSD page 
size).  Reads could be in smaller sections, though.  Which would be 
interesting:  ZFS would write in Page Size increments, and read in Block Size 
amounts.
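
To make that concrete, a minimal sketch (illustrative names only, not real ZFS or device interfaces) of what "write in page-size increments, read in block-size amounts" would look like:

#include <stdint.h>

static uint32_t round_up(uint32_t len, uint32_t unit)
{
    return ((len + unit - 1) / unit) * unit;
}

/* Writes: always an integer multiple of the device-reported page size. */
uint32_t write_granule(uint32_t len, uint32_t ssd_page_size)
{
    return round_up(len, ssd_page_size);   /* e.g. 100 KB -> 128 KB with 128 KB pages */
}

/* Reads: may still be issued in smaller, block-sized sections. */
uint32_t read_granule(uint32_t len, uint32_t ssd_block_size)
{
    return round_up(len, ssd_block_size);  /* e.g. 100 KB stays 100 KB with 4 KB blocks */
}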
Well, this could be useful if updates are larger than the block size, for 
example 512 K, as it is then possible to erase and rewrite without having to 
copy around other data from the page. If updates are smaller, zfs will have to 
reclaim erased space by itself, which, if I am not mistaken, it cannot do today 
(but probably will at some point in the future; I guess BP Rewrite is what is needed).
Sure, it does that today. What do you think happens on a standard COW action?  Let's be 
clear here:  I'm talking about exactly the same thing that currently happens when you 
modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ).   The entire 
ZFS block is read from disk/L2ARC, the modifications made, then it is written back to 
storage, likely at another LBA. The original ZFS block location ON THE VDEV is now 
available for re-use (i.e. the vdev adds it to its Free Block List).   This is one of 
the things that leads to ZFS's fragmentation issues (note, we're talking about block 
fragmentation on the vdev, not ZFS block fragmentation), and something we're looking 
to BP rewrite for, to enable defragging to be implemented.
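
As a toy model of that COW step (illustrative only - this is not ZFS code, and the names are made up): the modified block is written to a freshly allocated vdev offset, and the old extent goes back on the vdev's free list, which is exactly where the free-space fragmentation comes from.

#include <stdint.h>
#include <string.h>

struct extent { uint64_t offset; uint64_t len; };

struct vdev {
    uint8_t      *media;          /* pretend backing store */
    struct extent free_list[64];  /* the vdev's Free Block List */
    int           nfree;
    uint64_t      next_alloc;     /* bump allocator, good enough for a sketch */
};

/* Rewrite one ZFS block: allocate a new location, write the whole block
 * there, and return the old location to the free list. */
uint64_t cow_rewrite(struct vdev *vd, struct extent old, const void *block)
{
    uint64_t new_off = vd->next_alloc;        /* "likely another LBA" */
    vd->next_alloc += old.len;
    memcpy(vd->media + new_off, block, old.len);
    if (vd->nfree < 64)
        vd->free_list[vd->nfree++] = old;     /* old extent is free again */
    return new_off;                           /* the block pointer now points here */
}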

What I am talking about is to be able to reuse the free space
you get in the previously written data when you write modified
data to new places on the disk, or just remove a file for that
matter. To be able to reclaim that space with flash, you have
to erase large pages (for example 512 KB), but before you erase,
you will also have to save away all still valid data in that
page and rewrite that to a free page. What I am saying is that
I am not sure that this would be best done in the file system,
since it could be quite a bit of data to shuffle around, and
there could possibly be hardware specific optimizations that
could be done here that zfs wouldn't know about. A good flash
controller could probably do it much better. (And a bad one
worse, of course.)
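
To illustrate the kind of shuffling I mean, a toy sketch (sizes and layout purely illustrative, not any particular controller): before a 512 KB erase page can be reclaimed, every still-valid 4 KB block in it has to be copied out to a free page first.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLOCK_SIZE      4096
#define BLOCKS_PER_PAGE 128            /* 128 * 4 KB = 512 KB erase page */

struct erase_page {
    uint8_t data[BLOCKS_PER_PAGE][BLOCK_SIZE];
    bool    valid[BLOCKS_PER_PAGE];    /* which blocks still hold live data */
};

/* Copy live blocks from 'victim' into 'dest' (assumed to have room starting
 * at index 'dest_used'), then erase 'victim'.  Returns how many blocks had
 * to be shuffled around just to reclaim the page. */
int reclaim_page(struct erase_page *victim, struct erase_page *dest, int dest_used)
{
    int moved = 0;
    for (int i = 0; i < BLOCKS_PER_PAGE; i++) {
        if (victim->valid[i]) {
            memcpy(dest->data[dest_used + moved], victim->data[i], BLOCK_SIZE);
            dest->valid[dest_used + moved] = true;
            moved++;
        }
    }
    memset(victim, 0, sizeof(*victim));   /* the actual erase */
    return moved;
}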
You certainly DO get to reuse the free space again. Here's what happens nowadays in an SSD:

Let's say I have 4k blocks, grouped into a 128k page. That is, the SSD's fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page.

So, I write a bit of data 100k in size. This occupies the first 25 blocks of one page. The remaining 7 blocks are still on the SSD's Free List (i.e. list of free space).

Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1-byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: a 1-byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into its local cache on the controller, apply the modification and the append, then write out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never-written-to) space on the drive, this 28-block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller).

For filesystems like ZFS, this is a whole lot of extra work being done that doesn't need to happen (and it chews up valuable IOPS and time). For, when ZFS does a write, it doesn't merely twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block & page sizes, which would then allow ZFS to better align its own variable block size to optimally coincide with the SSD's implementation.
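
The arithmetic of that example, spelled out (toy numbers from the description above, not from any particular drive):

#include <stdio.h>

int main(void)
{
    const int block_kb = 4, page_kb = 128;
    const int blocks_per_page = page_kb / block_kb;         /* 32 */

    int file_kb    = 100;
    int used       = (file_kb + block_kb - 1) / block_kb;   /* 25 blocks occupied */
    int still_free = blocks_per_page - used;                /* 7 blocks left in the page */

    int new_kb     = file_kb + 10;                          /* 1-byte edit plus 10 KB append */
    int rewritten  = (new_kb + block_kb - 1) / block_kb;    /* 28 blocks to write back out */

    printf("blocks/page=%d used=%d free=%d rewritten=%d\n",
           blocks_per_page, used, still_free, rewritten);
    /* The SSD-side R-M-W: read 25 blocks into controller cache, modify,
     * write 28 blocks to a fresh page, put the old 25 on the free list. */
    return 0;
}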


And as far as I know, zfs cannot do that today - it cannot
move around already-written data, not for defragmentation, not
for adding or removing disks to stripes/raidzs, not for
deduping/duping and so on, and I have understood that
BP Rewrite could solve a lot of this.
ZFS's propensity to fragmentation doesn't mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device's Free List. Now, in an SSD's case, this isn't a worry. Due to the completely even performance characteristics of NAND, it doesn't make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD. Access time is identical, and so is read time. SSDs don't care about this kind of fragmentation.

What SSDs have to worry about is sub-page fragmentation. Which brings us back to the whole R-M-W mess.


Still, it could certainly be useful if zfs could try to use a
blocksize that matches the SSD erase page size - this could
avoid having to copy and compact data before erasing, which
could speed up writes in a typical flash SSD disk.

In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let's remember that a ZFS block is at most 8x the size of an SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, whereas an R-M-W essentially requires 3 IOPS. If you want the SSD controller 
to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the modification 
itself, then issue the write - and, remember, ZFS has likely already issued a full 
ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change 
this 1 bit and leave everything else on disk where it is"), so you likely don't save 
on the number of pages that need to be written in any case.
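
As a back-of-the-envelope comparison (with made-up but representative sizes: a 128 KB ZFS block over 16 KB SSD pages, so at most 8 pages per block):

#include <stdio.h>

int main(void)
{
    const int zfs_block_kb = 128, ssd_page_kb = 16;
    const int pages_per_block = zfs_block_kb / ssd_page_kb;  /* at most 8 */

    /* Host-composited write: ZFS builds the whole new block in RAM/L2ARC and
     * issues a single write op; the drive programs the pages in parallel. */
    int host_side_ops = 1;

    /* Drive-side R-M-W: read the old data, modify in the controller's cache,
     * write the result - roughly three operations instead of one. */
    int drive_side_ops = 3;

    printf("pages per ZFS block: %d, host write: %d op, drive R-M-W: %d ops\n",
           pages_per_block, host_side_ops, drive_side_ops);
    return 0;
}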

I don't think many SSDs do R-M-W, but rather just append blocks
to free pages (pretty much as zfs works, if you will). They also
have to do some space reclamation (copying/compacting blocks and
erasing pages) in the background, of course.

MLC-based SSDs all do R-M-W. Now, they might not do Read-Modify-Erase-Write right away, but they'll do R-M-W on ANY write that modifies existing data (unless you are extremely lucky and your data exactly fills an existing page); the difference is that the final W goes to previously-unused NAND page(s). However, when the SSD runs out of never-used space, it starts having to add the E step on future writes.

So far as I know, no SSD does space reclamation in the manner you refer to. That is, the SSD controller isn't going to be moving data around on its own, with the exception of wear-leveling. TRIM is there so that the SSD can add things to its internal Free List more efficiently, but an SSD isn't going to say (on its own): "Ooh, page 1004 has only 5 of 10 blocks used, so why don't we merge it with page 20054, which has only 3 of 10 blocks used."
I am still not entirely convinced that it would be better to let the file 
system take care of that instead of a flash controller; there could be quite a 
lot of reading and writing going on for space reclamation (depending on the 
workload, of course).
/ragge
The point here is that regardless of the workload, there's a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher-performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it can almost certainly do so far faster than any little SSD controller can.

Well, inside the flash system you could possibly be in a much
better position to shuffle data around for space reclamation -
that is, copying and compacting data and erasing flash pages.
If the device has a good design, that is! If the SSD controller
is some small slow sad thing it might be better to shuffle it up
and down to the host and do it in the CPU, but I am not sure
about that either, since it typically is the very same slow
controller that does the host communication.
It's actually far more likely that a dumb SSD controller can handle high levels of pure data transfer faster than a smart SSD controller can actually manipulate that same data quickly. SSD controllers, by their very nature, need to be as small and cheap as possible, which means they have extremely limited computation ability. For a given compute-level controller, one which is only "dumb" has to worry about four things: wear leveling, bad block remapping, LBA->physical block mapping, and actual I/O transfer (i.e. managing data flow from the host to the NAND chips). A smart controller also has to worry about page alignment, page modification and rewriting, potentially RAID-like checksumming/parity, page/block fragmentation, and other things. So, if the compute amount is fixed, a dumb controller is going to be able to handle a /whole/ lot more I/O transfer than a smart controller. Which means, for the same level of I/O transfer, a dumb controller costs less than a smart controller.
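
A sketch of the "dumb" controller's bookkeeping listed above (entirely illustrative - no real firmware looks like this): wear counts, a bad-block map, and an LBA->physical map, with alignment, R-M-W, parity, and compaction all left to the host.

#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS 1024

struct dumb_ctrl {
    uint32_t lba_to_phys[NUM_BLOCKS];   /* LBA -> physical block map */
    uint32_t erase_count[NUM_BLOCKS];   /* wear-leveling bookkeeping */
    bool     bad[NUM_BLOCKS];           /* bad-block remapping input */
};

/* Pick the least-worn good physical block as the target for the next write
 * of 'lba'.  Assumes at least one good block exists. */
uint32_t remap_for_write(struct dumb_ctrl *c, uint32_t lba)
{
    uint32_t best = 0;
    while (c->bad[best])                /* skip over leading bad blocks */
        best++;
    for (uint32_t p = best + 1; p < NUM_BLOCKS; p++)
        if (!c->bad[p] && c->erase_count[p] < c->erase_count[best])
            best = p;
    c->lba_to_phys[lba] = best;
    c->erase_count[best]++;
    return best;                        /* everything else is just moving data */
}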


I certainly agree that there seems to be some redundancy when
the flash SSD controller does logging-file-system kind of work
under zfs, which does pretty much the same thing by itself, and it
could possibly be better to cut one of them (and not zfs).
I am still not convinced that it wouldn't be better to do this
in a good controller instead, just for speed and to take advantage
of new hardware that does this smarter than the devices of today.

Do you know how the F5100 works, for example?

/ragge
The point I'm making here is that the filesystem/OS can make all the same decisions that a good SSD controller can make, faster (as it has most of the data in local RAM or registers already), and with a global system viewpoint that the SSD simply can't have. Most importantly, it's essentially free for the OS to do so - it has the spare cycles and bandwidth. Putting this intelligence on the SSD costs money that is essentially wasted, not to mention being less efficient overall.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

