On 2 Jan 2010, at 22:49, Erik Trimble wrote:

> Ragnar Sundblad wrote:
>> On 2 Jan 2010, at 13:10, Erik Trimble wrote:
>>> Joerg Schilling wrote:
>>>    The TRIM command is what is intended for an OS to notify the SSD as to 
>>> which blocks are deleted/erased, so that the SSD's internal free list can 
>>> be updated (that is, it allows formerly-in-use blocks to be moved to the 
>>> free list).  This is necessary since only the OS has the information to 
>>> determine which previously-written-to blocks are actually no longer in use.
>>> 
>>> See the parallel discussion here titled "preview of new SSD based on 
>>> SandForce controller" for more about "smart" vs "dumb" SSD controllers.
>>> 
>>> From ZFS's standpoint, the optimal configuration would be for the SSD to 
>>> inform ZFS as to its PAGE size, and ZFS would use this as the fundamental 
>>> BLOCK size for that device (i.e. all writes are in integer multiples of the 
>>> SSD page size).  Reads could be in smaller sections, though.  Which would 
>>> be interesting:  ZFS would write in Page Size increments, and read in Block 
>>> Size amounts.
>>>    
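
(Aside: the host-side discard hint itself is conceptually tiny. A minimal
sketch in C, assuming a Linux host whose kernel and device support the
BLKDISCARD ioctl; the device path and offsets are made up:

  /* Hint the device that a byte range no longer holds live data,
   * so its controller can move those blocks to the free list. */
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  int discard_range(const char *dev, uint64_t off, uint64_t len)
  {
      int fd = open(dev, O_WRONLY);       /* e.g. "/dev/sdb" - made up */
      if (fd < 0)
          return -1;
      uint64_t range[2] = { off, len };   /* byte offset, byte length */
      int rc = ioctl(fd, BLKDISCARD, &range);
      close(fd);
      return rc;
  }

A file system would issue the equivalent per freed extent from kernel
context, not from userland like this.)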
>> 
>> Well, this could be useful if updates are at least as large as the erase 
>> page size, for example 512 KB, as it is then possible to erase and rewrite 
>> without having to copy around other data from the page. If updates are 
>> smaller, ZFS will have to reclaim erased space by itself, which, if I am not 
>> mistaken, it cannot do today (but probably will at some point; I guess BP 
>> rewrite is what is needed).
>>  
> Sure, it does that today. What do you think happens on a standard COW action? 
>   Let's be clear here:  I'm talking about exactly the same thing that 
> currently happens when you modify a ZFS "block" that spans multiple disks 
> (say, in a RAIDZ).   The entire ZFS block is read from disk/L2ARC, the 
> modifications made, then it is written back to storage, likely at another 
> LBA. The original ZFS block location ON THE VDEV is now available for re-use 
> (i.e. the vdev adds it to its Free Block List).   This is one of the things 
> that leads to ZFS's fragmentation issues (note, we're talking about block 
> fragmentation on the vdev, not ZFS block fragmentation), and something that 
> we're looking to BP rewrite to enable defragging to be implemented.
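
(Concretely, with made-up numbers: modify 4 KB inside a 128 KB ZFS
record, and the whole 128 KB record is read, patched in memory, and
written out to a fresh location; the old 128 KB of vdev space then
goes back on the free list.)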

What I am talking about is being able to reuse the free space
that appears in previously written data when you write modified
data to new places on the disk, or simply remove a file for that
matter. To reclaim that space on flash, you have to erase large
pages (for example 512 KB), but before you erase, you also have
to save away all still-valid data in that page and rewrite it to
a free page. What I am saying is that I am not sure this would
best be done in the file system, since it could be quite a bit
of data to shuffle around, and there could be hardware-specific
optimizations possible here that ZFS wouldn't know about. A good
flash controller could probably do it much better. (And a bad
one worse, of course.)
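
Roughly the following, as a minimal sketch of that reclamation (all
sizes, names and the layout here are invented; a real controller is
far more involved):

  /* Sketch: reclaiming a 512 KB flash erase block.  Still-valid
   * 4 KB pages must be copied out to a fresh block before erasing. */
  #include <string.h>

  #define PAGE_SIZE    4096
  #define PAGES_PER_EB 128                /* 128 * 4 KB = 512 KB */

  struct erase_block {
      unsigned char page[PAGES_PER_EB][PAGE_SIZE];
      int valid[PAGES_PER_EB];            /* 1 = still holds live data */
  };

  /* Hypothetical whole-block erase: erased NAND reads back as all 1s. */
  static void flash_erase(struct erase_block *eb)
  {
      memset(eb->page, 0xFF, sizeof eb->page);
      memset(eb->valid, 0, sizeof eb->valid);
  }

  void reclaim(struct erase_block *victim,
               struct erase_block *fresh, int *fresh_next)
  {
      for (int i = 0; i < PAGES_PER_EB; i++) {
          if (victim->valid[i]) {         /* save away still-valid data */
              memcpy(fresh->page[*fresh_next], victim->page[i], PAGE_SIZE);
              fresh->valid[(*fresh_next)++] = 1;
          }
      }
      flash_erase(victim);                /* whole block is free again */
  }

The data shuffled is whatever was still valid in the victim block,
which is exactly the traffic I would rather keep inside the device.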

And as far as I know, ZFS cannot do that today: it cannot move
already-written data around, not for defragmentation, not for
adding disks to or removing disks from stripes/RAIDZ vdevs, not
for deduplicating (or un-deduplicating) data, and so on. I have
understood that BP rewrite could solve a lot of this.

Still, it could certainly be useful if ZFS could try to use a
block size that matches the SSD erase page size; this could
avoid having to copy and compact data before erasing, which
could speed up writes on a typical flash SSD.
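
For example (numbers invented): with a 512 KB erase page and a 128 KB
record size, four whole records fill one erase page exactly, so once
all four have been freed the page can be erased with nothing to copy
out first. With 4 KB records, 128 of them share each erase page, and
the odds that all 128 die together are much smaller.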

> In fact, I would argue that the biggest advantage of removing any advanced 
> intelligence from the SSD controller is with small modifications to existing 
> files.  By using the L2ARC (and other features, like compression, encryption, 
> and dedup), ZFS can composite the needed changes with an existing cached copy 
> of the ZFS block(s) to be changed, then issue a full new block write to the 
> SSD.  This eliminates the need for the SSD to do the dreaded 
> Read-Modify-Write cycle; instead, it does just a Write.  In this scenario, the 
> ZFS block is likely larger than the SSD Page size, so more data will need to 
> be written; however, given the highly parallel nature of SSDs, writing 
> several SSD pages simultaneously is easy (and fast);  let's remember that a 
> ZFS block is a maximum of only 8x the size of an SSD page, and writing 8 
> pages is only slightly more work than writing 1 page.  This larger write is 
> all a single IOP, where an R-M-W essentially requires 3 IOPS.  If you want 
> the SSD controller to do the work, then it ALWAYS has to read the 
> to-be-modified page from NAND, do the mod itself, then issue the write - and, 
> remember, ZFS likely has already issued a full ZFS-block write (due to the 
> COW nature of ZFS, there is no concept of "just change this 1 bit and leave 
> everything else on disk where it is"), so you likely don't save on the 
> number of pages that need to be written in any case.

I don't think many SSDs do R-M-W; rather, they just append blocks
to free pages (pretty much as ZFS itself works, if you will). They
also have to do some space reclamation (copying/compacting blocks
and erasing pages) in the background, of course.
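
What I have in mind is an append-style translation layer, roughly
like this minimal sketch (every name and size is invented; a real FTL
also handles log wrap-around, wear levelling and the reclamation
sketched above):

  /* Sketch: log/append FTL write path.  An overwrite becomes one page
   * program plus a map update - no read-modify-write; the old physical
   * page is only marked stale for later background reclamation. */
  #include <string.h>

  #define NPAGES 1024

  static int l2p[NPAGES];                  /* logical block -> phys page */
  static int stale[NPAGES];                /* phys pages awaiting erase  */
  static int next_free;                    /* head of the append log     */
  static unsigned char nand[NPAGES][4096];

  void ftl_init(void)
  {
      for (int i = 0; i < NPAGES; i++)
          l2p[i] = -1;                     /* nothing mapped yet */
  }

  void ftl_write(int lblock, const unsigned char *buf)
  {
      int old = l2p[lblock];
      memcpy(nand[next_free], buf, 4096);  /* program one fresh page */
      l2p[lblock] = next_free++;           /* remap the logical block */
      if (old >= 0)
          stale[old] = 1;                  /* old copy reclaimed later */
  }

Counted in flash operations, an overwrite is then one page program
instead of the read + merge + program of an in-place scheme, which is
why I doubt many controllers bother with R-M-W at all.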

>> I am still not entirely convinced that it would be better to let the file 
>> system take care of that instead of a flash controller; there could be quite 
>> a lot of reading and writing going on for space reclamation (depending on 
>> the workload, of course).
>> 
>> /ragge
> The point here is that regardless of the workload, there's an R-M-W cycle that 
> has to happen, whether that occurs at the ZFS level or at the SSD level.  My 
> argument is that the OS has a far better view of the whole data picture, and 
> access to much higher-performing caches (i.e. RAM/registers) than the SSD, so 
> not only can the OS make far better decisions about the data and how (and how 
> much of) it should be stored, but it is almost certainly able to do so far 
> faster than any little SSD controller can. 

Well, inside the flash system you could possibly be in a much
better position to shuffle data around for space reclamation -
that is, copying and compacting data and erasing flash pages.
If the device has a good design, that is! If the SSD controller
is some small, slow, sad thing, it might be better to shuffle
the data up and down to the host and do it in the CPU, but I am
not sure about that either, since it is typically the very same
slow controller that does the host communication.

I certainly agree that there seems to be some redundancy when
the flash SSD controller does logging-file-system kind of work
underneath ZFS, which does pretty much the same thing by itself,
and it could possibly be better to cut one of them (and not ZFS).
But I am still not convinced that it wouldn't be better to do
this in a good controller instead, both for speed and to take
advantage of new hardware that does this smarter than the
devices of today.

Do you know how the F5100 works, for example?

/ragge
