On 3 jan 2010, at 04.19, Erik Trimble wrote:

> Ragnar Sundblad wrote:
>> On 2 jan 2010, at 22.49, Erik Trimble wrote:
>>
>>> Ragnar Sundblad wrote:
>>>
>>>> On 2 jan 2010, at 13.10, Erik Trimble wrote:
>>>>
>>>>> Joerg Schilling wrote:
>>>>> the TRIM command is what is intended for an OS to notify the SSD as to which blocks are deleted/erased, so the SSD's internal free list can be updated (that is, it allows formerly-in-use blocks to be moved to the free list). This is necessary since only the OS has the information to determine which previously-written-to blocks are actually no longer in use.
>>>>>
>>>>> See the parallel discussion here titled "preview of new SSD based on SandForce controller" for more about "smart" vs "dumb" SSD controllers.
>>>>>
>>>>> From ZFS's standpoint, the optimal configuration would be for the SSD to inform ZFS of its PAGE size, and ZFS would use this as the fundamental BLOCK size for that device (i.e. all writes are in integer multiples of the SSD page size). Reads could be in smaller sections, though. Which would be interesting: ZFS would write in page-size increments, and read in block-size amounts.
>>>>
>>>> Well, this could be useful if updates are larger than the block size, for example 512 K, as it is then possible to erase and rewrite without having to copy around other data from the page. If updates are smaller, zfs will have to reclaim erased space by itself, which if I am not mistaken it cannot do today (but probably will at some point in the future; I guess BP rewrite is what is needed).
>>>
>>> Sure, it does that today. What do you think happens on a standard COW action? Let's be clear here: I'm talking about exactly the same thing that currently happens when you modify a ZFS "block" that spans multiple vdevs (say, in a RAIDZ). The entire ZFS block is read from disk/L2ARC, the modifications are made, then it is written back to storage, likely at another LBA. The original ZFS block location ON THE VDEV is now available for re-use (i.e. the vdev adds it to its Free Block List). This is one of the things that leads to ZFS's fragmentation issues (note, we're talking about block fragmentation on the vdev, not ZFS block fragmentation), and something that we're looking to BP rewrite to enable defragging to be implemented.
>>
>> What I am talking about is being able to reuse the free space you get in the previously written data when you write modified data to new places on the disk, or just remove a file for that matter. To be able to reclaim that space with flash, you have to erase large pages (for example 512 KB), but before you erase, you will also have to save away all still-valid data in that page and rewrite it to a free page. What I am saying is that I am not sure that this would be best done in the file system, since it could be quite a bit of data to shuffle around, and there could possibly be hardware-specific optimizations that could be done here that zfs wouldn't know about. A good flash controller could probably do it much better. (And a bad one worse, of course.)
>
> You certainly DO get to reuse the free space again. Here's what happens nowadays in an SSD:
>
> Let's say I have 4k blocks, grouped into a 128k page. That is, the SSD's fundamental minimum unit size is 4k, but the minimum WRITE size is 128k. Thus, 32 blocks in a page.
Do you know of SSD disks that have a minimum write size of 128 KB? I don't understand why they would be designed that way. A typical flash chip has pretty small write block sizes, like 2 KB or so, but it can only erase in pages of 128 KB or so. (And then you run a few of those in parallel to get some speed, so these numbers often multiply with the number of parallel chips, like 4 or 8 or so.) Typically, you have to write the 2 KB blocks consecutively within a page. Pretty much all set up for an append-style system. :-)

In addition, flash SSDs typically have some DRAM write buffer that buffers up writes (like a txg, if you will), so small writes should not be a problem - just collect a few and append!

> So, I write a bit of data 100k in size. This occupies the first 25 blocks in the one page. The remaining 7 blocks are still on the SSD's Free List (i.e. list of free space).
>
> Now, I want to change the last byte of the file, and add 10k more to the file. Currently, a non-COW filesystem will simply send the 1 byte modification request and the 10k addition to the SSD (all as one unit, if you are lucky - if not, it comes as two ops: a 1 byte modification followed by a 10k append). The SSD now has to read all 25 blocks of the page back into its local cache on the controller, do the modification and append computing, then write out 28 blocks to NAND. In all likelihood, if there is any extra pre-erased (or never written to) space on the drive, this 28 block write will go to a whole new page. The blocks in the original page will be moved over to the SSD Free List (and may or may not be actually erased, depending on the controller).

Do you know for sure that you have SSD flash disks that work this way? It seems incredibly stupid. It would also use up the available erase cycles much faster than necessary. What write speed do you get?
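Just to put a rough number on the erase-cycle point - a back-of-the-envelope sketch in Python, using the 4 KB block / 128 KB page geometry from your example (the model is my own simplification, not a description of any particular drive):

    # Write amplification of whole-page R-M-W vs. an append-style controller.
    # Geometry is the example one from this thread, not a datasheet value.
    BLOCK = 4 * 1024          # smallest host-visible unit
    PAGE = 128 * 1024         # smallest erase (and, in your model, write) unit

    def rmw_bytes_programmed(update_bytes):
        # Whole-page R-M-W: every update reprograms at least one full page.
        pages = -(-update_bytes // PAGE)       # ceiling division
        return pages * PAGE

    def append_bytes_programmed(update_bytes):
        # Append-style: only the new blocks are programmed; erases happen
        # later, amortized over whole pages of mostly dead data.
        blocks = -(-update_bytes // BLOCK)
        return blocks * BLOCK

    update = 4 * 1024                                   # one small 4 KB write
    print(rmw_bytes_programmed(update) // update)       # 32 - a 32x amplification
    print(append_bytes_programmed(update) // update)    # 1 (ignoring later GC copies)

A factor of ~32 in programmed bytes (and, sooner or later, in erases) per small write is what would eat the erase budget; an append-style controller only pays the garbage-collection copies on top of the 1x.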
> For filesystems like ZFS, this is a whole lot of extra work being done that doesn't need to happen (and it chews up valuable IOPS and time). For, when ZFS does a write, it doesn't merely twiddle the modified/appended bits - instead, it creates a whole new ZFS block to write. In essence, ZFS has already done all the work that the SSD controller is planning on doing. So why duplicate the effort? SSDs should simply notify ZFS about their block & page sizes, which would then allow ZFS to better align its own variable block size to optimally coincide with the SSD's implementation.
>
>> And as far as I know, zfs can not do that today - it can not move around already written data, not for defragmentation, not for adding or removing disks to stripes/raidz:s, not for deduping/duping and so on, and I have understood that BP rewrite could solve a lot of this.
>
> ZFS's propensity to fragmentation doesn't mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device's Free List.
>
> Now, in SSD's case, this isn't a worry. Due to the completely even performance characteristics of NAND, it doesn't make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD.

Yes, there is something to worry about, as you can only erase flash in large pages - you can not erase only where the free data blocks in the Free List happen to be.

> Access time is identical, and so is read time. SSDs don't care about this kind of fragmentation.
>
> What SSDs have to worry about is sub-page fragmentation. Which brings us back to the whole R-M-W mess.

Yes, which is why R-M-W of entire pages for every change is a really bad implementation of a flash SSD.

>> Still, it could certainly be useful if zfs could try to use a blocksize that matches the SSD erase page size - this could avoid having to copy and compact data before erasing, which could speed up writes in a typical flash SSD disk.
>
>>> In fact, I would argue that the biggest advantage of removing any advanced intelligence from the SSD controller is with small modifications to existing files. By using the L2ARC (and other features, like compression, encryption, and dedup), ZFS can composite the needed changes with an existing cached copy of the ZFS block(s) to be changed, then issue a full new block write to the SSD. This eliminates the need for the SSD to do the dreaded Read-Modify-Write cycle, and instead do just a Write. In this scenario, the ZFS block is likely larger than the SSD page size, so more data will need to be written; however, given the highly parallel nature of SSDs, writing several SSD pages simultaneously is easy (and fast); let's remember that a ZFS block is a maximum of only 8x the size of an SSD page, and writing 8 pages is only slightly more work than writing 1 page. This larger write is all a single IOP, where a R-M-W essentially requires 3 IOPS. If you want the SSD controller to do the work, then it ALWAYS has to read the to-be-modified page from NAND, do the mod itself, then issue the write - and, remember, ZFS has likely already issued a full ZFS-block write (due to the COW nature of ZFS, there is no concept of "just change this 1 bit and leave everything else on disk where it is"), so you likely don't save on the number of pages that need to be written in any case.
>
>> I don't think many SSDs do R-M-W, but rather just append blocks to free pages (pretty much as zfs works, if you will). They also have to do some space reclamation (copying/compacting blocks and erasing pages) in the background, of course.
>
> MLC-based SSDs all do R-M-W. Now, they might not do Read-Modify-Erase-Write right away, but they'll do R-M-W on ANY write which modifies existing data (unless you are extremely lucky and your data exactly fills an existing page); the difference is that the final W is to previously-unused NAND page(s). However, when the SSD runs out of never-used space, it starts to have to add the E step on future writes.
>
> So far as I know, no SSD does space reclamation in the manner you refer to. That is, the SSD controller isn't going to be moving data around on its own, with the exception of wear-leveling. TRIM is there so that the SSD can add stuff to its internal Free List more efficiently, but an SSD isn't going to (on its own) say: "Ooh, page 1004 has only 5 of 10 blocks used, so why don't we merge it with page 20054, which has only 3 of 10 blocks used."

(I don't think they typically merge pages; I believe they rather just pick pages with some freed blocks, copy the active blocks to the "end" of the disk, and erase the page.)
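Something like this sketch, i.e. greedy reclamation in an append-style translation layer - this is just my guess at the general idea, not any particular vendor's algorithm:

    # Toy model of space reclamation in an append-style FTL (my assumption of
    # how it is commonly done, not a description of a real controller).
    class Page:
        def __init__(self, live_blocks):
            self.live_blocks = set(live_blocks)   # block ids still mapped to this page
        def erase(self):
            self.live_blocks.clear()              # whole-page erase; page is reusable

    def reclaim_one_page(pages, append_block):
        # Pick the page with the fewest still-valid blocks, copy those to the
        # "end" of the log, then erase the whole page in one go.
        victim = min(pages, key=lambda p: len(p.live_blocks))
        for b in sorted(victim.live_blocks):
            append_block(b)
        victim.erase()
        return victim

    # Example: the page with a single live block is reclaimed first.
    pages = [Page({1, 2, 3}), Page({7}), Page({4, 5})]
    moved = []
    reclaim_one_page(pages, moved.append)
    print(moved)    # [7] - one block copied, one whole page freed up for erasing

The fewer live blocks the victim page still holds, the cheaper the reclaim.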
Well, the algorithms are often trade secrets, and if what you say is correct, and it was my product, then I wouldn't even want to tell anyone about it, since it would be a horrible waste of both bandwidth and erase cycles. Using up the 10000 erase cycles of an MLC device 64 times faster than necessary seems like an extremely bad idea. But there sure is a lot of crap out there, so I can't say you are wrong (only hope :-). I doubt, for example, that the F5100 works that way; it would be hard to get ~15000 4 KB w/s per "flash-SODIMM" if it behaved like that (you can typically erase only 500-1000 pages a second, for example). I doubt the Intel X25 works that way, as its read performance suffers if it is written with smaller blocks and gets internally fragmented - that problem could not exist if it always filled complete new pages in an R-M-W manner.

>>>> I am still not entirely convinced that it would be better to let the file system take care of that instead of a flash controller; there could be quite a lot of reading and writing going on for space reclamation (depending on the work load, of course).
>>>>
>>>> /ragge
>>>
>>> The point here is that regardless of the workload, there's a R-M-W cycle that has to happen, whether that occurs at the ZFS level or at the SSD level. My argument is that the OS has a far better view of the whole data picture, and access to much higher performing caches (i.e. RAM/registers) than the SSD, so not only can the OS make far better decisions about the data and how (and how much of) it should be stored, but it's almost certainly able to do so far faster than any little SSD controller can.
>>
>> Well, inside the flash system you could possibly have a much better situation for shuffling data around for space reclamation - that is, copying and compacting data and erasing flash pages. If the device has a good design, that is! If the SSD controller is some small slow sad thing it might be better to shuffle it up and down to the host and do it in the CPU, but I am not sure about that either, since it typically is the very same slow controller that does the host communication.
>
> It's actually far more likely that a dumb SSD controller can handle high levels of pure data transfer faster than a smart SSD controller can actually manipulate that same data quickly. SSD controllers, by their very nature, need to be as small and cheap as possible, which means they have extremely limited computation ability. For a controller at a given compute level, one which is only "dumb" has to worry about 4 things: wear leveling, bad block remapping, LBA->physical block mapping, and actual I/O transfer (i.e. managing data flow from the host to the NAND chips). A smart controller also has to worry about page alignment, page modification and rewriting, potentially RAID-like checksumming/parity, page/block fragmentation, and other things. So, if the compute amount is fixed, a dumb controller is going to be able to handle a /whole/ lot more I/O transfer than a smart controller. Which means, for the same level of I/O transfer, a dumb controller costs less than a smart controller.

I am not convinced the compute amount needs to be fixed, or even that controllers by their nature need to be as cheap as possible - not if that hurts performance. People are obviously willing to pay quite a lot to get high-performance disk systems. The best flash SSDs out there are quite expensive. In addition, the number of transistors per area (and monetary unit) tends to increase with time (that Intel guy had some saying about that... :-).
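The back-of-the-envelope behind my F5100 doubt above, by the way - the figures are the rough ones I quoted, not datasheet numbers:

    # Why whole-page R-M-W can't explain ~15000 random 4 KB writes/s per module.
    erase_rate = 1000            # pages/s; being generous, 500-1000 is typical
    observed_writes = 15000      # ~4 KB writes/s per "flash-SODIMM", roughly

    # If every small write sooner or later forces its own page erase, the
    # sustained random write rate is capped near the erase rate:
    print(observed_writes / erase_rate)   # ~15x more than that model allows

    # An append-style controller needs roughly one erase per page-full of writes:
    blocks_per_page = 128 // 4
    print(erase_rate * blocks_per_page)   # ~32000 4 KB writes/s - plenty of headroom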
>> I certainly agree that there seems to be some redundancy when the flash SSD controller does a logging-file-system kind of work underneath zfs, which does pretty much the same thing by itself, and it could possibly be better to cut one of them (and not zfs). I am still not convinced that it won't be better to do this in a good controller instead, if only for speed and to take advantage of new hardware that does this smarter than the devices of today.
>>
>> Do you know how the F5100 works, for example?
>>
>> /ragge
>
> The point I'm making here is that the filesystem/OS can make all the same decisions that a good SSD controller can make, faster (as it has most of the data in local RAM or registers already), and with a global system viewpoint that the SSD simply can't have. Most importantly, it's essentially free for the OS to do so - it has the spare cycles and bandwidth. Putting this intelligence on the SSD costs money that is essentially wasted, not to mention being less efficient overall.

I have not done the math here, but to me it isn't obvious that the OS has spare cycles and bandwidth to do it, since space reclaiming (compacting and erasing) could potentially draw much more bandwidth than the actual workload, and since people have already had problems with too few spare cycles on the X4500 when they want it to do something more than just being a filer (and I guess that is why there now is an X4550). The filesystem/OS will most probably *not* have most of the data in local RAM when reclaiming space/compacting memory; it will most likely have to read it in to write it out again.

/ragge
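P.S. A quick sketch of the bandwidth point, with a made-up utilization figure:

    # Host-side reclamation cost - the live fraction is an assumption for illustration.
    live_fraction = 0.75         # suppose victim pages are on average still 75% live
    gb_to_free = 1.0

    # To free 1 GB the host must read the live data out of the victim pages and
    # rewrite it elsewhere, and all of that crosses the host<->device bus twice:
    gb_copied = gb_to_free * live_fraction / (1 - live_fraction)
    print(gb_copied)             # 3.0 - i.e. 3 GB read + 3 GB written per 1 GB reclaimed

Done inside the controller, that copy never has to touch the host bus at all.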