Ragnar Sundblad wrote:
On 3 jan 2010, at 04.19, Erik Trimble wrote:
Let's say I have 4k blocks, grouped into a 128k page.  That is, the SSD's 
fundamental minimum unit size is 4k, but the minimum WRITE size is 128k.  Thus, 
32 blocks in a page.
Do you know of SSD disks that have a minimum write size of
128 KB? I don't understand why it would be designed that way.

A typical flash chip has pretty small write block sizes, like
2 KB or so, but they can only erase in pages of 128 KB or so.
(And then you are running a few of those in parallel to get some
speed, so these numbers often multiply with the number of
parallel chips, like 4 or 8 or so.)
Typically, you have to write the 2 KB blocks consecutively
in a page. Pretty much all set up for an append-style system.
:-)

In addition, flash SSDs typically have some DRAM write buffer
that buffers up writes (like a txg, if you will), so small
writes should not be a problem - just collect a few and append!
In MLC-style SSDs, you typically have a block size of 2k or 4k. However, the page size is many multiples of that, with 128k being common but by no means ubiquitous.

I think you're confusing erasing with writing.

When I say "minimum write size", I mean that for an MLC SSD, no matter how small a change you make, the minimum amount of data actually written to the NAND is a full page (128k in my example). There is no "append" down at this level. If I have a 128k page with data in 5 of its 4k blocks, and I then want to add another 2k of data, the controller has to READ all 5 4k blocks into its DRAM, add the 2k of data to them, then write the whole thing out to a new page (if one is available), or wait for an older page to be erased before writing to it. So, in order to do an actual 2k write, the SSD must first read 20k of data, do some compositing, then write 24k to a fresh page. Thus, to change any data inside a single page, the entire contents of that page have to be read, the page modified, then the entire page written back out.
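Here's a quick sketch to make that concrete (illustrative only; it just assumes the 4k-block / 128k-page geometry above, not any particular controller's firmware):

# Sketch of the read-modify-write cycle described above.
# Assumed geometry (illustrative only): 4 KB blocks, 128 KB pages -> 32 blocks per page.
BLOCK = 4 * 1024
PAGE = 128 * 1024
BLOCKS_PER_PAGE = PAGE // BLOCK      # 32

def rmw_cost(valid_blocks, new_bytes):
    """Bytes the controller must move to apply a small append to a partially filled page."""
    read_bytes = valid_blocks * BLOCK                       # pull the existing data into DRAM
    blocks_out = valid_blocks + -(-new_bytes // BLOCK)      # round the new data up to whole blocks
    write_bytes = blocks_out * BLOCK                        # program a fresh (or newly erased) page
    return read_bytes, write_bytes

r, w = rmw_cost(5, 2 * 1024)         # the example above: 5 valid 4k blocks, 2k appended
print(r // 1024, "KB read,", w // 1024, "KB written")       # 20 KB read, 24 KB written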



So, I write a bit of data 100k in size. This occupies the first 25 blocks of
the page. The remaining 7 blocks are still on the SSD's Free List (i.e. the
list of free space).

Now, I want to change the last byte of the file, and add 10k more to the file.  
Currently, a non-COW filesystem will simply send the 1 byte modification 
request and the 10k addition to the SSD (all as one unit, if you are lucky - if 
not, it comes as two ops: 1 byte modification followed by a 10k append).   The 
SSD now has to read all 25 blocks of the page back into its local cache on the 
controller, do the modification and append computation, then write out 28 blocks 
to NAND.  In all likelihood, if there is any extra pre-erased (or never written 
to) space on the drive, this 28 block write will go to a whole new page.  The 
blocks in the original page will be moved over to the SSD Free List (and may or 
may not be actually erased, depending on the controller).

Do you know for sure that you have SSD flash disks that
work this way? It seems incredibly stupid. It would also
use up the available erase cycles much faster than necessary.
What write speed do you get?
What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I'll call "excess-writing".
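To put rough write-amplification numbers on the 100k example quoted above (again a sketch; the assumptions are noted in the comments and aren't measurements of any specific drive):

# Write-amplification arithmetic for the 100k example quoted above.
# Assumptions (illustrative): 4 KB blocks, 128 KB pages, and a controller that
# rewrites every valid block of the page whenever any part of it changes.
BLOCK_KB = 4

file_kb = 100
blocks_before = -(-file_kb // BLOCK_KB)                  # 25 blocks occupied

appended_kb = 10                                         # plus a 1-byte modification, ignored here
blocks_after = -(-(file_kb + appended_kb) // BLOCK_KB)   # 28 blocks written back out

physical_write_kb = blocks_after * BLOCK_KB              # 112 KB actually programmed
print(blocks_before, blocks_after, physical_write_kb / appended_kb)   # 25 28 11.2 (x amplification)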


And as far as I know, zfs can not do that today - it can not
move around already written data, not for defragmentation, not
for adding or removing disks to stripes/raidz:s, not for
deduping/duping and so on, and I have understood that
BP Rewrite could solve a lot of this.
ZFS's propensity to fragmentation doesn't mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device's Free List. Now, in an SSD's case, this isn't a worry. Due to the completely even performance characteristics of NAND, it doesn't make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD.
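Here's a toy model of why that scattering happens (not ZFS's actual allocator, just a sketch of copy-on-write allocation against a free list):

# Toy copy-on-write allocator (not ZFS's allocator) showing why frequently
# modified blocks end up scattered: each rewrite lands in a new physical slot
# and the old slot is simply returned to the free list.
free_list = list(range(64))                       # 64 physical slots, allocated in order
file_map = [free_list.pop(0) for _ in range(8)]   # 8-block file laid out contiguously: 0..7

for _ in range(20):                               # keep rewriting logical block 3
    old = file_map[3]
    file_map[3] = free_list.pop(0)                # COW: the new copy goes to the next free slot
    free_list.append(old)                         # the old slot is freed for later reuse

print(file_map)                                   # -> [0, 1, 2, 27, 4, 5, 6, 7]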

Yes, there is something to worry about, as you can only
erase flash in large pages - you can not erase them only where
the free data blocks in the Free List are.
I'm not sure that SSDs actually _have_ to erase - they may just overwrite whatever is there with new data. But this is implementation dependent, so I can't say how /all/ MLC SSDs behave.

(I don't think they typically merge pages; I believe they rather
just pick pages with some freed blocks, copy the active blocks
to the "end" of the disk, and erase the page.)

Well, the algorithms are often trade secrets, and if what you say
is correct, and it was my product, then I wouldn't even want to
tell anyone about it, since it would be a horrible waste of both
bandwidth and erase cycles. Using up the 10000 erase cycles of
an MLC device 64 times faster than necessary seems like an
extremely bad idea. But there sure is a lot of crap out there,
I can't say you are wrong (only hope :-).

I doubt for example the F5100 works that way, it would be hard to
get ~15000 4KB w/s per "flash-SODIMM" if it behaved like that
(you typically can erase only 500-1000 pages a second, for
example).
I doubt the Intel X25 works that way, as their read performance
suffers if they are written with smaller blocks and get internally
fragmented - that problem could not exist if they always filled
complete new pages in a R-M-W manner.
Once again, what I'm talking about is a characteristic of MLC SSDs, which are used in most consumer SSDs (the Intel X25-M included). Sure, such an SSD will commit any new writes to pages drawn from the list of "never before used" NAND. However, at some point, this list becomes empty. In most current MLC SSDs, there's about 10% "extra" (a 60GB advertised capacity is actually ~54GB usable, with 6-8GB "extra"). Once this list is empty, the SSD has to start writing back to previously used pages, which may require an erase step before any write. Which is why MLC SSDs slow down drastically once they've been filled to capacity several times.
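Rough numbers for the spare-area argument (the raw-NAND figure below is an assumption for illustration; real drives vary):

# Spare-area arithmetic for a consumer MLC drive. The 64 GiB raw figure is an
# assumption for illustration, not any particular drive's spec.
GIB = 2**30
advertised_bytes = 60 * 10**9        # "60 GB" on the box, in decimal gigabytes
raw_bytes = 64 * GIB                 # assumed raw NAND complement
spare_bytes = raw_bytes - advertised_bytes

print(round(advertised_bytes / GIB, 1), "GiB usable,", round(spare_bytes / GIB, 1), "GiB spare")
# -> 55.9 GiB usable, 8.1 GiB spare. Once that pool of never-written pages is gone,
# a new write may need an erase first, which is where the post-fill slowdown comes from.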

I am not convinced the compute amount needs to be fixed, or
even that they by their nature need to be as cheap as possible -
if that hurts performance. People are obviously willing to pay
quite a lot to get high perf disk systems. The best flash SSDs
out there are quite expensive. In addition, the number of
transistors per area (and per monetary unit) tends to increase
with time (that Intel guy had some saying about that... :-).
My point there is that if you build a controller for $X, that gets you Y compute ability. With a dumb controller, less of that Y is used up by "housekeeping" functions for the SSD, and thus more is available to manage I/O, than with a smart controller.

Put it another way: for a given throughput X, it will cost less to build a dumb controller than a smart one. And, yes, price is a concern, even at the Enterprise level. Being able to build a dumb controller for 50% (or less) of the cost of a smart controller is likely to get you noticed by your customers. Or at least by your accountant, since your profit on the SSD will be higher.
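Purely illustrative arithmetic for that trade-off (the fractions are made up to show the shape of the argument, not measurements of any real controller):

# Illustrative budget arithmetic for the dumb-vs-smart controller argument.
controller_compute = 100              # normalize: $X of controller buys 100 units of compute
housekeeping_smart = 60               # assumed share spent on wear levelling, remapping, RMW...
housekeeping_dumb = 20                # ...versus a controller that leaves most of that to ZFS

io_smart = controller_compute - housekeeping_smart    # 40 units left for moving data
io_dumb = controller_compute - housekeeping_dumb      # 80 units left for moving data
print(io_dumb / io_smart)             # 2.0: same silicon budget, twice the I/O headroom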

I have not done the math here, but to me it isn't obvious that
the OS has spare cycles and bandwidth to do it, since space
reclaiming (compacting and erasing) could potentially draw much
more bandwidth than the actual workload, and since people have
had problems already with too few spare cycles on the X4500
if they want it to do something more than only being a
filer (and I guess that is why there now is an X4550).
The filesystem/OS will most probably *not* have most of the
data in local ram when reclaiming space/compacting memory,
it will most likely have to read it in to write it out again.

/ragge
The whole point behind ZFS is that CPU cycles are cheap and available, much more so than dedicated hardware of any sort. What I'm arguing here is that the controller on an SSD is in the same boat as a dedicated RAID HBA - in the latter case, use a cheap HBA instead and let the CPU & ZFS do the work, while in the former case, use a "dumb" controller for the SSD instead of a smart one.

I'm pretty sure compacting doesn't occur in ANY SSD without OS intervention (that is, the SSD itself doesn't do it), and I'd be surprised to see an OS try to implement some sort of intra-page compaction - the benefit doesn't seem to be there; it's better just to optimize writes than to try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to let the OS tell the SSD that a page can be marked Free for reuse, and an SSD isn't going to erase a page unless it's right before something is to be written to that page.
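To illustrate what I mean about TRIM and erasure, here's a toy free-list model (a sketch only - not any vendor's flash translation layer; the class and method names are made up):

# Minimal free-list model of the TRIM interaction described above.
class ToySSD:
    def __init__(self, n_pages):
        self.free = set(range(n_pages))   # pages known to hold no live data, already erased
        self.dirty = set()                # pages freed by TRIM but not yet erased

    def trim(self, page):
        """Host tells the drive (e.g. via the ATA TRIM command) that this page holds no live data."""
        self.dirty.add(page)

    def background_erase(self):
        """Erase dirty pages ahead of time, so writes don't have to wait for an erase."""
        while self.dirty:
            self.free.add(self.dirty.pop())

    def write(self):
        if not self.free:                 # no pre-erased page left: erase lands in the write path,
            self.background_erase()       # which is the slow case described above
        return self.free.pop()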



The X4500 was specifically designed to be a filer. It has more than enough CPU cycles to deal with pretty much all workloads it gets in that area - in fact, the major problem with the X4500 is insufficient response time of the SATA drives, which slows throughput. Sure, you can run other things on it, but it's really not designed for heavy-duty extra workloads - it's a disk server, not a compute server. I've run compressed zvols on it, and have no problem saturating the 4x1Gbit interfaces while still not pegging both CPUs. I'd imagine that it would start to run into problems with multiple 10Gbit Ethernet interfaces, but that's to be expected.

Bus bandwidth isn't really a concern, with SSDs using either SATA 3G or SAS 3G right now, and SATA 6G or SAS 12G in the near future. Likewise, the system bus isn't much of an immediate concern, as pretty much all SAS/SATA controllers use an 8x PCI-E attachment for no more than 8 devices (SAS controllers which support more than 8 devices almost always have several 8x attachments).
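Back-of-the-envelope link arithmetic (nominal rates; I'm assuming a PCI-E 2.0 x8 attachment and ignoring protocol overhead):

# Nominal link-rate arithmetic for the bus argument. 8b/10b encoding is assumed
# for both SATA 3G and PCI-E 2.0; protocol overhead is ignored.
sata_3g_mb_s = 3_000 / 10             # 3 Gbit/s link, 8b/10b -> ~300 MB/s per device
devices = 8
pcie_gen2_lane_mb_s = 5_000 / 10      # 5 GT/s per lane, 8b/10b -> ~500 MB/s
pcie_x8_mb_s = 8 * pcie_gen2_lane_mb_s

print(devices * sata_3g_mb_s, "MB/s of SATA vs", pcie_x8_mb_s, "MB/s of PCI-E x8")
# -> 2400.0 MB/s of SATA vs 4000.0 MB/s of PCI-E x8: the host attachment isn't the bottleneck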

And, as I pointed out in another message, doing it my way doesn't increase bus traffic that much over what is being done now, in any case.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
