Ragnar Sundblad wrote:
On 3 jan 2010, at 04.19, Erik Trimble wrote:
Let's say I have 4k blocks, grouped into a 128k page.  That is, the SSD's 
fundamental minimum unit size is 4k, but the minimum WRITE size is 128k.  Thus, 
32 blocks in a page.
Do you know of SSD disks that have a minimum write size of
128 KB? I don't understand why it would be designed that way.

A typical flash chip has pretty small write block sizes, like
2 KB or so, but they can only erase in pages of 128 KB or so.
(And then you are running a few of those in parallel to get some
speed, so these numbers often multiply with the number of
parallel chips, like 4 or 8 or so.)
Typically, you have to write the 2 KB blocks consecutively
in a page. Pretty much all set up for an append-style system.
:-)

In addition, flash SSDs typically have some DRAM write buffer
that buffers up writes (like a txg, if you will), so small
writes should not be a problem - just collect a few and append!
In MLC-style SSDs, you typically have a block size of 2k or 4k. However, the page size is many multiples of that, with 128k being common but by no means ubiquitous.

I think you're confusing erasing with writing.

When I say "minimum write size", I mean that for an MLC SSD, no matter how small a change you make, the minimum amount of data actually written to the NAND is a full page (128k in my example). There is no "append" down at this level. If I have a 128k page with data in 5 of its 4k blocks, and I then want to add another 2k of data, the controller has to READ all 5 4k blocks into its DRAM, add the 2k of data to them, then write the whole thing out to a new page (if one is available), or wait for an older page to be erased before writing to it. So, in order to do an actual 2k write, the SSD must first read 20k of data, do some compositing, then write 24k to a fresh page. Thus, to change any data inside a single page, the entire contents of that page have to be read, the page modified, then the entire page written back out.
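Here's a quick sketch to make that concrete (illustrative only; it just assumes the 4k-block / 128k-page geometry above, not any particular controller's firmware):

# Sketch of the read-modify-write cycle described above.
# Assumed geometry (illustrative only): 4 KB blocks, 128 KB pages -> 32 blocks per page.
BLOCK = 4 * 1024
PAGE = 128 * 1024
BLOCKS_PER_PAGE = PAGE // BLOCK      # 32

def rmw_cost(valid_blocks, new_bytes):
    """Bytes the controller must move to apply a small append to a partially filled page."""
    read_bytes = valid_blocks * BLOCK                       # pull the existing data into DRAM
    blocks_out = valid_blocks + -(-new_bytes // BLOCK)      # round the new data up to whole blocks
    write_bytes = blocks_out * BLOCK                        # program a fresh (or newly erased) page
    return read_bytes, write_bytes

r, w = rmw_cost(5, 2 * 1024)         # the example above: 5 valid 4k blocks, 2k appended
print(r // 1024, "KB read,", w // 1024, "KB written")       # 20 KB read, 24 KB written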



So, I write a bit of data 100k in size. This occupies the first 25 blocks of
the page. The remaining 7 blocks are still on the SSD's Free List (i.e. the
list of free space).

Now, I want to change the last byte of the file, and add 10k more to the file.  
Currently, a non-COW filesystem will simply send the 1 byte modification 
request and the 10k addition to the SSD (all as one unit, if you are lucky - if 
not, it comes as two ops: 1 byte modification followed by a 10k append).   The 
SSD now has to read all 25 blocks of the page back into its local cache on the 
controller, do the modification and append computation, then write out 28 blocks 
to NAND.  In all likelihood, if there is any extra pre-erased (or never written 
to) space on the drive, this 28 block write will go to a whole new page.  The 
blocks in the original page will be moved over to the SSD Free List (and may or 
may not be actually erased, depending on the controller).

Do you know for sure that you have SSD flash disks that
work this way? It seems incredibly stupid. It would also
use up the available erase cycles much faster than necessary.
What write speed do you get?
What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I'll call "excess-writing".
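To put rough write-amplification numbers on the 100k example quoted above (again a sketch; the assumptions are noted in the comments and aren't measurements of any specific drive):

# Write-amplification arithmetic for the 100k example quoted above.
# Assumptions (illustrative): 4 KB blocks, 128 KB pages, and a controller that
# rewrites every valid block of the page whenever any part of it changes.
BLOCK_KB = 4

file_kb = 100
blocks_before = -(-file_kb // BLOCK_KB)                  # 25 blocks occupied

appended_kb = 10                                         # plus a 1-byte modification, ignored here
blocks_after = -(-(file_kb + appended_kb) // BLOCK_KB)   # 28 blocks written back out

physical_write_kb = blocks_after * BLOCK_KB              # 112 KB actually programmed
print(blocks_before, blocks_after, physical_write_kb / appended_kb)   # 25 28 11.2 (x amplification)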


And as far as I know, zfs can not do that today - it can not
move around already written data, not for defragmentation, not
for adding or removing disks to stripes/raidz:s, not for
deduping/duping and so on, and I have understood that
BP Rewrite could solve a lot of this.
ZFS's propensity to fragmentation doesn't mean you lose space. Rather, it means that COW often results in frequently-modified files being distributed over the entire media, rather than being contiguous. So, over time, the actual media has very little (if any) contiguous free space, which is what the fragmentation problem is. BP rewrite will indeed allow us to create a de-fragger. Areas which used to hold a ZFS block (now vacated by a COW to somewhere else) are simply added back to the device's Free List. Now, in an SSD's case, this isn't a worry. Due to the completely even performance characteristics of NAND, it doesn't make any difference if the physical layout of a file happens to be sections (e.g. ZFS blocks) scattered all over the SSD.
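Here's a toy model of why that scattering happens (not ZFS's actual allocator, just a sketch of copy-on-write allocation against a free list):

# Toy copy-on-write allocator (not ZFS's allocator) showing why frequently
# modified blocks end up scattered: each rewrite lands in a new physical slot
# and the old slot is simply returned to the free list.
free_list = list(range(64))                       # 64 physical slots, allocated in order
file_map = [free_list.pop(0) for _ in range(8)]   # 8-block file laid out contiguously: 0..7

for _ in range(20):                               # keep rewriting logical block 3
    old = file_map[3]
    file_map[3] = free_list.pop(0)                # COW: the new copy goes to the next free slot
    free_list.append(old)                         # the old slot is freed for later reuse

print(file_map)                                   # -> [0, 1, 2, 27, 4, 5, 6, 7]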

Yes, there is something to worry about, as you can only
erase flash in large pages - you can not erase them only where
the free data blocks in the Free List are.
I'm not sure that SSDs actually _have_ to erase - they may just overwrite whatever is there with new data. But this is implementation dependent, so I can't say how /all/ MLC SSDs behave.

(I don't think they typically merge pages; I believe they rather
just pick pages with some freed blocks, copy the active blocks
to the "end" of the disk, and erase the page.)

Well, the algorithms are often trade secrets, and if what you say
is correct, and it was my product, then I wouldn't even want to
tell anyone about it, since it would be a horrible waste of both
bandwidth and erase cycles. Using up the 10000 erase cycles of
an MLC device 64 times faster than necessary seems like an
extremely bad idea. But there sure is a lot of crap out there,
I can't say you are wrong (only hope :-).

I doubt for example the F5100 works that way, it would be hard to
get ~15000 4KB w/s per "flash-SODIMM" if it behaved like that
(you typically can erase only 500-1000 pages a second, for
example).
I doubt the Intel X25 works that way, as their read performance
suffers if they are written with smaller blocks and get internally
fragmented - that problem could not exist if they always filled
complete new pages in a R-M-W manner.
Once again, what I'm talking about is a characteristic of MLC SSDs, which are used in most consumer SSDs (the Intel X25-M included). Sure, such an SSD will commit any new writes to pages drawn from the list of "never before used" NAND. However, at some point, this list becomes empty. In most current MLC SSDs, there's about 10% "extra" (a 60GB advertised capacity is actually ~54GB usable, with 6-8GB "extra"). Once this list is empty, the SSD has to start writing back to previously used pages, which may require an erase step before any write. Which is why MLC SSDs slow down drastically once they've been filled to capacity several times.
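Rough numbers for the spare-area argument (the raw-NAND figure below is an assumption for illustration; real drives vary):

# Spare-area arithmetic for a consumer MLC drive. The 64 GiB raw figure is an
# assumption for illustration, not any particular drive's spec.
GIB = 2**30
advertised_bytes = 60 * 10**9        # "60 GB" on the box, in decimal gigabytes
raw_bytes = 64 * GIB                 # assumed raw NAND complement
spare_bytes = raw_bytes - advertised_bytes

print(round(advertised_bytes / GIB, 1), "GiB usable,", round(spare_bytes / GIB, 1), "GiB spare")
# -> 55.9 GiB usable, 8.1 GiB spare. Once that pool of never-written pages is gone,
# a new write may need an erase first, which is where the post-fill slowdown comes from.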

I am not convinced the compute amount needs to be fixed, or
even that they by their nature need to be as cheap as possible -
if that hurts performance. People are obviously willing to pay
quite a lot to get high perf disk systems. The best flash SSDs
out there are quite expensive. In addition, the number of
transistors per area (and per monetary unit) tends to increase
with time (that Intel guy had some saying about that... :-).
My point there is that if you build a controller for $X, that gets you Y compute ability. With a dumb controller, less of that Y is used up by "housekeeping" functions for the SSD, and thus more is available to manage I/O, than with a smart controller.

Put it another way: for a given throughput X, it will cost less to build a dumb controller than a smart one. And, yes, price is a concern, even at the Enterprise level. Being able to build a dumb controller for 50% (or less) of the cost of a smart controller is likely to get you noticed by your customers. Or at least by your accountant, since your profit on the SSD will be higher.
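Purely illustrative arithmetic for that trade-off (the fractions are made up to show the shape of the argument, not measurements of any real controller):

# Illustrative budget arithmetic for the dumb-vs-smart controller argument.
controller_compute = 100              # normalize: $X of controller buys 100 units of compute
housekeeping_smart = 60               # assumed share spent on wear levelling, remapping, RMW...
housekeeping_dumb = 20                # ...versus a controller that leaves most of that to ZFS

io_smart = controller_compute - housekeeping_smart    # 40 units left for moving data
io_dumb = controller_compute - housekeeping_dumb      # 80 units left for moving data
print(io_dumb / io_smart)             # 2.0: same silicon budget, twice the I/O headroom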

I have not done the math here, but to me it isn't obvious that
the OS has spare cycles and bandwidth to do it, since space
reclaiming (compacting and erasing) could potentially draw much
more bandwidth than the actual workload, and since people have
had problems already with too few spare cycles on the X4500
if they want it to do something more than only being a
filer (and I guess that is why there now is an X4550).
The filesystem/OS will most probably *not* have most of the
data in local ram when reclaiming space/compacting memory,
it will most likely have to read it in to write it out again.

/ragge
The whole point behind ZFS is that CPU cycles are cheap and available, much more so than dedicated hardware of any sort. What I'm arguing here is that the controller on an SSD is in the same boat as a dedicated RAID HBA - in the latter case, use a cheap HBA instead and let the CPU & ZFS do the work, while in the former case, use a "dumb" controller for the SSD instead of a smart one.

I'm pretty sure compacting doesn't occur in ANY SSD without OS intervention (that is, the SSD itself doesn't do it), and I'd be surprised to see an OS try to implement some sort of intra-page compaction - the benefit doesn't seem to be there; it's better just to optimize writes than to try to compact existing pages. As far as reclaiming unused space, the TRIM command is there to let the OS tell the SSD that a page can be marked Free for reuse, and an SSD isn't going to erase a page unless it's right before something is to be written to that page.
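To illustrate what I mean about TRIM and erasure, here's a toy free-list model (a sketch only - not any vendor's flash translation layer; the class and method names are made up):

# Minimal free-list model of the TRIM interaction described above.
class ToySSD:
    def __init__(self, n_pages):
        self.free = set(range(n_pages))   # pages known to hold no live data, already erased
        self.dirty = set()                # pages freed by TRIM but not yet erased

    def trim(self, page):
        """Host tells the drive (e.g. via the ATA TRIM command) that this page holds no live data."""
        self.dirty.add(page)

    def background_erase(self):
        """Erase dirty pages ahead of time, so writes don't have to wait for an erase."""
        while self.dirty:
            self.free.add(self.dirty.pop())

    def write(self):
        if not self.free:                 # no pre-erased page left: erase lands in the write path,
            self.background_erase()       # which is the slow case described above
        return self.free.pop()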



The X4500 was specifically designed to be a filer. It has more than enough CPU cycles to deal with pretty much all workloads it gets in that area - in fact, the major problem with the X4500 is insufficient response time of the SATA drives, which slows throughput. Sure, you can run other things on it, but it's really not designed for heavy-duty extra workloads - it's a disk server, not a compute server. I've run compressed zvols on it, and have no problem saturating the 4x1Gbit interfaces while still not pegging both CPUs. I'd imagine that it would start to run into problems with multiple 10Gbit Ethernet interfaces, but that's to be expected.

Bus bandwidth isn't really a concern, with SSDs using either SATA 3G or SAS 3G right now, and SATA 6G or SAS 12G in the near future. Likewise, the system bus isn't much of an immediate concern, as pretty much all SAS/SATA controllers use an 8x PCI-E attachment for no more than 8 devices (SAS controllers which support more than 8 devices almost always have several 8x attachments).
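Back-of-the-envelope link arithmetic (nominal rates; I'm assuming a PCI-E 2.0 x8 attachment and ignoring protocol overhead):

# Nominal link-rate arithmetic for the bus argument. 8b/10b encoding is assumed
# for both SATA 3G and PCI-E 2.0; protocol overhead is ignored.
sata_3g_mb_s = 3_000 / 10             # 3 Gbit/s link, 8b/10b -> ~300 MB/s per device
devices = 8
pcie_gen2_lane_mb_s = 5_000 / 10      # 5 GT/s per lane, 8b/10b -> ~500 MB/s
pcie_x8_mb_s = 8 * pcie_gen2_lane_mb_s

print(devices * sata_3g_mb_s, "MB/s of SATA vs", pcie_x8_mb_s, "MB/s of PCI-E x8")
# -> 2400.0 MB/s of SATA vs 4000.0 MB/s of PCI-E x8: the host attachment isn't the bottleneck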

And, as I pointed out in another message, doing it my way doesn't increase bus traffic that much over what is being done now, in any case.

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
