As a further update, I went back and re-read my SSD controller info, and then did some more Googling.

Turns out, I'm about a year behind on State-of-the-SSD. Eric is correct on the way current SSDs implement writes (both SLC and MLC), so I'm issuing a mea culpa here. The change in implementation appears to have occurred sometime shortly after the introduction of the Indilinx controllers. My fault for not catching this.

-Erik



Eric D. Mudama wrote:
On Sat, Jan  2 at 22:24, Erik Trimble wrote:
In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you have a Page size of several multiples of that, 128k being common, but by no means ubiquitous.

I believe your terminology is crossed a bit.  What you call a block is
usually called a sector, and what you call a page is known as a block.

Sector is (usually) the unit of reading from the NAND flash.

The unit of write in NAND flash is the page, typically 2k or 4k
depending on NAND generation, and thus consisting of 4-8 ATA sectors
(typically).  A single page may be written at a time.  I believe some
vendors support partial-page programming as well, allowing a single
sector "append" type operation where the previous write left off.

Ordered pages are collected into the unit of erase, which is known as
a block (or "erase block"), and is anywhere from 128KB to 512KB or
more, depending again on NAND generation, manufacturer, and a bunch of
other things.

Some large number of blocks are grouped by chip enables, often 4K or
8K blocks.
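To make the sector/page/block hierarchy concrete, here is a small sketch with made-up (but plausible) geometry numbers; actual sector, page, and block sizes vary widely by NAND generation and manufacturer:

```python
# Illustrative NAND geometry only; real parts differ by generation/vendor.
SECTOR_SIZE = 512          # ATA sector: unit of host addressing/reading
PAGE_SIZE = 4 * 1024       # page: unit of programming (write)
BLOCK_SIZE = 256 * 1024    # erase block: unit of erase

sectors_per_page = PAGE_SIZE // SECTOR_SIZE   # 8 ATA sectors per page
pages_per_block = BLOCK_SIZE // PAGE_SIZE     # 64 pages per erase block

print(sectors_per_page, pages_per_block)      # 8 64
```

With these numbers, a 4k page holds 8 ATA sectors (within the "4-8 sectors" range above), and 64 pages make up one 256KB erase block.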

I think you're confusing erasing with writing.

When I say "minimum write size", I mean that for an MLC, no matter how small you make a change, the minimum amount of data actually being written to the SSD is a full page (128k in my example). There

Page is the unit of write, but it's much smaller in all NAND I am
aware of.

is no "append" down at this level. If I have a page of 128k, with data in 5 of the 2k blocks, and I then want to add another 2k of data to this, I have to READ all 5 2k blocks into the controller's DRAM, add the 2k of data to that, then write out the full amount to a new page (if available), or wait for an older page to be erased before writing to it. Thus, in this case, in order to do an actual 2k write, the SSD must first read 10k of data, do some compositing, then write 12k to a fresh page.
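As a rough sketch of the read-modify-write cost Erik describes (a hypothetical helper, assuming the page-granularity mapping in his example; Eric argues below that modern controllers avoid this scheme):

```python
def rmw_cost(valid_bytes, new_bytes):
    """Bytes moved for a small write under a scheme where the existing
    data must be read into DRAM, merged with the new data, and the
    composite rewritten to a fresh page. Illustrative only."""
    bytes_read = valid_bytes                   # existing data read into DRAM
    bytes_written = valid_bytes + new_bytes    # full composite written out
    amplification = bytes_written / new_bytes  # write amplification factor
    return bytes_read, bytes_written, amplification

# Erik's example: 5 valid 2k blocks plus a new 2k write.
r, w, amp = rmw_cost(5 * 2048, 2048)
print(r, w, amp)   # 10240 12288 6.0
```

So a logical 2k write turns into 10k of reads and 12k of writes, a 6x write amplification in this example.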

Thus, to change any data inside a single page, the entire contents of that page have to be read, the page modified, then the entire page written back out.

See above.

What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work differently, but still have problems with what I'll call "excess-writing".

I think you're only describing dumb SSDs with erase-block granularity
mapping. Most (all) vendors have moved away from that technique since
random write performance is awful in those designs and they fall over
dead from wAmp in a jiffy.

SLC and MLC NAND is similar, and they are read/written/erased almost
identically by the controller.

I'm not sure that SSDs actually _have_ to erase - they may just overwrite anything there with new data. But this is implementation dependent, so I can't say how /all/ MLC SSDs behave.

Technically you can program the same NAND page repeatedly, but since
bits can only transition from 1->0 on a program operation, the result
wouldn't be very meaningful.  An erase sets all the bits in the block
to 1, allowing you to store your data.

Once again, what I'm talking about is a characteristic of MLC SSDs, which are used in most consumer SSDs (the Intel X25-M included).

Sure, such an SSD will commit any new writes to pages drawn from the list of "never before used" NAND. However, at some point, this list becomes empty. In most current MLC SSDs, there's about 10% "extra" (a 60GB advertised capacity is actually ~54GB usable, with 6-8GB "extra"). Once this list is empty, the SSD has to start writing back to previously used pages, which may require an erase step before any write. Which is why MLC SSDs slow down drastically once they've been filled to capacity several times.
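The reserve arithmetic above, worked through with the figures from the paragraph (illustrative only; actual spare-area sizes vary by vendor and model):

```python
advertised_gb = 60
reserve_gb = advertised_gb // 10          # ~10% "extra" NAND: 6 GB reserve
usable_gb = advertised_gb - reserve_gb    # ~54 GB, matching the example

print(reserve_gb, usable_gb)              # 6 54
```

The reserve pool is what lets the controller keep pre-erased pages available; once steady-state writes outpace background erasing, performance drops toward the raw erase-then-program rate.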

From what I've seen, erasing a block typically takes time on the
same scale as programming an MLC page, meaning in flash with large
page counts per block, the % of time spent erasing is not very large.

Let's say that an erase took 100ms and a program took 10ms, in an MLC
NAND device with 100 pages per block.  In this design, it takes us 1s
to program the entire block, but only 1/10 of the time to erase it.
An infinitely fast erase would only make the design about 10% faster.
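Eric's amortization argument, sketched with the same hypothetical timings (100ms erase, 10ms program, 100 pages per block):

```python
ERASE_MS = 100.0
PROGRAM_MS = 10.0
PAGES_PER_BLOCK = 100

program_total_ms = PROGRAM_MS * PAGES_PER_BLOCK   # 1000 ms to fill a block
cycle_ms = ERASE_MS + program_total_ms            # 1100 ms erase + program
erase_share = ERASE_MS / cycle_ms                 # fraction of time erasing
speedup_if_free_erase = cycle_ms / program_total_ms

print(round(erase_share, 3), round(speedup_if_free_erase, 2))
```

Erasing is only ~9% of the erase-plus-program cycle here, so even an instantaneous erase would buy roughly a 1.1x improvement.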

For SLC the erase performance matters more since page writes are much
faster on average and there are half as many pages, but we were
talking MLC.

The performance differences seen are because they were artificially
fast to begin with because they were empty.  It's similar to
destroking a rotating drive in many ways to speed seek times.  Once
the drive is full, it all comes down to raw NAND performance,
controller design, reserve/extra area (or TRIM) and algorithmic
quality.

--eric




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
