Bob Friesenhahn wrote:
> On Sat, 18 Apr 2009, Eric D. Mudama wrote:
>> What is tall about the SATA stack? There's not THAT much overhead in
>> SATA, and there's no reason you would need to support any legacy
>> transfer modes or commands you weren't interested in.
>
> If SATA is much more than a memcpy() then it is excessive overhead for
> a memory-oriented device. In fact, since the "device" is actually
> comprised of quite a few independent memory modules, it should be
> possible to schedule I/O for each independent memory module in
> parallel. A large storage system will be comprised of tens, hundreds
> or even thousands of independent memory modules so it does not make
> sense to serialize access via legacy protocols. The larger the
> storage device, the more it suffers from a serial protocol.
It's a mistake to think that flash looks similar to RAM. It doesn't, in
lots of ways -- actually it looks more like a hard disk in many
respects ;-)
It's true that you will find lots of flash memory modules on an SSD.
This is because individual flash chips are slow, so many of them are run
in parallel, with data striped across them, in order to make good use of
the available SATA bandwidth (think of it like a mini RAID0 array). In
the case of the SATA SSDs we sell for X and T series systems, there are
10 parallel flash channels in each one, which enables the device to
achieve about 85% of the theoretical SATA bandwidth (way higher than any
single hard drive can do, except to its cache).
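
To make the parallelism concrete, here is a toy sketch in C of how a
controller might stripe one large host transfer page-by-page across its
channels so that all of them are kept busy at once. The channel count,
page size and function names are purely illustrative assumptions, not
taken from any real firmware.

/* Toy sketch: striping one logical write across parallel flash
 * channels. All sizes and names are made up for illustration. */
#include <stdio.h>

#define CHANNELS   10       /* parallel flash channels, as above */
#define PAGE_BYTES 4096     /* hypothetical flash page size */

static void flash_program(int channel, unsigned long page,
                          const unsigned char *buf)
{
    /* stand-in for issuing a program operation on one channel */
    printf("channel %d: program page %lu\n", channel, page);
    (void)buf;
}

int main(void)
{
    static unsigned char data[CHANNELS * PAGE_BYTES];

    /* One large host write is split page-by-page across all the
     * channels, so the flash programs can proceed in parallel
     * rather than one after another. */
    for (int i = 0; i < CHANNELS; i++)
        flash_program(i, 0, data + (size_t)i * PAGE_BYTES);

    return 0;
}
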
Also, like a hard disk, flash blocks go bad, and again like a disk, the
SSD has spare blocks to use as replacements, and includes bad block
handling logic in its controller to map these in when required. Over the
life of an enterprise-class SSD, the controller actually expects many
more flash block failures than you would ever see on a working hard
disk, and there is consequently a much larger proportion of spare flash
memory included than a hard drive will normally have, in order to
achieve the same life. (Unlike a hard disk, blocks tend to die
gradually, so the flash controller can normally detect them getting weak
and map to replacement blocks long before any user data is lost.)
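
As a rough sketch of how that remapping can work, here is a small C
illustration assuming a hypothetical logical-to-physical block map and a
fixed pool of spares; the names and sizes are invented, not taken from
any particular SSD's firmware.

/* Toy sketch of bad-block remapping with a spare pool. */
#include <stdio.h>

#define USER_BLOCKS  1024   /* blocks visible to the host */
#define SPARE_BLOCKS 128    /* extra physical blocks held in reserve */

static int map[USER_BLOCKS];         /* logical -> physical block */
static int next_spare = USER_BLOCKS; /* first unused spare */

static void init_map(void)
{
    for (int i = 0; i < USER_BLOCKS; i++)
        map[i] = i;                  /* identity mapping to start */
}

/* Called when the controller notices a physical block going weak:
 * copy its contents to a spare and repoint the logical block. */
static int retire_block(int logical)
{
    if (next_spare >= USER_BLOCKS + SPARE_BLOCKS)
        return -1;                   /* out of spares: end of life */
    /* (the data copy to the spare block would happen here) */
    map[logical] = next_spare++;
    return 0;
}

int main(void)
{
    init_map();
    retire_block(42);
    printf("logical 42 now lives in physical block %d\n", map[42]);
    return 0;
}

The important point is that the host keeps addressing "logical block 42"
over SATA and never sees the substitution happen underneath it.
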
One departure from a hard disk is that flash blocks wear out according
to how much they're used. Most filesystems have blocks in some positions
which are used much more than others (e.g. superblocks, uberblocks,
etc), and these are normally really critical to the filesystem.
Designers of SSDs know that it would be completely unacceptable for such
critical blocks to fail quickly -- that would in effect mean the SSD had
a very short life, even though most of its flash would still be fine
when the device became useless. To counteract this, the on-board SSD
controller implements a feature called wear leveling. What this does is
remap the logical block numbers across the physical flash blocks, so
that all the blocks wear at the same rate. You can sit there continually
rewriting block 0 and you won't wear out the first flash block, because
the controller keeps moving where it stores block 0 in flash; the wear
is spread evenly across the flash memory, and you get the longest
possible life from the SSD.
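
As a very rough illustration of the idea, here is a C sketch assuming a
hypothetical controller that simply directs each rewrite of logical
block 0 to whichever physical block currently has the lowest erase
count; real wear-leveling firmware is considerably smarter, but the
effect is the same.

/* Toy sketch of wear leveling: repeated writes to logical block 0
 * land on the least-erased physical block, spreading the wear. */
#include <stdio.h>

#define BLOCKS 8                /* tiny illustrative device */

static int map0;                /* physical home of logical block 0 */
static int erase_count[BLOCKS];

static int least_worn(void)
{
    int best = 0;
    for (int i = 1; i < BLOCKS; i++)
        if (erase_count[i] < erase_count[best])
            best = i;
    return best;
}

static void rewrite_logical_0(void)
{
    int target = least_worn(); /* pick the least-used physical block */
    erase_count[target]++;     /* erase-before-program wears it a bit */
    map0 = target;             /* logical block 0 now lives there */
}

int main(void)
{
    for (int i = 0; i < 100; i++)
        rewrite_logical_0();

    for (int i = 0; i < BLOCKS; i++)
        printf("physical block %d erased %d times\n", i, erase_count[i]);
    return 0;
}

After 100 rewrites of "block 0" the erase counts come out nearly equal
across all eight physical blocks, which is exactly the behaviour the
wear-leveling logic is there to produce.
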
When you've considered these (and doubtless other) issues, it should
become clear why it makes good sense to build the type of flash memory
we currently have available into something resembling a disk. It really
looks nothing like DRAM. I'm sure that in time new flash technologies
will appear, and it may then make sense to present them through
different interfaces.
--
Andrew
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss