Bob Friesenhahn wrote:
On Sat, 18 Apr 2009, Eric D. Mudama wrote:

What is tall about the SATA stack?  There's not THAT much overhead in
SATA, and there's no reason you would need to support any legacy
transfer modes or commands you weren't interested in.

If SATA is much more than a memcpy() then it is excessive overhead for a memory-oriented device. In fact, since the "device" is actually composed of quite a few independent memory modules, it should be possible to schedule I/O for each independent memory module in parallel. A large storage system will comprise tens, hundreds or even thousands of independent memory modules, so it does not make sense to serialize access via legacy protocols. The larger the storage device, the more it suffers from a serial protocol.

It's a mistake to think that flash looks similar to RAM. In lots of ways it doesn't; in fact, it looks more like a hard disk in many respects. ;-)

It's true that you will find lots of flash memory modules on an SSD. This is because each one is slow, so many are run in parallel and the data is transferred to lots of them at once, letting the device use a good proportion of the available SATA bandwidth (think of it like a mini RAID0 array). In the case of the SATA SSDs we sell for X and T series systems, there are 10 parallel flash channels in each one, which enables the device to achieve about 85% of the theoretical SATA bandwidth (way higher than any single hard drive can do, except to its cache).
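To make the RAID0 analogy concrete, here's a very rough sketch in C of how a host write can be dealt out across parallel channels. The channel count, page size and per-channel speed are invented for illustration, not numbers from any real controller:

/* Toy sketch: stripe a host write across parallel flash channels,
 * RAID0-style.  All figures below are made up for illustration. */
#include <stdio.h>

#define NCHANNELS 10      /* parallel flash channels (illustrative) */
#define PAGE_SIZE 4096    /* bytes per flash page (illustrative) */

int main(void)
{
    /* A 64 KB host write is split into pages and dealt out
     * round-robin, so all channels can transfer at the same time. */
    unsigned long write_bytes = 64 * 1024;
    unsigned long pages = write_bytes / PAGE_SIZE;

    for (unsigned long p = 0; p < pages; p++)
        printf("page %lu -> channel %lu\n", p, p % NCHANNELS);

    /* If each channel sustained, say, ~25 MB/s (a made-up figure),
     * ten in parallel would give ~250 MB/s, i.e. most of a ~300 MB/s
     * SATA-II link -- which is the sort of thing behind the ~85%
     * mentioned above. */
    return 0;
}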

Also, like a hard disk, flash blocks go bad, and again like a disk, the SSD has spare blocks to use as replacements, and includes bad block handling logic in its controller to map these in when required. Over the life of an Enterprise class SSD, the controller actually expects many more flash block failures than you would ever see on a working hard disk, and there is consequently a much larger proportion of spare flash memory included than a hard drive would normally have, in order to achieve the same life. (Unlike a hard disk, blocks tend to die gradually, so the flash controller can normally detect them getting weak and map in replacement blocks long before any user data is lost.)
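The remapping itself is conceptually simple. A toy sketch (invented names and sizes, nothing like real firmware) might look like:

/* Toy bad-block remapping: a lookup from logical block to physical
 * flash block, with weak blocks retired to a spare pool.
 * Sizes and names are invented for illustration only. */
#include <stdio.h>

#define DATA_BLOCKS  1000
#define SPARE_BLOCKS  100   /* generous spare pool, as described above */

static unsigned map[DATA_BLOCKS];   /* logical -> physical block */
static unsigned next_spare = DATA_BLOCKS;

/* Called when the controller notices a block getting weak
 * (e.g. rising error-correction activity), before data is lost. */
static void retire_block(unsigned logical)
{
    if (next_spare < DATA_BLOCKS + SPARE_BLOCKS) {
        /* copy the data out of the weak block first (omitted),
         * then point the logical block at a spare */
        map[logical] = next_spare++;
    }
}

int main(void)
{
    for (unsigned i = 0; i < DATA_BLOCKS; i++)
        map[i] = i;                  /* identity mapping to start */

    retire_block(42);                /* block 42 starts to wear out */
    printf("logical 42 now lives in physical block %u\n", map[42]);
    return 0;
}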

One departure from a hard disk is that flash blocks wear out according to how much they're used. Most filesystems have blocks in some positions which are used much more than others (e.g. superblocks, uberblocks, etc), and these are normally really critical to the filesystem. Designers of SSDs know that it would be completely unacceptable for such critical blocks to fail quickly -- that would in effect mean the SSD had a very short life, even though most of it would still be fine when it became useless. To counteract this, the on-board SSD controller implements a feature called wear leveling. What this does is move the logical block numbers around among the physical flash blocks, so that all the blocks wear at the same rate. So you can sit there continually rewriting block 0, and you won't wear out the first flash block: the controller will keep moving where it stores block 0 in flash, all the flash memory wears evenly, and you get the longest possible life from the SSD.
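A toy version of the idea, again with everything invented for illustration (real controllers are far more sophisticated), could be as simple as steering each rewrite to the least-worn free block:

/* Toy wear-levelling sketch: every rewrite of logical block 0 is
 * steered to the free physical block with the fewest erases, so
 * repeatedly rewriting "block 0" still spreads wear evenly. */
#include <stdio.h>

#define NBLOCKS 8

static unsigned erase_count[NBLOCKS];
static int      in_use[NBLOCKS];
static int      map0 = 0;            /* where logical block 0 lives */

static int least_worn_free(void)
{
    int best = -1;
    for (int i = 0; i < NBLOCKS; i++)
        if (!in_use[i] && (best < 0 || erase_count[i] < erase_count[best]))
            best = i;
    return best;
}

static void rewrite_block0(void)
{
    int dst = least_worn_free();
    in_use[map0] = 0;                 /* old copy becomes free... */
    erase_count[map0]++;              /* ...after an erase */
    in_use[dst] = 1;
    map0 = dst;
}

int main(void)
{
    in_use[0] = 1;                    /* logical 0 starts in physical 0 */
    for (int i = 0; i < 100; i++)
        rewrite_block0();
    for (int i = 0; i < NBLOCKS; i++)
        printf("physical %d erased %u times\n", i, erase_count[i]);
    return 0;
}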

When you've considered these (and doubtless other) issues, it should become clear why it makes good sense to build the flash memory we currently have available into something resembling a disk. It really looks nothing like DRAM memory. I'm sure that in time new flash technologies will appear, and it may make sense to present those through different interfaces.

--
Andrew