> From: Deano [mailto:de...@rattie.demon.co.uk]
>
> Hi Edward,
> Do you have a source for the 8KiB block size data? Whilst we can't avoid
> the SSD controller, in theory we can change the smallest size we present
> to the SSD to 8KiB fairly easily... I wonder if that would help the
> controller do a better job (especially with TRIM).
>
> I might have to do some tests; so far the assumption (even inside Sun's
> sd driver) is that SSDs are really 4KiB even when they claim 512B.
> Perhaps we should have an 8KiB option...
It's hard to say precisely where the truth lies, so I'll just tell a story; take from it what you will.

For me, it started when I began deploying new laptops with SSDs. There was a problem with the backup software, so I kept reimaging machines using "dd", then backing up and restoring with Acronis, and when that failed, restoring again via dd, etc etc etc. So I kept overwriting the drive repeatedly. After only 2-3 iterations, performance degraded to around 50% of the original speed.

At work, we have a team of engineers who know flash intimately, so I asked them about flash performance degrading with usage. The first response was that each time a cell is erased and rewritten, the data isn't written as cleanly as before. Like erasing pencil or a chalkboard and rewriting over and over, it becomes "smudgy." So with repetition and age, the device becomes slower and consumes more power, because there's a higher incidence of errors, a higher requirement for error correction, and more repeating of operations with varying operating parameters on the chips. All of this is invisible to the OS but affects performance internally. But then I said I was getting a 50% loss after only 2-3 iterations, so this life degradation was clearly not the issue: it only becomes significant after tens of thousands of iterations, or more. They suggested the cause of the problem had to be something in the controller, not in the flash itself.

So I kept working on it, and I found this:
http://www.pcper.com/article.php?aid=669&type=expert
(see the section on Write Combining)

Rather than reading that whole article, the most valuable thing to come out of it is a set of useful search terms:

    ssd "write combining"
    ssd internal fragmentation
    ssd sector remapping

This is very similar to ZFS write aggregation: the controller is combining small writes into larger blocks and taking advantage of block remapping to keep track of it all. You gain performance during lots of small writes, and it does not hurt you for lots of small random reads. But it does hurt you for sequential reads/writes that happen after the remapping (see the toy simulation below). Also, unlike ZFS, the drive can't recover after the fact, when data gets deleted or moved or overwritten; it doesn't have any way to straighten itself out, except TRIM.

After discovering this, I went back to the flash guys at work and explained the internal-fragmentation idea. One of the head engineers was there at the time, and he's the one who told me flash is made in 8K pages. "To flash manufacturers, SSD's are the pimple on the butt of the elephant" was his statement.

Unfortunately, hard disks and OSes historically both used 512B sectors. Then hard drives started using 4K sectors, but to maintain compatibility with OSes, they still emulate 512B on the interface. The OS assumes the disk is doing this, so it aligns 512B writes to 4K multiples in order to avoid the read/modify/write. Unfortunately, now the SSDs are using an 8K physical page size and emulating god knows what (4K or 512B) on the interface, so the RMW is once again necessary until OSes become aware and start aligning on 8K pages instead (the second sketch below walks through the arithmetic). But then even that doesn't matter anymore: thanks to sector remapping and write combining, even if your OS is intelligent enough, you're still going to end up with fragmentation anyway, unless the OS pads every write out to a full 8K page.

But getting back to the point.
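To picture what that remapping does, here's a toy simulation. The geometry is entirely hypothetical (8K pages, each holding two 4K logical blocks) and the packing policy is the simplest one imaginable; real controllers are far more elaborate:

    /* Toy sketch: why write combining fragments later sequential reads.
     * Hypothetical geometry: 8K pages, 4K logical blocks, so 2 logical
     * blocks per physical page.  The controller packs incoming writes
     * into pages in ARRIVAL order and remembers the mapping. */
    #include <stdio.h>

    #define NBLOCKS  8           /* logical 4K blocks */
    #define PER_PAGE 2           /* 4K blocks per 8K page */

    int main(void)
    {
        /* Small random writes arriving in this (non-sequential) order: */
        int arrival[NBLOCKS] = { 5, 0, 3, 6, 1, 7, 2, 4 };
        int phys_page[NBLOCKS];  /* logical block -> physical page */

        /* Controller packs writes into pages as they arrive. */
        for (int i = 0; i < NBLOCKS; i++)
            phys_page[arrival[i]] = i / PER_PAGE;

        /* A later sequential read of blocks 0..7 now bounces between
         * physical pages instead of streaming through them in order: */
        printf("logical -> physical page: ");
        for (int b = 0; b < NBLOCKS; b++)
            printf("%d->%d ", b, phys_page[b]);
        printf("\n");
        return 0;
    }

A later sequential read of logical blocks 0..7 lands on physical pages 0,2,3,1,3,0,1,2: the logical layout is contiguous but the flash layout isn't, which is exactly the fragmentation that slows sequential I/O.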
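And to make the alignment arithmetic in the 512B/4K/8K paragraph concrete, a second sketch. It assumes an 8K physical page (per the engineer's comment above; no real drive publishes these numbers), and counts a read-modify-write for each partially-covered page:

    /* Sketch: count read-modify-write cycles for a single write,
     * assuming a hypothetical 8K physical flash page. */
    #include <stdio.h>

    #define PAGE 8192LL    /* assumed physical page size, in bytes */

    /* How many pages does a write touch, and how many of those are
     * only partially covered (forcing the controller to read the old
     * page, merge, and rewrite)? */
    static void rmw_cost(long long off, long long len)
    {
        long long first = off / PAGE;
        long long last  = (off + len - 1) / PAGE;
        int partial = 0;

        if (off % PAGE != 0)          partial++;   /* ragged head */
        if ((off + len) % PAGE != 0)  partial++;   /* ragged tail */
        if (first == last && partial) partial = 1; /* both in one page */

        printf("write off=%-5lld len=%-5lld -> %lld page(s), %d RMW\n",
               off, len, last - first + 1, partial);
    }

    int main(void)
    {
        rmw_cost(0,    4096);  /* 4K-aligned: still half an 8K page */
        rmw_cost(0,    8192);  /* full, aligned 8K page: no RMW */
        rmw_cost(4096, 8192);  /* 4K-aligned but straddles two pages */
        return 0;
    }

Note the first case: a perfectly 4K-aligned OS still forces a read-modify-write on an 8K-page drive, which is the compatibility gap described above.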
The question I think you're asking is how to verify the existence of the 8K physical page inside the SSD. There are two ways to prove it that I can think of:

(a) Rip apart your SSD, hope you can read the chip numbers, and hope you can find specs for those chips to confirm or deny the 8K pages.

(b) TRIM your entire drive and see if it returns to its original performance afterward. This can be done via HDDErase, but that requires temporarily switching into ATA mode, booting from a DOS disk, and then putting the controller back into AHCI mode afterward. I went as far as switching into ATA mode, but creating the DOS disk was going to be a rathole, so I decided to call it quits and assume I had the right answer with a high enough degree of confidence.

Since performance is only degraded for sequential operations, I will see degradation for OS rebuilds, but users probably won't notice.
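If anyone wants to quantify the sequential degradation for option (b), here's a rough sketch of a crude throughput check to run before and after the whole-drive erase. The device path, sizes, and everything else here are placeholders of mine, not recommendations:

    /* Sketch: crude sequential-read throughput check.  Run it against
     * the raw device (or one large file) before secure-erasing/TRIMming
     * the whole drive, then again after, and compare the two numbers.
     * The default path below is a placeholder. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    #define CHUNK (1 << 20)      /* 1 MiB per read */
    #define TOTAL 1024           /* read 1 GiB in total */

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1]
                                   : "/dev/rdsk/c0t0d0s0"; /* placeholder */
        char *buf = malloc(CHUNK);
        int fd = open(dev, O_RDONLY);
        if (fd < 0 || buf == NULL) { perror("setup"); return 1; }

        time_t t0 = time(NULL);
        for (int i = 0; i < TOTAL; i++)
            if (read(fd, buf, CHUNK) != CHUNK) { perror("read"); return 1; }
        time_t t1 = time(NULL);

        printf("%d MiB in %ld s (~%ld MiB/s)\n", TOTAL, (long)(t1 - t0),
               (long)(t1 > t0 ? TOTAL / (t1 - t0) : 0));
        free(buf);
        close(fd);
        return 0;
    }

If the internal-fragmentation theory is right, the post-erase run should come back close to the drive's rated sequential speed.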