Eric D. Mudama did a very good job answering this, and I don't have much to add. Thanks Eric!
On 3 jan 2010, at 07.24, Erik Trimble wrote:

> I think you're confusing erasing with writing.

I am now quite certain that it actually was you who were confusing those. I hope this discussion has cleared things up a little, though.

> What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work
> differently, but still have problems with what I'll call "excess-writing".

Eric already said it, but I need to say this myself too: SLC and MLC disks could be almost identical; only the storing of the bits in the flash chips differs a little (1 or 2 bits per storage cell). There is absolutely no other fundamental difference between the two. Hopefully no modern MLC *or* SLC disk works as you described, since it is a horrible design, and selling it would be close to robbery. It would be slow and it would wear out quite fast.

Now, SLC disks are typically better overall, because those who want to pay for SLC flash typically also want to pay for better controllers, but otherwise those issues are really orthogonal.

> I'm not sure that SSDs actually _have_ to erase - they just overwrite
> anything there with new data. But this is implementation dependent, so I can
> say how /all/ MLC SSDs behave.

As Eric said - yes, you have to erase, otherwise you can't write new data. It is not implementation dependent, it is inherent in the flash technology. And, as has been said several times now, erasing can only be done in large chunks, but writing can be done in small chunks. I'd say that this is the main problem to handle when creating a good flash SSD.

> The whole point behind ZFS is that CPU cycles are cheap and available, much
> more so than dedicated hardware of any sort. What I'm arguing here is that
> the controller on an SSD is in the same boat as a dedicated RAID HBA - in
> the latter case, use a cheap HBA instead and let the CPU & ZFS do the work,
> while in the former case, use a "dumb" controller for the SSD instead of a
> smart one.

This could be true, I am still not sure. My main issues with this are that it would make the file system code dependent on a specific hardware behavior (that of today's flash chips), and that it could be quite a lot of data to shuffle around when compacting. But we'll see. If it could be made cheap enough, it could absolutely happen and be worth it even if it has some drawbacks.

> And, as I pointed out in another message, doing it my way doesn't increase
> bus traffic that much over what is being done now, in any case.

Yes, it would increase bus traffic: if you handle the flash compacting in the host - which you have to with your idea - it could be many times the real workload bandwidth. But it could still be worth it, that is quite possible.
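To put rough numbers on "many times the real workload bandwidth", here is a small, purely illustrative back-of-the-envelope calculation of host-side compaction traffic. The 50 MB/s workload, 512 KB erase block, 4 KB page and the valid-data fractions are all made-up example figures, not measurements; the only assumption is that every still-valid page in a victim erase block has to cross the bus twice (read out, then written back) before the block can be erased.

  /* Hypothetical estimate of host<->device traffic when the host does the
   * compacting.  All sizes, rates and fractions are made-up examples.     */
  #include <stdio.h>

  int main(void)
  {
      double workload_mb_s  = 50.0;    /* real (user) write rate, MB/s */
      double erase_block_kb = 512.0;   /* example erase-block size     */
      double page_kb        = 4.0;     /* example write-page size      */

      printf("pages per erase block: %.0f\n\n", erase_block_kb / page_kb);

      /* Fraction of a victim erase block that is still valid data and
       * therefore has to be copied before the block can be erased.     */
      double valid_fraction[] = { 0.25, 0.50, 0.75, 0.90 };

      for (int i = 0; i < 4; i++) {
          double u = valid_fraction[i];

          /* Freeing 1 MB of space from blocks that are a fraction u valid
           * means copying u/(1-u) MB of live data.  Host-side copying
           * crosses the bus twice (read + write) per copied MB, on top of
           * the 1 MB of real user data.                                   */
          double bus_mb_s = workload_mb_s * (1.0 + 2.0 * u / (1.0 - u));

          printf("valid fraction %.2f -> %.0f MB/s on the bus (%.1fx the "
                 "%.0f MB/s workload)\n",
                 u, bus_mb_s, bus_mb_s / workload_mb_s, workload_mb_s);
      }
      return 0;
  }

With the made-up 0.90 valid fraction the bus would carry roughly 19 times the user write rate, which is the kind of multiplier meant above; a controller that compacts internally pays the same flash-side cost, but none of that copying crosses the host bus.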
---------

On 3 jan 2010, at 07.43, Erik Trimble wrote:

> I meant to say that I DON'T know how all MLC drives deal with erasure.

Again - yes, they do. (Or they would be write-once only. :-)

>> I'm pretty sure compacting doesn't occur in ANY SSDs without any OS
>> intervention (that is, the SSD itself doesn't do it), and I'd be surprised
>> to see an OS try to implement some sort of intra-page compaction - the
>> benefit doesn't seem to be there; it's better just to optimize writes than
>> try to compact existing pages. As far as reclaiming unused space, the TRIM
>> command is there to allow the SSD to mark a page Free for reuse, and an SSD
>> isn't going to be erasing a page unless it's right before something is to be
>> written to that page.
>
> My thinking of what compacting meant doesn't match up with what I'm seeing
> general usage in the SSD technical papers is, so in this respect, I'm wrong:
> compacting does occur, but only when there are no fully erased (or unused)
> pages available. Thus, compacting is done in the context of a write
> operation.

Exactly what triggers compacting, and when, is another issue, and that could probably change with firmware revisions. It is wise to do it earlier than when you get the write that didn't fit, since if you have some erased space you can then take bursts of writes up to that size quickly. But compacting takes bandwidth from the flash chips and wears them out, so you don't want to do it too early or too much. I guess this could be an interesting optimization problem, and optimal behavior probably depends on the workload too. Maybe it should be an adjustable knob.

---------

On 3 jan 2010, at 10.57, Eric D. Mudama wrote:

> On Sat, Jan 2 at 22:24, Erik Trimble wrote:
>> In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you
>> have a Page size of several multiples of that, 128k being common, but by no
>> means ubiquitous.
>
> I believe your terminology is crossed a bit. What you call a block is
> usually called a sector, and what you call a page is known as a block.
>
> Sector is (usually) the unit of reading from the NAND flash. ...

Indeed, and I am partly guilty of that mess, but I didn't want to change terminology in the middle of the discussion just to make it more flash-y. Maybe a mistake. :-)

---------

Now, *my* view of how a typical, modern flash SSD works is as an appendable cyclic log. You can append blocks to it, but no two blocks can have the same address (the new block would mask away the old one), and there is a maximum address (dependent on the size of the disk), so the log has a maximum length. This has, in my head, some resemblance to the txg appending zfs does.

On the inside, the flash SSD can't just rewrite new blocks to any free space, because of the way erasing works on large chunks, "erase blocks", in the flash chips of today. Therefore, it has to internally take "erase blocks" with freed space in them and move all the still-active blocks to the end of the log, to save them and compact them. It can then erase the "erase block" and reuse that area for new pages. This activity competes with the normal disk activities.

There are of course other issues too, like wear leveling, bad block handling and stuff. (A rough sketch of the log model above is appended below.)

/ragge
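To make the "appendable cyclic log" picture above concrete, here is a rough, hypothetical toy simulator of such a log. The sizes, the FIFO cleaning policy and the 50% spare area are all made up for illustration and not taken from any real drive; it only demonstrates the remap table, the masking of old copies on overwrite, and the internal compaction that copies live pages to the head of the log before an erase block is reused.

  /* Toy sketch of an "appendable cyclic log" flash translation layer.
   * Toy sizes, FIFO cleaning and 50% spare area; purely illustrative.   */
  #include <stdio.h>

  #define PAGES_PER_EB 4                        /* pages per erase block      */
  #define NUM_EB       8                        /* erase blocks on the "disk" */
  #define NUM_PAGES    (PAGES_PER_EB * NUM_EB)
  #define NUM_LBA      16                       /* capacity exported to host  */
  #define STALE        (-1)
  #define ERASED       (-2)

  static int  lba_to_page[NUM_LBA];   /* remap table: logical block -> page */
  static int  page_state[NUM_PAGES];  /* ERASED, STALE, or the owning LBA   */
  static int  head;                   /* next erased page in the log        */
  static int  tail_eb;                /* oldest erase block in the log      */
  static int  free_pages = NUM_PAGES;
  static long host_writes, flash_writes;

  static void program_page(int lba)
  {
      /* Append one page at the log head and update the remap table,
       * which masks any older copy of the same logical block.          */
      page_state[head] = lba;
      lba_to_page[lba] = head;
      head = (head + 1) % NUM_PAGES;
      free_pages--;
      flash_writes++;
  }

  static void clean_tail_erase_block(void)
  {
      /* Internal compaction: copy the still-live pages out of the oldest
       * erase block to the head of the log, then erase the block.  This
       * is the activity that competes with the normal disk traffic.      */
      for (int p = 0; p < PAGES_PER_EB; p++) {
          int page = tail_eb * PAGES_PER_EB + p;
          if (page_state[page] >= 0)
              program_page(page_state[page]);   /* still live: move it */
          page_state[page] = ERASED;
      }
      free_pages += PAGES_PER_EB;
      tail_eb = (tail_eb + 1) % NUM_EB;
  }

  static void host_write(int lba)
  {
      /* Keep a couple of spare erase blocks worth of erased room; this
       * threshold is the "when to compact" knob discussed above.        */
      while (free_pages <= 2 * PAGES_PER_EB)
          clean_tail_erase_block();

      if (lba_to_page[lba] >= 0)
          page_state[lba_to_page[lba]] = STALE; /* mask the old copy */
      program_page(lba);
      host_writes++;
  }

  int main(void)
  {
      for (int i = 0; i < NUM_LBA; i++)   lba_to_page[i] = -1;
      for (int i = 0; i < NUM_PAGES; i++) page_state[i] = ERASED;

      /* Skewed workload: seven writes to a "hot" logical block for every
       * write to a "cold" one.  Cold data stays live a long time and keeps
       * being copied by the compaction, so flash writes exceed host writes. */
      for (int i = 0; i < 10000; i++) {
          int lba = (i % 8 == 7) ? 8 + (i / 8) % 8 : i % 8;
          host_write(lba);
      }

      printf("host writes:  %ld pages\n", host_writes);
      printf("flash writes: %ld pages (write amplification %.2f)\n",
             flash_writes, (double)flash_writes / (double)host_writes);
      return 0;
  }

With this skewed workload the printed write amplification comes out above 1, since the cold data is repeatedly copied forward by the cleaner. The reserve threshold in host_write() illustrates the trade-off mentioned above: a larger reserve absorbs bigger bursts of writes, at the price of compacting earlier and more often.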