Eric D. Mudama did a very good job answering this, and I don't have
much to add. Thanks Eric!

On 3 jan 2010, at 07.24, Erik Trimble wrote:

> I think you're confusing erasing with writing.

I am now quite certain that it was actually you who was
confusing those. I hope this discussion has cleared things
up a little, though.

> What I'm describing is how ALL MLC-based SSDs work. SLC-based SSDs work 
> differently, but still have problems with what I'll call "excess-writing".

Eric already said it, but I need to say it myself too:
SLC and MLC disks can be almost identical; only the way the
bits are stored in the flash chips differs (1 or 2 bits per
storage cell). There is absolutely no other fundamental difference
between the two.

Hopefully no modern MLC *or* SLC disk works as you described,
since that would be a horrible design, and selling it would be
close to robbery. It would be slow, and it would wear out quickly.

Now, SLC disks are typically better overall, because those who
are willing to pay for SLC flash usually also pay for better
controllers, but otherwise those issues are really orthogonal.

> I'm not sure that SSDs actually _have_ to erase - they just overwrite 
> anything there with new data. But this is implementation dependent, so I can 
> say how /all/ MLC SSDs behave.

As Eric said - yes, you have to erase, otherwise you can't write
new data. It is not implementation dependent; it is inherent in
the flash technology. And, as has been said several times now,
erasing can only be done in large chunks, while writing can be done
in small chunks. I'd say that this is the main problem to handle
when designing a good flash SSD.
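The erase/write asymmetry can be sketched in a few lines. This is a toy
model, not any real chip; the page and block sizes are made-up but
typical-looking numbers:

```python
# Toy model of NAND flash granularity (hypothetical sizes).
# Programming (writing) happens per page; erasing only per whole erase block.
PAGE_SIZE = 4 * 1024            # assumed program (write) unit
PAGES_PER_BLOCK = 64            # assumed pages per erase block
ERASE_BLOCK_SIZE = PAGE_SIZE * PAGES_PER_BLOCK  # 256 KiB here

class EraseBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK  # None = erased

    def program(self, page_no, data):
        # Programming can only be done to an erased page; this is why
        # a page can't simply be overwritten in place.
        if self.pages[page_no] is not None:
            raise ValueError("page already programmed; erase whole block first")
        self.pages[page_no] = data

    def erase(self):
        # There is no per-page erase; the whole block goes at once.
        self.pages = [None] * PAGES_PER_BLOCK
```

The point is the mismatch in units: you write 4 KiB at a time but can
only reclaim space 256 KiB at a time, so live pages in a block must be
moved elsewhere before the block can be erased.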

> The whole point behind ZFS is that CPU cycles are cheap and available, much 
> more so than dedicated hardware of any sort. What I'm arguing here is that 
> the controller on an SSD is in the same boat as a dedicated RAID HBA -  in 
> the latter case, use a cheap HBA instead and let the CPU & ZFS do the work, 
> while in the former case, use a "dumb" controller for the SSD instead of a 
> smart one.

This could be true; I am still not sure. My main issues with this
are that it would make the file system code dependent on special
hardware behavior (that of today's flash chips), and that it could
be quite a lot of data to shuffle around when compacting. But
we'll see. If it could be made cheap enough, it could absolutely
happen and be worth it, even with some drawbacks.

> And, as I pointed out in another message, doing it my way doesn't increase 
> bus traffic that much over what is being done now, in any case.

Yes, it would increase bus traffic: if you handle the flash
compacting in the host - which you have to with your idea - it could
be many times the real workload bandwidth. But it could still be
worth it; that is quite possible.
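A rough back-of-the-envelope illustrates the multiplier. The numbers
below are hypothetical; the point is only that host-side compaction
makes relocated data cross the bus twice, while drive-side compaction
keeps it inside the SSD:

```python
# Rough write-amplification arithmetic (all numbers hypothetical).
host_writes_mb = 100.0      # real workload written by applications
valid_moved_mb = 150.0      # live data relocated while compacting

# Drive-side compaction: relocation stays inside the SSD,
# so the bus only sees the real workload.
bus_traffic_drive_gc = host_writes_mb

# Host-side compaction: every relocated megabyte is read over the
# bus and then written back over it.
bus_traffic_host_gc = host_writes_mb + 2 * valid_moved_mb

print(bus_traffic_host_gc / bus_traffic_drive_gc)  # 4.0x the workload here
```

With a fuller disk or a less friendly workload the amount of live data
moved per compaction pass grows, so the multiplier can get considerably
worse than this example.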

---------

On 3 jan 2010, at 07.43, Erik Trimble wrote:
> I meant to say that I DON'T know how all MLC drives deal with erasure.

Again - yes, they do. (Or they would be write-once only. :-)

>> I'm pretty sure compacting doesn't occur in ANY SSDs without any OS 
>> intervention (that is, the SSD itself doesn't do it), and I'd be surprised 
>> to see an OS try to implement some sort of intra-page compaction - there 
>> benefit doesn't seem to be there; it's better just to optimize writes than 
>> try to compact existing pages. As far as reclaiming unused space, the TRIM 
>> command is there to allow the SSD to mark a page Free for reuse, and an SSD 
>> isn't going to be erasing a page unless it's right before something is to be 
>> written to that page.
> My thinking of what compacting meant doesn't match up with what I'm seeing 
> general usage in the SSD technical papers is, so in this respect, I'm wrong:  
> compacting does occur, but only when there are no fully erased (or unused) 
> pages available.  Thus, compacting is done in the context of a write 
> operation.

Exactly what and when it is that triggers compacting is another
issue, and that could probably change with firmware revisions.

It is wise to do it earlier than the moment a write no longer
fits, since if you keep some erased space you can absorb bursts
of writes up to that size quickly. But compacting takes
bandwidth from the flash chips and wears them out, so you don't
want to do it too early or too much.

I guess this could be an interesting optimization problem, and
optimal behavior probably depends on the workload too. Maybe it
should be an adjustable knob.
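That knob could be as simple as a low-water mark on erased headroom.
A minimal sketch, with a made-up threshold and no claim about what any
real firmware does:

```python
# Sketch of an adjustable compaction trigger (threshold is hypothetical).
# Compacting early keeps erased blocks ready to absorb write bursts;
# compacting late saves flash bandwidth and wear. The low-water mark
# is the tunable trade-off between the two.
def should_compact(free_erase_blocks, total_erase_blocks, low_water=0.10):
    """Start background compaction when the fraction of erased blocks
    drops below low_water; a lazier policy would wait until a write
    actually fails to find an erased block."""
    return free_erase_blocks / total_erase_blocks < low_water

print(should_compact(4, 100))   # True: only 4% headroom left, compact now
print(should_compact(20, 100))  # False: plenty of erased blocks remain
```

An adaptive policy could raise the low-water mark for bursty workloads
and lower it for steady ones, which is roughly what "adjustable knob"
would mean in practice.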

---------

On 3 jan 2010, at 10.57, Eric D. Mudama wrote:

> On Sat, Jan  2 at 22:24, Erik Trimble wrote:
>> In MLC-style SSDs, you typically have a block size of 2k or 4k. However, you 
>> have a Page size of several multiples of that, 128k being common, but by no 
>> means ubiquitous.
> 
> I believe your terminology is crossed a bit.  What you call a block is
> usually called a sector, and what you call a page is known as a block.
> 
> Sector is (usually) the unit of reading from the NAND flash.
...

Indeed, and I am partly guilty of that mess, but I didn't want to
change terminology in the middle of the discussion just to make it
more flash-y. Maybe a mistake. :-)

---------

Now, *my* view of how a typical, modern flash SSD works is as an
appendable cyclic log. You can append blocks to it, but no two
blocks can have the same address (the new block would mask away
the old one), and there is a maximum address (dependent on the
size of the disk), so the log has a maximum length.

This has, in my head, some resemblance to the txg appending zfs
does.

On the inside, the flash SSD can't just rewrite new blocks to
any free space, because of the way erasing works on large
chunks, the "erase blocks" in today's flash chips. Therefore, it
has to internally take "erase blocks" with freed space in them,
move all still-active blocks to the end of the log to save and
compact them, and then erase the "erase block" and reuse
that area for new pages. This activity competes with the normal
disk activities.
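The cyclic-log view can be sketched in a few lines. This is a minimal
model of the idea, not any real drive's translation layer, and it
ignores the erase-block grouping for brevity:

```python
# Minimal sketch of the "appendable cyclic log" view of an SSD
# (hypothetical structure, not a real FTL). Writing a logical block
# appends a new entry; the old copy is merely masked. Compaction
# re-appends only the live entries, which is what frees space
# (erase blocks, on real flash) for reuse.
class LogSSD:
    def __init__(self):
        self.log = []       # append-only list of (lba, data) entries
        self.latest = {}    # lba -> index of the live entry in the log

    def write(self, lba, data):
        self.log.append((lba, data))
        self.latest[lba] = len(self.log) - 1   # old entry is now masked

    def read(self, lba):
        return self.log[self.latest[lba]][1]

    def compact(self):
        # Keep only live entries; masked (overwritten) ones are dropped.
        live = [(lba, self.log[i][1]) for lba, i in sorted(self.latest.items())]
        self.log = []
        self.latest = {}
        for lba, data in live:
            self.write(lba, data)
```

Overwriting the same address twice leaves a masked entry in the log;
after `compact()` the log shrinks to one entry per live address, which
mirrors the resemblance to ZFS's txg appending mentioned above.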

There are of course other issues too, like wear leveling,
bad block handling and such.

/ragge

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
