Paul B. Henson wrote:
On Sat, 6 Jun 2009, Richard Elling wrote:

The presumption is that you are using UFS for the CF, not ZFS.
UFS is not COW, so there is a potential endurance problem for
blocks which are known to be rewritten many times.  ZFS will not
have this problem, so if you use ZFS root, you are better served by
ignoring the previous "advice."

My understanding was that all modern CF cards incorporate wear leveling,
and I was interpreting the recommendation as trying to prevent wearing out
the entire card, not necessarily particular blocks.

Wear leveling is an attempt to solve the problem of multiple writes
to the same physical block.

of writes to the swap device.  For OpenSolaris (enterprise support
contracts now available!), which uses ZFS for swap, don't worry, be
happy.

As of U6, even luddite S10 users can avail themselves of ZFS for boot/swap/dump:

r...@ike ~ # uname -a
SunOS ike 5.10 Generic_138889-08 i86pc i386 i86pc

r...@ike ~ # swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/ospool/swap 181,2       8 8388600 8388600
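
If the default swap zvol ever turns out to be too small, resizing it is
straightforward (untested sketch; the 16G below is just a placeholder):

# swap -d /dev/zvol/dsk/ospool/swap
# zfs set volsize=16G ospool/swap
# swap -a /dev/zvol/dsk/ospool/swap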

Yes, and as you can see, my attempts to get the verbiage changed
have failed :-(

In short, if you use ZFS for root, ignore the warnings.

How about the lack of redundancy? Is the failure rate for CF so low that
there's no risk in running a critical server without a mirrored root pool?
And what about bit rot? Without redundancy, ZFS can only detect but not
correct read errors (unless, I suppose, configured with copies>1). Would it
really have cost that much more to include two CF slots?

The failure rate is much lower than that of disks, with the exception of
the endurance problem.  Flash memory is not susceptible to the bit rot
that plagues magnetic media.  Nor is flash memory susceptible to
the radiation-induced bit flips that plague DRAMs.

Or, to look at this another way, billions of consumer electronics
devices use a single flash "boot disk" and there don't seem to
be many people complaining that they aren't mirrored.  Indeed, even
if you have a mirrored OS on flash, you don't have a mirrored
OBP or BIOS (which is also on flash).  So the risk here is
significantly lower than with HDDs.
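
That said, if you want some protection against isolated media errors on a
single CF, ditto blocks are an option.  A sketch only (ospool is the pool
from the example above; copies only affects data written after the property
is set, and it won't help with whole-device failure):

# zfs set copies=2 ospool
# zfs get -r copies ospool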

5 GBytes seems pretty large for a slog, but yes, I think this is a good
idea.

What is the best formula to calculate slog size? I found a recent thread:

        http://jp.opensolaris.org/jive/thread.jspa?threadID=78758&tstart=1

in which a Sun engineer (presumably unofficially, of course ;) ) mentioned
10-18GB as more than sufficient. On the other hand:

        http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

says "A rule of thumb is that you should size the separate log to be able
to handle 10 seconds of your expected synchronous write workload. It would
be rare to need more than 100 MBytes in a separate log device, but the
separate log must be at least 64 MBytes."

This was a ROT when the default txg sync time was 5 seconds...
I'll update this soon because that is no longer the case.
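
To put numbers on that ROT (purely hypothetical figures): a workload
peaking at 30 MBytes/s of synchronous writes works out to

        10 s * 30 MBytes/s = 300 MBytes

of slog; a longer txg sync interval scales that up proportionally.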

Big gap between 100MB and 10-18GB. The first thread also mentioned in
passing that splitting up an SSD between slog and root pool might have
undesirable performance issues, although I don't think that was discussed
to resolution.

Yep, big gap.  This is why I wrote zilstat, so that you can see what your
workload might use before committing to a slog.  There may be a good
zilstat RFE here: I can see when the txg commits, so zilstat should be able
to collect per-txg rather than per-time-period.  Consider it added to my
todo list.
http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
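
From memory, it takes iostat-style interval and count arguments, so
something like this should give a minute of 10-second samples:

# ./zilstat 10 6

Then size the slog off the busiest interval you see, with some headroom.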

CFs designed for the professional photography market have better
specifications than CFs designed for the consumer market.

CF is pretty cheap; you can pick up 16GB-32GB for $80-$200 depending on
brand/quality. Assuming they do incorporate wear leveling, and considering
that even a fairly busy server isn't going to use up *that* much space (I
have a couple of E3000's still running which have 4GB disk mirrors for the
OS), if you get a decent CF card I suppose it would quite possibly outlast
the server.

I think the dig against CF is that they tend to have a low write speed
for small iops.  They are optimized for writing large files, like photos.

But I think I'd still rather have two 8-/. Show of hands, anybody with an
x4540 that's booting off non-redundant CF?

This is not an accurate statement.  Enterprise-class SSDs (e.g., STEC Zeus)
have DRAM write buffers.  The Flash Mini-DIMMs Sun uses also have DRAM
write buffers.  These offer very low write latency for slogs.

Yah, that misconception has already been pointed out to me offlist. I
actually came upon it in correspondence with you: I had asked about using a
slice of an SSD for a slog rather than the whole disk, and you mentioned
that the advice to use the whole disk rather than a slice applied only to
traditional spinning hard drives and didn't apply to SSDs. I thought that
was because of something to do with the write cache, but I guess I
misunderstood. I didn't save that message; perhaps you could be kind
enough to refresh my memory as to why slices of SSDs are OK while slices
of hard disks are best avoided?


In the enterprise-class SSDs, the DRAM buffer is nonvolatile.  In HDDs,
the DRAM buffer is volatile. HDDs will flush their DRAM buffer if you
give them the command to do so, which is what ZFS will do when it owns
the whole disk.  This "design feature" has been the cause of much confusion
over the years, though.
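
So carving the slog out of a slice is fine on such a device.  A sketch,
with tank and c1t2d0s0 as placeholders for your data pool and whatever
slice you set aside on the SSD:

# zpool add tank log c1t2d0s0
# zpool status tank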
-- richard
