Christian Auby wrote:
> On Wed, 8 Jul 2009, Moore, Joe wrote:
>> That's true for the worst case, but ZFS mitigates that somewhat by
>> batching I/O into a transaction group.  This means that I/O is done
>> every 30 seconds (or 5 seconds, depending on the version you're
>> running), allowing multiple writes to be written out together to the
>> disparate locations.
>
> I'd think that writing the same data two or three times is a much
> larger performance hit anyway.  Calculating 5% parity and writing it
> in addition to the stripe might be heaps faster.  I might try to do
> some tests on this.

Before you get too happy, you should look at the current constraints.
The minimum disk block size is 512 bytes for most disks, but there has
been talk in the industry of cranking this up to 2 or 4 kBytes.  Parity
can't occupy less than one sector, so for small files your 5% becomes
100%, and you might as well be happy now and set copies=2.  The largest
ZFS block size is 128 kBytes, so perhaps you
could do something with 5% overhead there, but you couldn't correct
very many bits with only 5%. How many bits do you need to correct?
I don't know... that is the big elephant in the room shaped like a question
mark.  Maybe zcksummon data will help us figure out what color the
elephant might be.
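
To make the sector rounding concrete, here is a rough back-of-the-envelope
sketch in Python.  It is only an illustration, not ZFS code; the 512-byte
sector and the round-parity-up-to-whole-sectors rule are the assumptions,
and the 5% figure is Christian's:

SECTOR = 512  # bytes; the industry may push this to 2048 or 4096

def effective_overhead(data_bytes, parity_fraction=0.05):
    # Round both data and parity up to whole sectors, since the disk
    # can't write anything smaller than a sector.
    data_sectors = -(-data_bytes // SECTOR)
    parity_bytes = max(1, int(data_bytes * parity_fraction))
    parity_sectors = -(-parity_bytes // SECTOR)
    return parity_sectors / data_sectors

for size in (512, 4096, 131072):
    print(f"{size:>7} bytes -> {effective_overhead(size):.1%} parity overhead")

# 512 bytes    -> 100.0%  (the "5%" still costs a whole sector)
# 4096 bytes   -> 12.5%   (one parity sector per eight data sectors)
# 131072 bytes -> 5.1%    (13 parity sectors per 256 data sectors)

So for a 512-byte file the nominal 5% is really 100%, which is exactly
the space that copies=2 costs you.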

If you were to implement something at the DMU layer, which is where
copies are implemented, then without major structural changes to the
blkptr you are restricted to 3 DVAs.  So the best you could do there is
50% overhead, which works out to 200% overhead for small files.
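
One way to read those numbers (my reading of the arithmetic, not a
description of the on-disk layout): with only 3 DVAs to play with, the
leanest scheme is data split across two DVAs plus parity on the third,
which is 50% overhead; a single-sector block can't be split, so surviving
the loss of any one DVA costs two extra sectors, the same as copies=3.
A rough sketch of that model, again illustrative Python only:

SECTOR = 512

def three_dva_overhead(block_bytes):
    # Hypothetical 2-data + 1-parity split across the 3 DVAs in a blkptr.
    data_sectors = -(-block_bytes // SECTOR)
    if data_sectors >= 2:
        # Parity has to cover the larger of the two data halves.
        parity_sectors = -(-data_sectors // 2)
        return parity_sectors / data_sectors   # approaches 50% for big blocks
    # A single-sector block can't be split; surviving the loss of any one
    # DVA needs two whole extra sectors, i.e. the same as copies=3.
    return 2.0

for size in (512, 131072):
    print(f"{size:>7} bytes -> {three_dva_overhead(size):.0%} overhead")

# 512 bytes    -> 200% overhead
# 131072 bytes -> 50% overhead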

If you were to implement at the SPA layer, then you might be able to
get back to a more consistently small overhead, but that would require
implementing a whole new vdev type, which means integration with
the installer, GRUB, and friends.  You would need to manage spatial
diversity, which might impact the allocation code in strange ways, but it
is surely possible.  The spatial diversity requirement means you basically
can't gain much by replacing a compressor with additional data redundancy,
though it might be an interesting proposal for the Summer of Code.

Or you could just do it in userland, like par2.

Bottom line: until you understand the failure modes you're trying
to survive, you can't make significant progress except by accident.
We know that redundant copies allow us to correct all bits for very
little performance impact, but they cost space.  Trying to win that space
back without sacrificing dependability will cost something else, most
likely performance.

NB, one nice thing about copies is that you can set it per file system.
For my laptop, I don't set copies for the OS, but I do for my home
directory.  This is a case where I trade off the dependability of
read-only data, which is available on CD or on the net, to gain a little
bit of space.  But I don't compromise on dependability for my data.
-- richard
