Hello List,

I recently got bitten by a "panic on `zpool import`" problem (same CR
6915314) while testing a ZFS file server. It seems the pool is pretty much
gone. I did try
- zfs:zfs_recover=1 and aok=1 in /etc/system
- `zpool import -fF -o ro`
to no avail. I don't think I will spend more time trying to fix it unless
someone has a good idea. I suspect bad data was written to the pool and
there seems to be no way to recover; IIRC, fmdump shows a problem with the
same block on all disks.
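For anyone wanting to reproduce the attempts above, this is roughly what was tried (a sketch only; the pool name `tank` is a placeholder, and the `/etc/system` settings need a reboot to take effect):

```shell
# /etc/system additions (reboot required):
#   set zfs:zfs_recover=1
#   set aok=1

# Forced (-f), rewinding (-F), read-only import of the damaged pool.
# "tank" is a placeholder pool name.
zpool import -fF -o ro tank
```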

The server hardware is pretty ghetto, built from whitebox components such as
non-ECC RAM (the cause of the pool loss). I know the hardware sucks, but
sometimes non-technical people don't understand the value of data before it
is lost... I was lucky the system had not been shipped yet, so the project
was "simply" delayed.

In light of this experience, I would say raidz does not help in certain
hardware failure scenarios: a bad bit in RAM at the wrong time and the
whole pool is lost.

Does the list have any ideas on how to make this kind of ghetto system more
resilient (short of buying ECC RAM and a motherboard that supports it)?

I was thinking something like this:
- pool1: raidz pool for the bulk data
- pool2: mirror pool for backing up the raidz pool, only imported while
copying pool1 to pool2
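The layout above might be set up along these lines (a sketch; device names and pool sizes are placeholders, not a recommendation from the list):

```shell
# pool1: raidz across four placeholder disks for bulk data.
zpool create pool1 raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0

# pool2: two-way mirror for backups, kept exported (offline)
# except during the backup window, to limit its exposure.
zpool create pool2 mirror c0t4d0 c0t5d0
zpool export pool2
```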

What would be the most reliable way to copy the data from pool1 to pool2,
keeping in mind "bad bit in RAM and everything is lost"? What worries me
most is also corrupting pool2 if pool1 has gone bad or a similar hardware
failure happens again. Or is this whole idea just added complexity with no
real benefit?
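One way to do the copy would be snapshots plus `zfs send`/`zfs receive`, which carries block checksums in the stream. A sketch, assuming a snapshot naming scheme of my own invention (the list may well have better ideas):

```shell
# Hedged sketch: back up pool1 into pool2 via a recursive snapshot
# and a replicated send stream. Snapshot label is a placeholder.
SNAP="backup-$(date +%Y%m%d)"

zpool import pool2
zfs snapshot -r pool1@"$SNAP"
# -R: replicated stream; -F: roll back pool2 to match; -u: don't mount
zfs send -R pool1@"$SNAP" | zfs receive -Fu pool2

# Verify what landed before trusting it, then take pool2 offline again.
zpool scrub pool2
zpool status pool2
zpool export pool2
```

Note the obvious caveat, which is exactly my worry: checksums are computed in RAM, so a bit flipped before the checksum is calculated gets stored as "valid" data, and neither the send stream nor the scrub will catch it.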


Regards,

Ville

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss