Let me apologize in advance for inter-mixing comments.


On 10/27/15 7:44 PM, Rainer Heilke wrote:

I am not trying to be a dick (it happens naturally), but if you can't
afford to back up terabytes of data, then you can't afford to have
terabytes of data.

That is a meaningless statement that reflects nothing in real-world terms.
The true cost of a byte of data that you care about is the money you pay for the initial storage, plus the money you pay to back it up. For work, my front-line databases have 64TB of mirrored net storage. The costs don't stop there. There is another 200TB of net storage dedicated to holding enough log data to rebuild the last 18 months from scratch. I also have two sets of slaves that snapshot themselves frequently. One set is a single disk, the other is raidz. These are not just backups. One set runs batch jobs, one runs the front-end portal, and the masters are in charge of data ingestion.

The slaves are useful backups for zpool corruption on the front end, but not necessarily for human error. For human error, say where someone destroys a table, the change replicates across all the slaves, and somehow it isn't noticed until all the snapshots are deleted, we have the logs. I have different kinds of backups taken at different intervals to handle different kinds of failures. Some are live, some are snapshots, and some are source data. You need to determine your level of risk tolerance. That might mean using zfs send/recv to two different zpools with the same or different protection levels.
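As a rough sketch of that (the pool and dataset names here are made up for illustration), the basic send/recv cycle to a second pool is just:

# snapshot the data you care about, then replicate it into the backup pool
zfs snapshot -r tank/db@20151027
zfs send -R tank/db@20151027 | zfs recv -Fdu backup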

If you don't back up, you set yourself up for unrecoverable problems. In four years of running high-transaction, high-throughput databases on ZFS I have had to rebuild pools from time to time for different reasons, but never for corruption. I have had other problems, like unbalanced write load across vdevs and metaslab fragmentation. My point is, don't underestimate the cost of maintaining a byte of data. You might need the backup one day, even with the protections that ZFS provides.

That said, instead of running mirrors, run loose disks and back up to the second pool at a frequency you are comfortable with. You need to prioritize your resources against your risk tolerance. It is tempting to do mirrors because it is sexy, but that might not be the best strategy.
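A sketch of that kind of periodic backup, again with made-up names: after a first full send, keep the second pool current with incrementals at whatever interval suits you.

# take a new snapshot and send only what changed since the last one
zfs snapshot tank/data@2015-10-28
zfs send -i tank/data@2015-10-27 tank/data@2015-10-28 | zfs recv -Fu backup/data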

This is just good stewardship of data you want to keep.

That's an arrogant statement, presuming that if a person doesn't have gobs of money, they shouldn't bother with computers at all.
I didn't write anything like that. What I am saying is that you need to get more creative about how to protect your data. Yes, money makes it easier, but you have options.

People who buy giant ass disks and then complain about how long it takes
to resilver a giant ass disk are out of their minds.

I am not complaining about the time it takes; I know full well how long it can take. I am complaining that the "resilvering" stops dead. (More on this below.)

This is trickier. I don't recall you saying it stops dead. I thought it was just "slow."

When the scrub is stopped dead, what does "iostat -nMxC 1" look like? Are there drives indicating 100% busy, or high wait or asvc_t times?

Do you have any controller errors? Does iostat -En report any errors?

Have you tried mounting the pool ro, stopping the scrub, and then copying data off?
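If not, the sequence I have in mind is roughly this (assuming the pool is called tank; adjust names to your setup):

zpool scrub -s tank                  # cancel the in-progress scrub
zpool export tank
zpool import -o readonly=on tank     # re-import the pool read-only
# then copy the data off to another disk or pool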

Here are some hail-mary settings that probably won't help. I offer them (in no particular order) to try to improve scrub performance, minimize the number of enqueued I/Os in case that is exacerbating the problem somehow, and limit the amount of time spent on a failing I/O. Your scrubs may be stopping because you have a disk that is exhibiting a poor failure mode, namely some sort of internal error where it just keeps retrying, which wedges the pool. WD is not the brand I go to for enterprise failure modes.

* don't spend more than 8 seconds on any single I/O
set sd:sd_io_time=8
* spend at least 5 seconds per txg on resilver I/O, with no delay between resilver I/Os
set zfs:zfs_resilver_min_time_ms = 5000
set zfs:zfs_resilver_delay = 0
* allow only 2 in-flight scrub/resilver I/Os per top-level vdev
set zfs:zfs_top_maxinflight = 2
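Those lines go in /etc/system, followed by a reboot. If you want to poke the ZFS ones into the running kernel first, mdb -kw can do it; treat the following as a sketch of the usual incantation rather than something I have verified on your build:

echo "zfs_top_maxinflight/W 0t2" | mdb -kw
echo "zfs_resilver_delay/W 0t0" | mdb -kw
echo "zfs_resilver_min_time_ms/W 0t5000" | mdb -kw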


Apply these settings and try to resilver again. If this doesn't work, dd the drives to new ones. Using dd will likely identify which drive is wedging ZFS, as it will either not complete or it will error out.
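Something along these lines (the device names are placeholders; point dd at the raw whole-disk nodes for your actual drives):

# copy the suspect disk onto a fresh one, skipping and padding unreadable blocks
dd if=/dev/rdsk/c2t0d0p0 of=/dev/rdsk/c2t4d0p0 bs=1024k conv=noerror,sync

conv=noerror,sync keeps dd moving past read errors, which is what you want when the goal is just to get a mostly intact copy onto a healthy drive.
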
I have no idea what happened to your system for you to lose three disks
simultaneously.

This was covered in a thread ages ago; the tech took days to find the problem, which was a CMOS battery that was on Death's door.

I am not sure who the tech is, but at least two people on this list told you to check the CMOS battery. I think Bob and I both recommended changing the battery. Others might have as well.

I just don't see you recovering from this scenario where you have
two bad drives trying to resilver from each other.

They aren't trying to resilver from each other. The dead disk is gone. The good disk is trying to resilver from the ether. Or some such. (Itself?) I added a third drive to the mirror in a vain attempt to get past the error saying there weren't enough remaining mirrors when I tried to zpool detach the now non-existent drive. Again, what is IT trying to resilver from? The same Twilight Zone the first disk is trying to resilver from?

I reviewed your output again. You have two disks in a mirror. Each disk is resilvering. This means the two disks are resilvering from each other. There are no other options as there are no other vdevs in the pool.

It seems to think that the one disk is fine, but the data isn't. ZFS is then locking the pool's I/O, not letting me clear up the damaged files (nor the pool). It's like there's a trapped loop between two parts of the ZFS code, but I refuse to believe Cantrill (and the many programmers since) didn't see this kind of problem.

ZFS suspects both disks could be dirty; that's why it is resilvering both drives. This puts the drives under heavy load. That load is probably surfacing an internal error on one or more drives, but because WD has crappy failure modes the drive is not sending the error to the OS. Internally, the drive keeps retrying with errors and the OS keeps waiting for the drive to return from the write flush. This is likely what is wedging the pool. The problem is likely on the drive -- but I can't say that with certainty. Certainty is a big word.

There is another option, which has the potential to make your data diverge across the two disks if you don't mount them read-only.

Basically, reboot the system with just one of the vdevs installed and mounted read-only. There will be nothing for the system to resilver. You should be able to copy your data off to another disk (or delete the corrupted files to restore send/recv functionality), if the disk you choose at random is working properly. If it is not working properly, wash-rinse-repeat with the other disk. Hopefully that one will work. If neither works, try changing cables or controllers, although there is a chance that both WD drives have failed. I have had bad luck with WD.
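A sketch of that, assuming the pool is called tank: shut down, pull one of the two mirror disks, boot, and then:

zpool import -o readonly=on tank     # may need -f if it complains the pool was not exported
zpool status tank                    # confirm it imported degraded and is not resilvering
# copy the data off, e.g. with rsync or cp, or zfs send to another pool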

If you are in the SF Bay Area and want to bring it by my office, I am happy to take a stab at it, after you back up your original disks (for liability reasons). I can provide a clean, working "textbook" system if needed; bring your system and drives and we can likely salvage it one way or another.

j.







