On 28/10/2015 1:47 PM, jason matthews wrote:
Let me apologize in advance for inter-mixing comments.
Ditto.
I am not trying to be a dick (it happens naturally), but if you can't
afford to back up terabytes of data, then you can't afford to have
terabytes of data.
That is a meaningless statement that reflects nothing in real-world
terms.
The true cost of a byte of data that you care about is the money you pay
for the initial storage, and then the money you pay to back it up. For
work, my front line databases have 64TB of mirrored net storage.
When you said "you," it implied (to me, at least) a home system, since
we're talking about a home system from the start. Certainly, if it is a
system that a company uses for its data, all of what you say is correct.
But a company, regardless of size, can write these expenses off.
Individuals cannot do that with their home systems. For them, this
paradigm is much more vague if it exists at all.
So, while I was talking apples, you were talking parsnips. My apologies
for not making that clearer. (All of that said, the DVD drive has been
acting up. Perhaps a writable Blu-Ray is in the wind. Since the price of
them has dropped further than the price of oil, that may make backups of
the more important data possible.)
The
costs don't stop there. There is another 200TB of net storage dedicated
to holding enough log data to rebuild the last 18 months from scratch. I
also have two sets of slaves that snapshot themselves frequently. One
set is a single disk, the other is raidz. These are not just backups.
One set runs batch jobs, one runs the front end portal, and the masters
are in charge of data ingestion.
Don't forget the costs added on by off-site storage, etc. I don't care
how many times the data is backed up, if it's all in the same building
that just burned to the ground... That is, unless your zfs sends are
going to a different site...
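(For what it's worth, sending off-site is conceptually simple. A minimal
sketch, with the host and dataset names entirely hypothetical:

zfs snapshot data@offsite-1
zfs send data@offsite-1 | ssh backuphost zfs recv -F tank/data-copy

with later runs using incremental sends, zfs send -i data@offsite-1
data@offsite-2, to keep the transfer small.)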
If you don't back up, you set yourself up for unrecoverable problems. In
<snip>
I believe this may be the first time (for me) that simply replacing a
failed drive resulted in data corruption in a zpool. I've certainly
never seen this level of mess before.
That said, instead of running mirrors, run loose disks and back up to the
second pool at a frequency you are comfortable with. You need to
prioritize your resources against your risk tolerance. It is tempting to
do mirrors because it is sexy, but that might not be the best strategy.
That is something for me to think about. (I don't do *anything* on
computers because it's "sexy.") I did mirrors for security (remember,
they hadn't failed for me at such a monumental level previously).
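If I do go the loose-disk route, I gather the periodic backup would be
something along these lines (a sketch only; the pool name "backup" and
the snapshot names are made up):

zfs snapshot -r data@weekly-1
zfs send -R data@weekly-1 | zfs recv -Fdu backup

and after that, incremental sends (zfs send -R -i data@weekly-1
data@weekly-2 | zfs recv -Fdu backup) at whatever frequency I settle on.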
That's an arrogant statement, presuming that if a person doesn't have
gobs of money, they shouldn't bother with computers at all.
I didn't write anything like that. What I am saying is you need to get
more creative about how to protect your data. Yes, money makes it easier,
but you have options.
My apologies; on its own, it came across that way.
I am not complaining about the time it takes; I know full well how
long it can take. I am complaining that the "resilvering" stops dead.
(More on this below.)
When the scrub is stopped dead, what does "iostat -nMxC 1" look like?
Are there drives indicating 100% busy, or high wait or asvc_t times?
sudo iostat -nMxC 1
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
   23.3   55.2    0.6    0.3  0.2  0.3    2.1    4.5   5  27 c3d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.2   0   0 c3d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    5.5   0   0 c6d1
  360.1   13.1   29.0    0.1  1.3  1.5    3.4    4.0  48  82 c6d0
    9.7  330.9    0.0   29.1  0.1  0.6    0.3    1.6   9  52 c7d1
  359.9  354.6   28.3   28.5 30.2  3.4   42.2    4.7  85  85 data
   23.2   34.9    0.6    0.3  6.2  0.3  106.9    5.6   6  12 rpool
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0  112.1    0.0    0.3  0.0  0.4    0.0    4.0   0  45 c3d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
   71.0   10.0    2.3    0.0  1.6  1.1   19.8   14.0  54  60 c6d0
   40.0   44.0    0.1    2.2  0.2  1.1    1.8   12.8  12  83 c7d1
  111.1   58.0    2.4    2.2 18.9  3.5  112.0   20.6  54  54 data
    0.0   58.0    0.0    0.3  0.0  0.0    0.0    0.6   0   3 rpool
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0  187.1    0.0    0.7  0.0  0.8    0.0    4.1   0  74 c3d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
  403.1    0.0   32.9    0.0  1.2  1.8    2.9    4.6  53  97 c6d0
   12.0  386.1    0.0   32.9  0.2  0.5    0.4    1.3  12  44 c7d1
  415.1  386.1   33.0   32.9 27.6  3.9   34.5    4.9 100 100 data
    0.0   98.0    0.0    0.7  0.0  0.1    0.0    1.1   0   8 rpool
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0   60.0    0.0    0.7  0.0  0.1    0.1    1.8   0  11 c3d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
  399.9    0.0   33.4    0.0  0.7  1.8    1.8    4.6  39  97 c6d0
    0.0  401.9    0.0   33.2  0.1  0.4    0.2    1.0   7  40 c7d1
  399.9  401.9   33.4   33.2 27.3  3.2   34.0    4.0 100 100 data
    0.0   58.0    0.0    0.7  0.4  0.0    7.1    0.6   3   3 rpool
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c6d1
  381.0    0.0   32.3    0.0  0.9  1.8    2.3    4.8  44  96 c6d0
    0.0  384.0    0.0   31.8  0.1  0.4    0.2    1.1   6  42 c7d1
  381.0  384.0   32.3   31.8 26.6  3.3   34.8    4.4 100 100 data
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool
So, something IS actually happening, it would seem.
Do you have any controller errors? Does iostat -En report any errors?
sudo iostat -En
c3d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3500514NS Revision: Serial No: 9WJ Size: 500.10GB <500101152768 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c3d1   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3000NC000 Revision: Serial No: Z1F Size: 3000.59GB <3000590401536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c6d1   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST4000DM000-1F2 Revision: Serial No: S30 Size: 4000.74GB <4000743161856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c6d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST32000542AS Revision: Serial No: 5XW Size: 2000.37GB <2000371580928 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c7d1   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Model: ST3000DM001-1ER Revision: Serial No: Z7P Size: 3000.59GB <3000590401536 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0
c4t0d0 Soft Errors: 0 Hard Errors: 6 Transport Errors: 0
Vendor: TSSTcorp Product: CDDVDW SH-S222A Revision: SB02 Serial No: Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 6 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
Have you tried mounting the pool ro, stopping the scrub, and then
copying data off?
It *seems* to be ro now, but I can't be sure. I did:
sudo zfs set readonly=on data
It paused for a few seconds, then gave me a prompt back. It didn't spit
out any errors. But any of my further commands, like trying a copy, froze.
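(If the pool will export at all, I gather the more thorough approach is
to re-import the whole pool read-only, rather than setting the dataset
property; something like:

zpool export data
zpool import -o readonly=on data

though with commands freezing, the export itself may well hang.)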
Here are some hail-mary settings that probably won't help. I offer them
(in no particular order) to try to improve scrub performance,
minimize the number of enqueued I/Os in case that is exacerbating the
problem somehow, and limit the amount of time spent on a
failing I/O. Your scrubs may be stopping because you have a disk that is
exhibiting a poor failure mode. Namely, some sort of internal error where
it just keeps retrying, which makes the pool wedge. WD is not the brand I
go to for enterprise failure modes.
Trust me, I haven't let WD drives anywhere near my computers for quite
some time. (Ironically enough, it was WD drives that, in the early days
of this system, showed me how resilient zfs mirrors were.) They were
replaced by Seagate or Hitachi drives, or whatever they had that wasn't
WD. I'd rather have trolls hand-chiseling the data into rocks.
* don't spend more than 8 seconds on any single I/O
set sd:sd_io_time=8
* resilver in 5 second intervals minimum
set zfs:zfs_resilver_min_time_ms = 5000
set zfs:zfs_resilver_delay = 0
* enqueue only 2 I/Os
set zfs:zfs_top_maxinflight = 2
Thanks, but these all fail with the "pool I/O is currently suspended"
error.
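(As I understand it, those set lines are /etc/system entries that take
effect at boot; poking the ZFS ones into the live kernel would presumably
be something like

echo "zfs_resilver_delay/W 0" | mdb -kw

but with pool I/O already suspended, I doubt either route gets very far
right now.)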
Apply these settings and try to resilver again. If this doesn't work, dd
the drives to new ones. Using dd will likely identify which drive is
wedging ZFS, as it will either not complete or it will error out.
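[A dd pass of that sort might look something like the following, run per
drive; the target device name here is hypothetical:

dd if=/dev/rdsk/c6d0p0 of=/dev/rdsk/cXdYp0 bs=1024k conv=noerror,sync

A drive that errors out or never completes is the likely culprit.]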
I am not sure who the tech is, but at least two people on this list told
you to check the CMOS battery. I think Bob and I both recommended changing
the battery. Others might have as well.
The tech works for the company that sold the system to me, and holds the
warranty for it. I no longer have the ability or hardware/tools to
change the CMOS battery myself. That, plus the system being under
warranty... I did mention this to the guy I was dealing with, but I got
the distinct feeling that either: a) he never told the tech, or b) the
tech didn't believe it. IIRC, when I _finally_ talked to the tech, he
sounded quite surprised that the battery was dying, and that it
affected the system so badly. This is one of the joys of doing this
stuff over the phone.
I may be wrong, though. There has been so much crappola happening in the
last year... :-(
I reviewed your output again. You have two disks in a mirror. Each disk
is resilvering. This means the two disks are resilvering from each
other. There are no other options as there are no other vdevs in the
pool.
I can see the new disk resilvering from the old one, but why did ZFS
start resilvering the old disk from the new one? Shouldn't it have spat
out a "corrupt data" error and forced me to scrub? This odd state is how
the system booted up. I would obviously have lost some data, but after
the scrub, the pool would at least have been functional enough to do a
replace (and then a resilver).
ZFS suspects both disks could be dirty,
I can go along with that. Some data on the original drive is corrupt,
and the new disk doesn't have any data (a special form of "corrupt").
that's why it is resilvering
both drives. This puts the drives under heavy load. This load is
probably surfacing an internal error on one or more drives, but because
WD has crappy failure modes it is not sending the error to the OS.
Internally, the drive keeps retrying with errors and the OS keeps
waiting for the drive to return from the write flush. This is likely what
is wedging the pool. The problem is likely on the drive -- but I can't say
that with certainty. Certainty is a big word.
Again, no WD cow patties.
There is another option, which has the potential to make your data
diverge across the two disks if you don't mount them read-only.
Basically, reboot the system with just one of the vdevs installed and
mounted read-only.
I'm going to try this route, and see what I can get it to do. So far, it
hasn't locked up on a command. There's something curious, though. When I
try a zpool status -v, it tells me:
errors: List of errors unavailable (insufficient privileges)
It gives me that when running under my user ID, doing it via sudo, and
even when I su to root. The *first* time I ran it (using sudo), it
told me there were 50056 data errors. Every time I run it again, that
message is not given; it appeared only the very first time after boot-up.
If you are in the SF bay area and want to bring it by my office, I am
happy to take a stab at it, after you back up your original disks (for
liability reasons). I can provide a clean, working "text book" system if
needed; bring your system and drives and we can likely salvage it one
way or another.
Thank you very much for the offer, but I'm a couple thousand miles or so
north of you.
I have noticed one thing, though; the resilvering numbers _are_ actually
increasing now. Since the original disk (all others disconnected) is
actually showing a change since yesterday, I'm going to pack it in for
the night, and see where the count has gotten to tomorrow evening. It
says: "5h11m to go", but I strongly suspect it will be longer. I'll make
a note of where it's at right now.
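In case it helps, I'll log the progress overnight with something like
this rough loop (it assumes the usual "X scanned out of Y" line in the
zpool status output):

while true; do date; zpool status data | grep scanned; sleep 3600; done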
Rainer