Hello Andy,
Thank you very much for your lengthy and very informative answer.
After some investigation, I discovered that it was /dev/sdc that had
some problems, so I took it out of the RAID 1 array. But this didn't
really help, since I got more freezes:
grep "120 seconds" kern.log
Feb 18 16:16:38 box kernel: [30209.474017] INFO: task md1_raid1:467 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474151] INFO: task md0_raid1:470 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474250] INFO: task jbd2/md0-8:982 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474447] INFO: task jbd2/md1-8:988 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474721] INFO: task configmgrWriter:26206 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.474944] INFO: task kworker/u56:1:25006 blocked for more than 120 seconds.
Feb 18 16:16:38 box kernel: [30209.475150] INFO: task kworker/u56:2:26207 blocked for more than 120 seconds.
Feb 18 16:18:39 box kernel: [30330.307956] INFO: task md1_raid1:467 blocked for more than 120 seconds.
Feb 18 16:18:39 box kernel: [30330.308088] INFO: task md0_raid1:470 blocked for more than 120 seconds.
Feb 18 16:18:39 box kernel: [30330.308188] INFO: task jbd2/md0-8:982 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.751926] INFO: task md0_raid1:412 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752059] INFO: task md1_raid1:416 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752158] INFO: task jbd2/md1-8:988 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752348] INFO: task jbd2/md0-8:993 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752513] INFO: task uptimed:1174 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752743] INFO: task fetchmail:3121 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.752990] INFO: task offlineimap:4247 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.753195] INFO: task kworker/u56:0:10116 blocked for more than 120 seconds.
Feb 19 11:03:22 box kernel: [ 8217.753390] INFO: task kworker/u56:2:11869 blocked for more than 120 seconds.
Feb 19 11:05:22 box kernel: [ 8338.585502] INFO: task md0_raid1:412 blocked for more than 120 seconds.
On Fri, Feb 15, 2019 at 09:35:27AM +0100, steve wrote:
>for i in /dev/sd{b..f}; do echo "DISK: ${i}"; smartctl -l scterc "${i}"; sleep 3; done
I get this for sdb and sdc
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
and this for sdf
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
What does it tell me?
It means that sd[bc] may support SCTERC but it's disabled (promising),
and sdf does support it and it's set to 7 seconds (good).
For disks in Linux software RAID, SCTERC with a low timeout is
essential. If it's not possible then the block layer timeout for the
device should be increased.
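
The three cases above (timeout set, supported but disabled, no usable report) can be told apart mechanically from the text that `smartctl -l scterc` prints. The following is only a rough sketch: `classify_scterc` is a made-up helper name, and it assumes the output has the "Read:" / "Write:" layout shown in this thread.

```shell
#!/bin/sh
# classify_scterc: hypothetical helper, not part of smartmontools.
# Reads the text printed by `smartctl -l scterc <dev>` on stdin and prints:
#   enabled   - Read and Write both report a timeout value
#   disabled  - at least one of them is reported as "Disabled"
#   unknown   - no Read:/Write: pair was found (e.g. the command failed)
classify_scterc() {
    awk '
        $1 == "Read:" || $1 == "Write:" {
            seen++
            if ($2 == "Disabled") disabled++
        }
        END {
            if (seen < 2)          print "unknown"
            else if (disabled > 0) print "disabled"
            else                   print "enabled"
        }'
}
```

It would be used as `smartctl -l scterc /dev/sdb | classify_scterc`. A result of "disabled" suggests trying to set SCTERC; "unknown" suggests falling back to a longer block layer timeout.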
You should try to set SCTERC for sd[bc] like so:
# for dev in /dev/sd[bc]; do smartctl -l scterc,70,70 "$dev"; done
I tried this:
smartctl -l scterc,70,70 /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control set to:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
But then
smartctl -l scterc /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.1-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Unexpected SCT status 0x0046 (action_code=3, function_code=2)
SCT (Get) Error Recovery Control command failed
Which is weird…
If that works then great - all your drives support SCTERC and have low
timeouts.
If setting it to 70 (centiseconds, so 7 seconds) doesn't work then you
will need to increase the block layer timeout like this:
cat /sys/block/sdb/device/timeout
30
echo 180 > /sys/block/sdb/device/timeout
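
If several disks need the longer timeout, the same write can be wrapped in a small function. This is a sketch only: `set_block_timeout` and the `SYSFS_BLOCK` override are names invented here (the override exists so the function can be tried against a scratch directory), and the real path is assumed to be /sys/block/<disk>/device/timeout as above.

```shell
#!/bin/sh
# set_block_timeout SECONDS DISK...
# Hypothetical helper: writes SECONDS to <root>/<disk>/device/timeout for
# each named disk and prints the old and new value. <root> defaults to
# /sys/block; SYSFS_BLOCK is an invented override for dry runs.
# Returns non-zero if any timeout file could not be written.
set_block_timeout() {
    secs=$1
    shift
    root=${SYSFS_BLOCK:-/sys/block}
    rc=0
    for disk in "$@"; do
        f="$root/$disk/device/timeout"
        if [ -w "$f" ]; then
            printf '%s: %s -> %s\n' "$disk" "$(cat "$f")" "$secs"
            printf '%s\n' "$secs" > "$f"
        else
            printf '%s: cannot write %s\n' "$disk" "$f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Run as root, for example `set_block_timeout 180 sdb sdc`. Values written under /sys do not survive a reboot, so the call would have to be repeated at boot (for instance from a startup script or a udev rule).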
Let's see if it helps.
This is a field I don't know well at all, so I am just following your advice.
Will let you know.
Thank you
Best,
Steve