I backported the hack from 9.0.17-1. Under the same conditions the node
still floods the log with "al_complete_io()" and "LOGIC BUG" messages,
but at least it no longer crashes.
Ing. Evzen Demcenko
Senior Linux Administrator
Cluster Design s.r.o.
On 5/13/19 6:43 PM, Evzen Demcenko wrote:
If a DRBD node in a primary-primary setup loses its disk for any reason
(faulty disk, controller, etc., or even a manual detach) while there is
R/W activity on both nodes, the second node eventually crashes with the
kernel panic "kernel BUG at
/root/rpmbuild/BUILD/drbd-8.4.11-1/drbd/lru_cache.c:570!"
Before the crash there are a lot of messages in kernel.log on the good
node (the one with the attached disk) like
block drbd1: al_complete_io() called on inactive extent 57
block drbd1: LOGIC BUG for enr=74
Eventually the "good" node crashes within minutes or hours, depending
on disk activity, leaving the cluster without any data.
I tested different versions from the 8.4 tree (8.4.6, 8.4.7-1, 8.4.9-1,
8.4.11-1), always with the same result.
There is also no difference between real hardware and virtualized
machines.
The kernel is 2.6.32-754.12.1.el6.x86_64 on CentOS 6.10 with the latest
updates. I also tested other kernels, with the same outcome.
A vmcore-dmesg.txt is attached to this email. The vmcore itself is
available for every tested version as well; the core files are 35-50 MB,
so I can't attach them to the email, but I'll be glad to share them some
other way if needed.
[root@drtest-11 ~]# cat /etc/drbd.d/global_common.conf
global {
usage-count yes;
}
common {
protocol C;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh";
}
startup {
# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
}
disk {
resync-rate 100M;
on-io-error detach;
al-extents 1447;
c-plan-ahead 32;
c-max-rate 1000M;
c-min-rate 80M;
c-fill-target 65536k;
}
net {
sndbuf-size 4096k;
rcvbuf-size 4096k;
timeout 100; # 10 seconds (unit = 0.1 seconds)
connect-int 15; # 15 seconds (unit = 1 second)
ping-int 15; # 15 seconds (unit = 1 second)
ping-timeout 50; # 5000 ms (unit = 0.1 seconds)
max-buffers 131072;
max-epoch-size 20000;
ko-count 0;
after-sb-0pri discard-younger-primary;
after-sb-1pri consensus;
after-sb-2pri disconnect;
rr-conflict disconnect;
}
}
[root@drtest-11 ~]# cat /etc/drbd.d/r1.res
resource r1 {
net {
protocol C;
allow-two-primaries;
verify-alg crc32c;
csums-alg crc32c;
}
startup {
become-primary-on both;
}
disk {
disk-timeout 1200;
}
on drtest-11.uvt.internal {
device /dev/drbd1;
disk "/dev/vdb";
address 10.0.11.201:7790;
flexible-meta-disk internal;
}
on drtest-12.uvt.internal {
device /dev/drbd1;
disk "/dev/vdb";
address 10.0.11.202:7790;
flexible-meta-disk internal;
}
}
How to reproduce:
After create-md, connect, primary etc.:
On drtest-11:
pvcreate /dev/drbd1
vgcreate test /dev/drbd1
lvcreate -n t1 -L20g test
lvcreate -n t2 -L20g test
mkfs.ext4 /dev/test/t1
mount /dev/test/t1 /mnt/t1
mkdir -m 777 /mnt/t1/test
while true ; do bonnie++ -u nobody -d /mnt/t1/test/ -n 8192 -s8192 ; done
On drtest-12:
vgchange -aly
mkfs.ext4 /dev/test/t2
mount /dev/test/t2 /mnt/t2
mkdir -m 777 /mnt/t2/test
while true ; do bonnie++ -u nobody -d /mnt/t2/test/ -n 8192 -s8192 ; done
drbdadm detach r1
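To confirm that the detach actually took effect before watching the
logs, the disk state can be read from /proc/drbd. The helper below is
only a sketch: the function name is made up for illustration, and it
assumes the usual 8.4 status line layout (" 1: cs:... ro:... ds:...").

```shell
# Print the "ds:" (local/peer disk state) field for one DRBD minor.
# Reads /proc/drbd-style text from stdin; the function name is hypothetical.
drbd_dstate_from_proc() {
    minor=$1
    # Assumed status line layout:
    #  1: cs:Connected ro:Primary/Primary ds:Diskless/UpToDate C r-----
    awk -v m="$minor:" '
        $1 == m {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ds:/) { sub(/^ds:/, "", $i); print $i }
        }'
}
```

On drtest-12 this would be called as "drbd_dstate_from_proc 1 < /proc/drbd";
the local state is the part before the slash. "drbdadm dstate r1" should
report the same information directly.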
After the detach on drtest-12, drtest-11 almost instantly starts
flooding the log with "al_complete_io() called on inactive extent"
and "LOGIC BUG for enr=" messages and crashes within a couple of minutes.
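To put a number on the flood when comparing versions, the two messages
can be counted in the kernel log. This is a rough sketch; the function
name and the output format are made up for illustration, and the log
path is whatever syslog is configured to use on the node.

```shell
# Count the two DRBD activity-log messages in a kernel log file.
# Usage: count_al_errors LOGFILE   (function name is hypothetical)
count_al_errors() {
    logfile=$1
    # -F matches the strings literally, -c prints the number of matching
    # lines. grep exits non-zero when nothing matches, hence "|| true".
    inactive=$(grep -cF 'al_complete_io() called on inactive extent' "$logfile" || true)
    logic=$(grep -cF 'LOGIC BUG for enr=' "$logfile" || true)
    echo "inactive_extent=$inactive logic_bug=$logic"
}
```

For example, "count_al_errors /var/log/messages" on drtest-11 (assuming
kernel messages end up in that file) prints both counts on one line.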
Thanks in advance for investigating this issue.
Sincerely,
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
diff -p -r -U3 old/drbd/drbd_actlog.c new/drbd/drbd_actlog.c
--- old/drbd/drbd_actlog.c 2018-04-10 13:38:15.000000000 +0200
+++ new/drbd/drbd_actlog.c 2019-05-14 18:24:17.000000000 +0200
@@ -593,7 +593,7 @@ void drbd_al_complete_io(struct drbd_dev
for (enr = first; enr <= last; enr++) {
extent = lc_find(device->act_log, enr);
- if (!extent) {
+ if (!extent || extent->refcnt == 0) {
drbd_err(device, "al_complete_io() called on inactive extent %u\n", enr);
continue;
}