I backported the hack from 9.0.17-1; under the same conditions the node still floods the log with "al_complete_io()" and "LOGIC BUG" messages, but at least it no longer crashes.

Ing. Evzen Demcenko
Senior Linux Administrator
Cluster Design s.r.o.

On 5/13/19 6:43 PM, Evzen Demcenko wrote:
If a DRBD node in a primary-primary setup loses its disk for any reason (faulty disk, controller, etc., or even a manual detach) and there is read/write activity on both nodes, the second node eventually crashes with a kernel panic: "kernel BUG at /root/rpmbuild/BUILD/drbd-8.4.11-1/drbd/lru_cache.c:570!" Before the crash, the kernel log on the good node (the one with the attached disk) fills with messages like

block drbd1: al_complete_io() called on inactive extent 57
block drbd1: LOGIC BUG for enr=74

Eventually the "good" node crashes within minutes or hours, depending on disk activity, leaving the cluster without any accessible data. I tested different versions from the 8.4 tree (8.4.6, 8.4.7-1, 8.4.9-1, 8.4.11-1), always with the same result.
There is also no difference between real hardware and virtualized machines.
The kernel is 2.6.32-754.12.1.el6.x86_64 on CentOS 6.10 with the latest updates. I also tested other kernels, with the same outcome. A vmcore-dmesg.txt is attached to this email; the vmcore itself is available for every tested version as well. The core files are 35-50 MB, so I can't attach them, but I'll gladly share them some other way if needed.

[root@drtest-11 ~]# cat /etc/drbd.d/global_common.conf
global {
        usage-count yes;
}

common {
        protocol C;

        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh";
        }

        startup {
                # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
        }

        disk {
                resync-rate 100M;
                on-io-error detach;
                al-extents 1447;
                c-plan-ahead 32;
                c-max-rate 1000M;
                c-min-rate 80M;
                c-fill-target 65536k;
        }

        net {
                sndbuf-size 4096k;
                rcvbuf-size 4096k;
                timeout       100;    # 10 seconds   (unit = 0.1 seconds)
                connect-int   15;     # 15 seconds   (unit = 1 second)
                ping-int      15;     # 15 seconds   (unit = 1 second)
                ping-timeout  50;     # 5000 ms      (unit = 0.1 seconds)
                max-buffers     131072;
                max-epoch-size  20000;
                ko-count 0;
                after-sb-0pri discard-younger-primary;
                after-sb-1pri consensus;
                after-sb-2pri disconnect;
                rr-conflict disconnect;
        }
}

[root@drtest-11 ~]# cat /etc/drbd.d/r1.res
resource r1 {
    net {
        protocol C;
        allow-two-primaries;
        verify-alg crc32c;
        csums-alg crc32c;
    }
    startup {
        become-primary-on both;
    }
  disk {
      disk-timeout 1200;
  }
  on drtest-11.uvt.internal {
        device      /dev/drbd1;
        disk        "/dev/vdb";
        address     10.0.11.201:7790;
        flexible-meta-disk internal;
    }
  on drtest-12.uvt.internal {
        device      /dev/drbd1;
        disk        "/dev/vdb";
        address     10.0.11.202:7790;
        flexible-meta-disk internal;
    }
}

How to reproduce:
After create-md, connect, primary, etc.:

On drtest-11:
pvcreate /dev/drbd1
vgcreate test /dev/drbd1
lvcreate -n t1 -L20g test
lvcreate -n t2 -L20g test
mkfs.ext4 /dev/test/t1
mount /dev/test/t1 /mnt/t1
mkdir -m 777 /mnt/t1/test
while true ; do bonnie++ -u nobody -d /mnt/t1/test/ -n 8192 -s8192 ; done

On drtest-12:
vgchange -aly
mkfs.ext4 /dev/test/t2
mount /dev/test/t2 /mnt/t2
mkdir -m 777 /mnt/t2/test
while true ; do bonnie++ -u nobody -d /mnt/t2/test/ -n 8192 -s8192 ; done
drbdadm detach r1

After the detach on drtest-12, drtest-11 almost instantly starts flooding the log with "al_complete_io() called on inactive extent" and "LOGIC BUG for enr=" messages, and crashes within a couple of minutes.

Thanks in advance for investigating this issue.
Sincerely,


_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user

diff -p -r -U3 old/drbd/drbd_actlog.c new/drbd/drbd_actlog.c
--- old/drbd/drbd_actlog.c	2018-04-10 13:38:15.000000000 +0200
+++ new/drbd/drbd_actlog.c	2019-05-14 18:24:17.000000000 +0200
@@ -593,7 +593,7 @@ void drbd_al_complete_io(struct drbd_dev
 
 	for (enr = first; enr <= last; enr++) {
 		extent = lc_find(device->act_log, enr);
-		if (!extent) {
+		if (!extent || extent->refcnt == 0) {
 			drbd_err(device, "al_complete_io() called on inactive extent %u\n", enr);
 			continue;
 		}