Hi,
today I noticed a problem on my two Heartbeat / DRBD servers.
On each server there are two primary DRBD devices:
on th-dus-mqm:
drbd0 / drbd2
on th-fra-mqm:
drbd1 / drbd3
If th-dus-mqm fails, drbd0 and drbd2 fail over to th-fra-mqm. That normally
works fine.
Today I tried to stop heartbeat manually on both servers for testing:
/etc/init.d/heartbeat stop
Then I noticed these errors in /var/log/ha-log (on both servers):
---
heartbeat[2834]: 2009/01/12_22:35:08 info: Heartbeat shutdown in progress.
(2834)
heartbeat[4630]: 2009/01/12_22:35:08 info: Giving up all HA resources.
ResourceManager[4643]: 2009/01/12_22:35:08 info: Releasing resource group:
th-dus-mqm 10.10.121.130 92.254.37.53 drbddisk::drbd0
Filesystem::/dev/drbd0::/dus::ext3 drbddisk::drbd2
Filesystem::/dev/drbd2::/home/tbmx/dus::ext3 mqm_dus
ResourceManager[4643]: 2009/01/12_22:35:08 info: Running
/etc/ha.d/resource.d/mqm_dus stop
ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd2 /home/tbmx/dus ext3 stop
Filesystem[5005]: 2009/01/12_22:35:09 INFO: Running stop for /dev/drbd2
on /home/tbmx/dus
Filesystem[4994]: 2009/01/12_22:35:09 INFO: Success
ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
/etc/ha.d/resource.d/drbddisk drbd2 stop
ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /dus ext3 stop
Filesystem[5107]: 2009/01/12_22:35:09 INFO: Running stop for /dev/drbd0
on /dus
Filesystem[5096]: 2009/01/12_22:35:09 INFO: Success
ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
/etc/ha.d/resource.d/drbddisk drbd0 stop
ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
/etc/ha.d/resource.d/IPaddr 92.254.37.53 stop
IPaddr[5200]: 2009/01/12_22:35:09 INFO: Success
ResourceManager[4643]: 2009/01/12_22:35:09 info: Running
/etc/ha.d/resource.d/IPaddr 10.10.121.130 stop
IPaddr[5258]: 2009/01/12_22:35:09 INFO: Success
ResourceManager[5295]: 2009/01/12_22:35:09 info: Releasing resource group:
th-fra-mqm 10.10.121.131 92.254.37.54 drbddisk::drbd1
Filesystem::/dev/drbd1::/fra::ext3 drbddisk::drbd3
Filesystem::/dev/drbd3::/home/tbmx/fra::ext3 mqm_fra
ResourceManager[5295]: 2009/01/12_22:35:09 info: Running
/etc/ha.d/resource.d/mqm_fra stop
ResourceManager[5295]: 2009/01/12_22:35:15 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd3 /home/tbmx/fra ext3 stop
Filesystem[5553]: 2009/01/12_22:35:15 INFO: Running stop for /dev/drbd3
on /home/tbmx/fra
Filesystem[5553]: 2009/01/12_22:35:15 INFO: Trying to unmount
/home/tbmx/fra
Filesystem[5553]: 2009/01/12_22:35:15 INFO: unmounted /home/tbmx/fra
successfully
Filesystem[5542]: 2009/01/12_22:35:15 INFO: Success
ResourceManager[5295]: 2009/01/12_22:35:15 info: Running
/etc/ha.d/resource.d/drbddisk drbd3 stop
ResourceManager[5295]: 2009/01/12_22:35:15 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 /fra ext3 stop
Filesystem[5671]: 2009/01/12_22:35:15 INFO: Running stop for /dev/drbd1
on /fra
Filesystem[5671]: 2009/01/12_22:35:15 INFO: Trying to unmount /fra
Filesystem[5671]: 2009/01/12_22:35:15 ERROR: Couldn't unmount /fra;
trying cleanup with SIGTERM
Filesystem[5671]: 2009/01/12_22:35:15 INFO: Some processes on /fra were
signalled
Filesystem[5671]: 2009/01/12_22:35:16 ERROR: Couldn't unmount /fra;
trying cleanup with SIGTERM
Filesystem[5671]: 2009/01/12_22:35:16 INFO: Some processes on /fra were
signalled
Filesystem[5671]: 2009/01/12_22:35:17 ERROR: Couldn't unmount /fra;
trying cleanup with SIGTERM
Filesystem[5671]: 2009/01/12_22:35:17 INFO: Some processes on /fra were
signalled
Filesystem[5671]: 2009/01/12_22:35:18 ERROR: Couldn't unmount /fra;
trying cleanup with SIGKILL
Filesystem[5671]: 2009/01/12_22:35:18 INFO: Some processes on /fra were
signalled
Filesystem[5671]: 2009/01/12_22:35:19 ERROR: Couldn't unmount /fra;
trying cleanup with SIGKILL
Filesystem[5671]: 2009/01/12_22:35:20 INFO: No processes on /fra were
signalled
Filesystem[5671]: 2009/01/12_22:35:21 ERROR: Couldn't unmount /fra,
giving up!
Filesystem[5660]: 2009/01/12_22:35:21 ERROR: Generic error
ResourceManager[5295]: 2009/01/12_22:35:21 ERROR: Return code 1 from
/etc/ha.d/resource.d/Filesystem
ResourceManager[5295]: 2009/01/12_22:35:22 info: Retrying failed stop
operation [Filesystem::/dev/drbd1::/fra::ext3]
ResourceManager[5295]: 2009/01/12_22:35:22 info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd1 /fra ext3 stop
Filesystem[5839]: 2009/01/12_22:35:22 INFO: Running stop for /dev/drbd1
on /fra
Filesystem[5839]: 2009/01/12_22:35:22 INFO: Trying to unmount /fra
Filesystem[5839]: 2009/01/12_22:35:22 ERROR: Couldn't unmount /fra;
trying cleanup with SIGTERM
Filesystem[5839]: 2009/01/12_22:35:22 INFO: No processes on /fra were
signalled
Filesystem[5839]: 2009/01/12_22:35:23 ERROR: Couldn't unmount /fra;
trying cleanup with SIGTERM
Filesystem[5839]: 2009/01/12_22:35:23 INFO: No processes on /fra were
signalled
Filesystem[5839]: 2009/01/12_22:35:24 ERROR: Couldn't unmount /fra;
trying cleanup with SIGTERM
Filesystem[5839]: 2009/01/12_22:35:24 INFO: No processes on /fra were
signalled
Filesystem[5839]: 2009/01/12_22:35:25 ERROR: Couldn't unmount /fra;
trying cleanup with SIGKILL
Filesystem[5839]: 2009/01/12_22:35:25 INFO: Some processes on /fra were
signalled
Filesystem[5839]: 2009/01/12_22:35:26 ERROR: Couldn't unmount /fra;
trying cleanup with SIGKILL
Filesystem[5839]: 2009/01/12_22:35:26 INFO: No processes on /fra were
signalled
Filesystem[5839]: 2009/01/12_22:35:27 ERROR: Couldn't unmount /fra;
trying cleanup with SIGKILL
Filesystem[5839]: 2009/01/12_22:35:28 INFO: No processes on /fra were
signalled
Filesystem[5839]: 2009/01/12_22:35:29 ERROR: Couldn't unmount /fra,
giving up!
Filesystem[5828]: 2009/01/12_22:35:29 ERROR: Generic error
.......
ResourceManager[5295]: 2009/01/12_22:36:36 ERROR: Return code 1 from
/etc/ha.d/resource.d/Filesystem
Filesystem[9851]: 2009/01/12_22:36:36 INFO: Running OK
ResourceManager[5295]: 2009/01/12_22:36:36 CRIT: Resource STOP failure. Reboot
required!
ResourceManager[5295]: 2009/01/12_22:36:36 CRIT: Killing heartbeat
ungracefully!
---
After that the server reboots. After the reboot everything works
fine again.
I don't know why it cannot unmount the device correctly. Sometimes I can
stop heartbeat without errors and sometimes not.
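One way to see what still holds /fra open just before the unmount is to walk /proc and list every process whose working directory, root, executable or open file descriptor points below the mount point. This is only a minimal sketch, assuming Linux with /proc and root privileges. The function name holders is made up for illustration, it matches by path prefix rather than by device, and it does not see memory-mapped files or kernel-side users such as nested mounts or NFS exports.

```python
import os


def holders(mount_point):
    """Return {pid: [reasons]} for processes using paths under mount_point.

    Hypothetical helper: checks cwd, root, exe and open file descriptors
    via the symlinks in /proc/<pid>/. Matching is by path prefix only.
    """
    mount_point = os.path.realpath(mount_point)
    prefix = mount_point.rstrip("/") + "/"
    found = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        base = os.path.join("/proc", pid)
        links = [os.path.join(base, name) for name in ("cwd", "root", "exe")]
        try:
            fd_dir = os.path.join(base, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process already exited, or we lack permission
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue  # link vanished or is not readable
            if target == mount_point or target.startswith(prefix):
                found.setdefault(int(pid), []).append(f"{link} -> {target}")
    return found


if __name__ == "__main__":
    # /fra is the mount point from the log that fails to unmount.
    for pid, reasons in sorted(holders("/fra").items()):
        print(pid, *reasons, sep="\n    ")
```

Running it right before /etc/init.d/heartbeat stop, and again when the unmount fails, would show whether a user-space process is involved at all.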
My haresources file:
---
th-dus-mqm 10.10.121.130 92.254.37.53 drbddisk::drbd0
Filesystem::/dev/drbd0::/dus::ext3 drbddisk::drbd2
Filesystem::/dev/drbd2::/home/tbmx/dus::ext3 mqm_dus
th-fra-mqm 10.10.121.131 92.254.37.54 drbddisk::drbd1
Filesystem::/dev/drbd1::/fra::ext3 drbddisk::drbd3
Filesystem::/dev/drbd3::/home/tbmx/fra::ext3 mqm_fra
---
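For reading the log it helps to know the order in which heartbeat releases a group: the log shows the resources being stopped in the reverse of the order in which they are listed in haresources. A small sketch of that ordering, assuming each group is one logical line in the file (it is only wrapped in this mail). stop_order is a hypothetical helper, not part of heartbeat.

```python
def stop_order(haresources_line):
    """Split one haresources group into (node, resources in stop order).

    Hypothetical helper: the first field is the preferred node, the rest
    are resources, which are released last-to-first.
    """
    node, *resources = haresources_line.split()
    return node, list(reversed(resources))


# The th-fra-mqm group from haresources, joined into one logical line.
line = (
    "th-fra-mqm 10.10.121.131 92.254.37.54 drbddisk::drbd1 "
    "Filesystem::/dev/drbd1::/fra::ext3 drbddisk::drbd3 "
    "Filesystem::/dev/drbd3::/home/tbmx/fra::ext3 mqm_fra"
)

node, order = stop_order(line)
for resource in order:
    print(resource)
```

For th-fra-mqm this puts mqm_fra first, then /home/tbmx/fra, and only then /fra, which is the unmount that fails.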
My ha.cf:
---
node th-dus-mqm th-fra-mqm
ucast bond0.121 10.10.121.132
ucast bond0.121 10.10.121.133
auto_failback off
debugfile /var/log/ha-debug
logfile /var/log/ha-log
warntime 3
deadtime 6
initdead 60
keepalive 2
---
My drbd.conf:
---
resource drbd0 {
  protocol C;
  startup {
    become-primary-on th-dus-mqm;
  }
  syncer {
    rate 50M;
  }
  net {
    allow-two-primaries;
  }
  on th-dus-mqm {
    device /dev/drbd0;
    disk /dev/sda10;
    address 10.10.121.132:7766;
    meta-disk internal;
  }
  on th-fra-mqm {
    device /dev/drbd0;
    disk /dev/sda10;
    address 10.10.121.133:7766;
    meta-disk internal;
  }
}
resource drbd1 {
  protocol C;
  startup {
    become-primary-on th-fra-mqm;
  }
  syncer {
    rate 50M;
  }
  net {
    allow-two-primaries;
  }
  on th-dus-mqm {
    device /dev/drbd1;
    disk /dev/sda11;
    address 10.10.121.132:7776;
    meta-disk internal;
  }
  on th-fra-mqm {
    device /dev/drbd1;
    disk /dev/sda11;
    address 10.10.121.133:7776;
    meta-disk internal;
  }
}
resource drbd2 {
  protocol C;
  startup {
    become-primary-on th-dus-mqm;
  }
  syncer {
    rate 50M;
  }
  net {
    allow-two-primaries;
  }
  on th-dus-mqm {
    device /dev/drbd2;
    disk /dev/sda12;
    address 10.10.121.132:7786;
    meta-disk internal;
  }
  on th-fra-mqm {
    device /dev/drbd2;
    disk /dev/sda12;
    address 10.10.121.133:7786;
    meta-disk internal;
  }
}
resource drbd3 {
  protocol C;
  startup {
    become-primary-on th-fra-mqm;
  }
  syncer {
    rate 50M;
  }
  net {
    allow-two-primaries;
  }
  on th-dus-mqm {
    device /dev/drbd3;
    disk /dev/sda13;
    address 10.10.121.132:7796;
    meta-disk internal;
  }
  on th-fra-mqm {
    device /dev/drbd3;
    disk /dev/sda13;
    address 10.10.121.133:7796;
    meta-disk internal;
  }
}
---
I hope you guys can help me with my problem.
Thanks in advance.
Kind regards
Sebastian