Hi,
I'm running a Ceph cluster with four iSCSI exporter nodes and oVirt on the client
side. In the tcmu-runner logs I see the following happening every few seconds:
###
2019-10-22 10:11:11.231 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired
exclusive lock.
2019-10-22 10:11:11.395 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2:
Could not check lock ownership. Error: Cannot send after transport endpoint
shutdown.
2019-10-22 10:11:12.346 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0:
Async lock drop. Old state 1
2019-10-22 10:11:12.353 1710 [INFO] alua_implicit_transition:566
rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:13.325 1710 [INFO] alua_implicit_transition:566
rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:13.852 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2:
Could not check lock ownership. Error: Cannot send after transport endpoint
shutdown.
2019-10-22 10:11:13.854 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1:
Could not check lock ownership. Error: Cannot send after transport endpoint
shutdown.
2019-10-22 10:11:13.863 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun1:
Could not check lock ownership. Error: Cannot send after transport endpoint
shutdown.
2019-10-22 10:11:14.202 1710 [INFO] alua_implicit_transition:566
rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:14.285 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2:
Could not check lock ownership. Error: Cannot send after transport endpoint
shutdown.
2019-10-22 10:11:15.217 1710 [WARN] tcmu_rbd_lock:762 rbd/image.lun0: Acquired
exclusive lock.
2019-10-22 10:11:15.873 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2:
Could not check lock ownership. Error: Cannot send after transport endpoint
shutdown.
2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0:
Async lock drop. Old state 1
2019-10-22 10:11:16.696 1710 [INFO] alua_implicit_transition:566
rbd/image.lun0: Starting lock acquisition operation.
2019-10-22 10:11:16.696 1710 [WARN] tcmu_notify_lock_lost:222 rbd/image.lun0:
Async lock drop. Old state 2
2019-10-22 10:11:16.992 1710 [ERROR] tcmu_rbd_has_lock:516 rbd/image.lun2:
Could not check lock ownership. Error: Cannot send after transport endpoint
shutdown.
###
This happens on all four of my iSCSI exporter nodes. The blacklist gives me the
following (the number of blacklisted entries does not really shrink):
###
ceph osd blacklist ls
listed 10579 entries
###
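In case it helps, this is roughly how I would inspect and clean up single entries
by hand (just a sketch; the address below is a placeholder, not one of my real
clients):
###
# list the first few entries to see which client addresses are affected
ceph osd blacklist ls | head -n 5
# remove a single stale entry by its address/nonce
ceph osd blacklist rm 192.168.121.1:0/3012034728
###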
On the client side I configured multipath like this:
###
device {
        vendor "LIO-ORG"
        hardware_handler "1 alua"
        path_grouping_policy "failover"
        path_selector "queue-length 0"
        failback 60
        path_checker tur
        prio alua
        prio_args exclusive_pref_bit
        fast_io_fail_tmo 25
        no_path_retry queue
}
###
And multipath -ll shows me all four paths as "active ready" without errors.
To me this looks like tcmu-runner cannot acquire (or keep) the exclusive lock
and it is flapping between nodes. In addition, in the Ceph GUI / dashboard I can
see the "active/optimized" state of the LUNs flapping between nodes ...
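A quick way to cross-check this (just a sketch, using one of the image names from
the log above) would be to watch the exclusive-lock owner and the watchers of an
image directly with rbd:
###
# shows the current exclusive-lock holder of the image
rbd lock list rbd/image.lun0
# shows the watchers, i.e. which gateways currently have the image open
rbd status rbd/image.lun0
###
If the lock owner shown there changes every few seconds, that would match the
flapping I see in the dashboard.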
I have the following versions installed (CentOS 7.7, Ceph 13.2.6):
###
rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"
python-cephfs-13.2.6-0.el7.x86_64
ceph-selinux-13.2.6-0.el7.x86_64
kernel-3.10.0-957.5.1.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
kernel-tools-libs-3.10.0-1062.1.2.el7.x86_64
libcephfs2-13.2.6-0.el7.x86_64
libtcmu-1.4.0-106.gd17d24e.el7.x86_64
ceph-common-13.2.6-0.el7.x86_64
ceph-osd-13.2.6-0.el7.x86_64
tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
kernel-3.10.0-1062.1.2.el7.x86_64
ceph-iscsi-3.3-1.el7.noarch
kernel-headers-3.10.0-1062.1.2.el7.x86_64
kernel-3.10.0-862.14.4.el7.x86_64
ceph-base-13.2.6-0.el7.x86_64
kernel-tools-3.10.0-1062.1.2.el7.x86_64
###
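If more verbose logs would help, I can raise the tcmu-runner log level; a minimal
sketch (assuming the default config location /etc/tcmu/tcmu.conf, where 3 = info
and 4 = debug):
###
# /etc/tcmu/tcmu.conf
log_level = 4

# restart the runner so the new level takes effect
systemctl restart tcmu-runner
###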
Greets,
Kilian