Hi, I've been looking at the Ceph MDS perf counters and noticed that one of my clusters is hugely different from the others in the number of caps:
rlat inos caps | hsr hcs hcr | writ read actv | recd recy stry purg | segs evts subm
   0 3.0M 5.1M |   0   0 595 |  304    4    0 |    0    0  13k    0 |   42  35k  893
   0 3.0M 5.1M |   0   0 165 | 1.8k    4   37 |    0    0  13k    0 |   43  36k  302
  16 3.0M 5.1M |   0   0 429 |  247    9    4 |    0    0  13k   58 |   38  32k 1.7k
   0 3.0M 5.1M |   0   1 213 | 1.2k    0  857 |    0    0  13k    0 |   40  33k  766
  23 3.0M 5.1M |   0   0 945 |  445    1    0 |    0    0  13k    0 |   41  34k 1.1k
   0 3.0M 5.1M |   0   2 696 |  376   11    0 |    0    0  13k    0 |   43  35k 1.0k
   3 2.9M 5.1M |   0   0 601 | 2.0k    6    0 |    0    0  13k   56 |   38  29k 1.2k
   0 2.9M 5.1M |   0   0 394 |  272   11    0 |    0    0  13k    0 |   38  30k  758

On another cluster running the same version:

-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
rlat inos caps | hsr hcs hcr | writ read actv | recd recy stry purg | segs evts subm
   2 3.9M 380k |   0   1 266 | 1.8k    0  370 |    0    0  24k   44 |   37 129k 1.5k

I did a perf dump on the active MDS:

~# ceph daemon mds.a perf dump mds
{
    "mds": {
        "request": 2245276724,
        "reply": 2245276366,
        "reply_latency": {
            "avgcount": 2245276366,
            "sum": 18750003.074118977
        },
        "forward": 0,
        "dir_fetch": 20217943,
        "dir_commit": 555295668,
        "dir_split": 0,
        "inode_max": 3000000,
        "inodes": 3000276,
        "inodes_top": 152555,
        "inodes_bottom": 279938,
        "inodes_pin_tail": 2567783,
        "inodes_pinned": 2782064,
        "inodes_expired": 308697104,
        "inodes_with_caps": 2779658,
        "caps": 5147887,
        "subtrees": 2,
        "traverse": 2582452087,
        "traverse_hit": 2338123987,
        "traverse_forward": 0,
        "traverse_discover": 0,
        "traverse_dir_fetch": 16627249,
        "traverse_remote_ino": 29276,
        "traverse_lock": 2507504,
        "load_cent": 18446743868740589422,
        "q": 27,
        "exported": 0,
        "exported_inodes": 0,
        "imported": 0,
        "imported_inodes": 0
    }
}

and then a session ls to see which clients could be holding that many caps:

{
    "client_metadata": {
        "entity_id": "admin",
        "kernel_version": "4.4.0-97-generic",
        "hostname": "suppressed"
    },
    "completed_requests": 0,
    "id": 1165169,
    "num_leases": 343,
    "inst": "client.1165169 10.0.0.112:0/982172363",
    "state": "open",
    "num_caps": 111740,
    "reconnecting": false,
    "replay_requests": 0
},
{
    "state": "open",
    "replay_requests": 0,
    "reconnecting": false,
    "num_caps": 108125,
    "id": 1236036,
    "completed_requests": 0,
    "client_metadata": {
        "hostname": "suppressed",
        "kernel_version": "4.4.0-97-generic",
        "entity_id": "admin"
    },
    "num_leases": 323,
    "inst": "client.1236036 10.0.0.113:0/1891451616"
},
{
    "num_caps": 63186,
    "reconnecting": false,
    "replay_requests": 0,
    "state": "open",
    "num_leases": 147,
    "completed_requests": 0,
    "client_metadata": {
        "kernel_version": "4.4.0-75-generic",
        "entity_id": "admin",
        "hostname": "suppressed"
    },
    "id": 1235930,
    "inst": "client.1235930 10.0.0.110:0/2634585537"
},
{
    "num_caps": 2476444,
    "replay_requests": 0,
    "reconnecting": false,
    "state": "open",
    "num_leases": 0,
    "completed_requests": 0,
    "client_metadata": {
        "entity_id": "admin",
        "kernel_version": "4.4.0-75-generic",
        "hostname": "suppressed"
    },
    "id": 1659696,
    "inst": "client.1659696 10.0.0.101:0/4005556527"
},
{
    "state": "open",
    "replay_requests": 0,
    "reconnecting": false,
    "num_caps": 2386376,
    "id": 1069714,
    "client_metadata": {
        "hostname": "suppressed",
        "kernel_version": "4.4.0-75-generic",
        "entity_id": "admin"
    },
    "completed_requests": 0,
    "num_leases": 0,
    "inst": "client.1069714 10.0.0.111:0/1876172355"
},
{
    "replay_requests": 0,
    "reconnecting": false,
    "num_caps": 1726,
    "state": "open",
    "inst": "client.8394 10.0.0.103:0/3970353996",
    "num_leases": 0,
    "id": 8394,
    "client_metadata": {
        "entity_id": "admin",
        "kernel_version": "4.4.0-75-generic",
        "hostname": "suppressed"
    },
    "completed_requests": 0
}
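For anyone who wants to spot the big cap holders without reading the raw JSON, something along these lines should work (a rough, untested sketch; it assumes it runs on the MDS host and that the active daemon is mds.a, as in the perf dump above). It's just the same session ls command, sorted by num_caps:

#!/usr/bin/env python
# Rough sketch: list CephFS clients sorted by how many caps they hold,
# using the same admin-socket command shown above.
import json
import subprocess

# Assumes the active MDS is mds.a; adjust the daemon name as needed.
out = subprocess.check_output(["ceph", "daemon", "mds.a", "session", "ls"])
sessions = json.loads(out.decode("utf-8"))

# Biggest cap holders first.
for s in sorted(sessions, key=lambda s: s["num_caps"], reverse=True):
    print("%8d caps  %5d leases  %s" % (s["num_caps"], s["num_leases"], s["inst"]))

That makes the two ~2.4M-cap clients (10.0.0.101 and 10.0.0.111) stand out immediately.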
"suppressed" }, "completed_requests" : 0 } Surprisingly, the 2 hosts that were holding 2M+ caps were the ones not in use. Cephfs was mounted but nothing was using the dirs. I did mount -o remount cephfs on those 2 hosts and, after that, caps dropped significantly to less than 300k. "caps": 288489 So, questions: does that really matter? What are possible impacts? What could have caused this 2 hosts to hold so many capabilities? 1 of the hosts are for tests purposes, traffic is close to zero. The other host wasn't using cephfs at all. All services stopped. :~# ceph -v ceph version 10.2.9-4-gbeaec39 (beaec397f00491079cd74f7b9e3e10660859e26b) ~# uname -a Linux hostname_suppressed 4.4.0-75-generic #96~14.04.1-Ubuntu SMP Thu Apr 20 11:06:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux ~# dpkg -l | grep ceph ii ceph 10.2.9-4-gbeaec39-1trusty amd64 distributed storage and file system ii ceph-base 10.2.9-4-gbeaec39-1trusty amd64 common ceph daemon libraries and management tools ii ceph-common 10.2.9-4-gbeaec39-1trusty amd64 common utilities to mount and interact with a ceph storage cluster ii ceph-fs-common 10.2.9-4-gbeaec39-1trusty amd64 common utilities to mount and interact with a ceph file system ii ceph-mds 10.2.9-4-gbeaec39-1trusty amd64 metadata server for the ceph distributed file system ii ceph-mon 10.2.9-4-gbeaec39-1trusty amd64 monitor server for the ceph storage system ii ceph-osd 10.2.9-4-gbeaec39-1trusty amd64 OSD server for the ceph storage system ii libcephfs1 10.2.9-4-gbeaec39-1trusty amd64 Ceph distributed file system client library ii python-cephfs 10.2.9-4-gbeaec39-1trusty amd64 Python libraries for the Ceph libcephfs library Regards, Webert Lima DevOps Engineer at MAV Tecnologia *Belo Horizonte - Brasil* *IRC NICK - WebertRLZ*