Hi all,

I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1 
standby. I have 3 shares: /projects, /home and /scratch, and I've decided to 
try manual pinning as described here: 
http://docs.ceph.com/docs/master/cephfs/multimds/


/projects is pinned to mds.0 (rank 0)

/home and /scratch are pinned to mds.1 (rank 1)

Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[] | 
[.dir.path, .auth_first, .export_pin]'`
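
The pins themselves were set with the ceph.dir.pin xattr as described in the
docs above, roughly like this (/mnt/cephfs is just my example mount point here):

  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/scratch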


Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13.
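
(The mounts are along these lines; the monitor address, user name and secret
file are placeholders rather than our real values:)

  ceph-fuse -m mon1:6789 /mnt/cephfs
  mount -t ceph mon1:6789:/ /mnt/cephfs -o name=cephfs_user,secretfile=/etc/ceph/cephfs.secret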


On our test cluster (same version and setup), it works as I think it should. I 
simulate metadata load via mdtest (up to around 2000 req/s on each MDS, each of 
which is a VM with 4 cores and 16GB RAM): loads on /projects go to mds.0, and 
loads on the other shares go to mds.1. Nothing pops up in the logs. I can also 
successfully reset to no pinning (i.e. back to the default load balancing) by 
setting the ceph.dir.pin value to -1, and vice versa (example commands further 
down). All that shows up in the logs is this:

....  mds.mds1-test-ceph2 asok_command: get subtrees (starting...)

....  mds.mds1-test-ceph2 asok_command: get subtrees (complete)
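
(The reset/re-pin itself is just the xattr flipped back and forth, e.g., again
with /mnt/cephfs as the example mount point:)

  setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/home   # back to the default balancer
  setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/home    # re-pin to rank 1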

However, on our production cluster, with more powerful MDSes (10 cores 3.4GHz, 
256GB RAM, much faster networking), I get this in the logs constantly:

2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting to 
mds.0 [dir 0x1000010cd91.1110* /home/ [2,head] auth{0=1017} v=5632699 
cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84 
55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711 
423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1 
replicated=1 dirty=1 authpin=0 0x55691ccf1c00]

To clarify: /home is pinned to mds.1, so there is no reason it should be 
exporting this subtree to mds.0. The loads on both MDSes (req/s, network, CPU) 
are fairly low, lower than those on the test MDS VMs.
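
(For completeness, the same subtree check can be narrowed down to /home with a
jq select on the command above:)

  ceph daemon mds.$mds_hostname get subtrees | jq '.[] | select(.dir.path == "/home") | [.dir.path, .auth_first, .export_pin]'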

Sometimes (depending on which MDS starts first) I get the same message the 
other way around, i.e. "mds.0.migrator nicely exporting to mds.1" for the 
workload that mds.0 should be doing. The message only ever appears on one MDS, 
never the other, until one of them is restarted.

And we've had a couple of occasions where we get this sort of slow request:

7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406 
seconds old, received at 2018-04-20 08:17:35.970498: 
client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116 2018-04-20 
08:17:35.970319 caller_uid=10171, caller_gid=10000{10000,10123,}) currently 
failed to authpin local pins

This then seems to snowball into thousands of slow requests, until mds.0 is 
restarted. When these slow requests happen, loads are fairly low on the active 
MDSes, although it is possible the users are doing something funky with 
metadata in production that I can't reproduce with mdtest.

I suspect the manual pinning isn't working as intended, given the 
"mds.1.migrator nicely exporting to mds.0" messages in the logs (to me they 
suggest we still have a bad load-balancing situation), but I can't replicate 
the issue in test: the test cluster works as intended.

Am I doing manual pinning right? Should I even be using it?

Cheers,
Linh