Dear Community,

 

We are running a Ceph Luminous cluster with CephFS (BlueStore OSDs). During
setup, we made the mistake of configuring the OSDs on RAID volumes.
Initially, our cluster consisted of 3 nodes, each housing 1 OSD; we are
currently in the process of remediating this. After a loss of metadata
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025612.html)
due to resetting the journal (journal entries were not being flushed fast
enough), we managed to bring the cluster back up and started adding 2
additional nodes
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027563.html).

 

After adding the two additional nodes, we increased the number of placement
groups, not only to accommodate the new nodes but also to prepare for the
reinstallation of the misconfigured ones. Since then, the number of placement
groups per OSD has of course been too high. Despite this, cluster health
remained fine over the last few months.

 

However, we are currently observing massive problems: whenever we try to
access any folder via CephFS, e.g. by listing its contents, there is no
response. Clients are getting blacklisted, but no warning is shown. 'ceph -s'
reports everything is OK, except for the number of PGs being too high.
Grepping for "assert" or "error" in any of the logs turns up nothing. Also,
it is not possible to reduce the number of active MDS daemons to 1: after
issuing 'ceph fs set fs_data max_mds 1', nothing happens.
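
For completeness, the commands in question, plus a few related checks I can
post output for if needed (fs_data is our file system name):

  ceph -s
  ceph health detail
  ceph osd blacklist ls
  ceph fs get fs_data
  ceph fs set fs_data max_mds 1    # no visible effect; number of active MDS does not change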

 

Cluster details are available here: https://gitlab.uni-trier.de/snippets/77 

 

The MDS log (https://gitlab.uni-trier.de/snippets/79?expanded=true&viewer=simple)
contains none of the usual "nicely exporting to" messages, but instead
entries like this:

2019-02-15 08:44:52.464926 7fdb13474700  7 mds.0.server
try_open_auth_dirfrag: not auth for [dir 0x100011ce7c6 /home/r-admin/
[2,head] rep@1.1 dir_auth=1 state=0 f(v4 m2019-02-14 13:19:41.300993
80=48+32) n(v11339 rc2019-02-14 13:19:41.300993 b10116465260
10869=10202+667) hs=7+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1
replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x564343eed100], fw
to mds.1
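
If further MDS-side information would help, I can also post output from the
admin sockets, e.g. something along these lines (daemon name is a
placeholder):

  ceph daemon mds.<name> session ls
  ceph daemon mds.<name> dump_ops_in_flight
  ceph daemon mds.<name> get subtrees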

 

An update from 12.2.8 to 12.2.11, which I ran last week, didn't help.

 

Does anybody have an idea or a hint as to where I could look next? Any help
would be greatly appreciated!

 

Kind regards

Christian Hennen

 

Project Manager Infrastructural Services
ZIMK University of Trier

Germany
