Looks like something went a little wrong with the snapshot metadata in that PG. If the PG is still going active from the other copies, you're probably best off using ceph-objectstore-tool to remove that PG's copy on the OSD that is crashing. You can then either replace it with an export taken from one of the other nodes, or let Ceph backfill it on its own.
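A rough sketch of the commands involved (untested; the OSD ids, data paths, and export file name below are placeholders for your setup, the OSD you run ceph-objectstore-tool against must be stopped first, and you should double-check which copy is healthy before removing anything):

# On an OSD that holds a good copy of pg 5.9b (OSD stopped):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<healthy-id> \
    --pgid 5.9b --op export --file /tmp/pg.5.9b.export

# On the crashing OSD (stopped); newer releases want --force for remove:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<crashing-id> \
    --pgid 5.9b --op remove --force

# Then either import the good copy back...
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<crashing-id> \
    --op import --file /tmp/pg.5.9b.export

# ...or skip the import, start the OSD again, and let backfill repopulate
# the PG from the surviving replicas.

(On a filestore OSD you may also need --journal-path; check ceph-objectstore-tool --help on your version before running anything.)
-Greg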
On Tue, May 15, 2018 at 2:13 AM Siegfried Höllrigl <siegfried.hoellr...@xidras.com> wrote:

> Hi !
>
> We have upgraded our Ceph cluster (3 Mon Servers, 9 OSD Servers, 190
> OSDs total) from 10.2.10 to Ceph 12.2.4 and then to 12.2.5.
> (A mixture of Ubuntu 14 and 16 with the repos from
> https://download.ceph.com/debian-luminous/)
>
> Now we have the problem that one OSD is crashing again and again
> (approx. once per day). systemd restarts it.
>
> We could now probably identify the problem. It looks like one placement
> group (5.9b) causes the crash.
> It seems like it doesn't matter if it is running on a filestore or a
> bluestore OSD.
> We could even break it down to some RBDs that were in this pool.
> They are already deleted, but it looks like there are some objects on
> the OSD left, but we can't delete them:
>
> rados -p rbd ls > radosrbdls.txt
> cat radosrbdls.txt | grep -vE "($(rados -p rbd ls | grep rbd_header |
> grep -o "\.[0-9a-f]*" | sed -e :a -e '$!N; s/\n/|/; ta' -e
> 's/\./\\./g'))" | grep -E '(rbd_data|journal|rbd_object_map)'
> rbd_data.112913b238e1f29.0000000000000e3f
> rbd_data.112913b238e1f29.00000000000009d2
> rbd_data.112913b238e1f29.0000000000000ba3
>
> rados -p rbd rm rbd_data.112913b238e1f29.0000000000000e3f
> error removing rbd>rbd_data.112913b238e1f29.0000000000000e3f: (2) No
> such file or directory
> rados -p rbd rm rbd_data.112913b238e1f29.00000000000009d2
> error removing rbd>rbd_data.112913b238e1f29.00000000000009d2: (2) No
> such file or directory
> rados -p rbd rm rbd_data.112913b238e1f29.0000000000000ba3
> error removing rbd>rbd_data.112913b238e1f29.0000000000000ba3: (2) No
> such file or directory
>
> In the "current" directory of the OSD there are a lot more files with
> this rbd prefix.
> Is there any chance to delete this obviously orphaned stuff before the
> PG becomes healthy?
> (It is running now on only 2 of 3 OSDs.)
>
> What else could cause such a crash?
>
> We attach (hopefully all of) the relevant logs.