We've recently recovered from a bit of a disaster.  We had some power 
outages (a combination of data centre power maintenance and us not having our 
redundant power supplies connected to the correct redundant power circuits - 
lesson learnt).  We ended up with one OSD that wouldn't start - most likely 
filesystem corruption, as an fsck found and fixed a couple of errors but the 
OSD still wouldn't start - so we marked it lost and let the data backfill to 
other OSDs.  That left a handful of 'incomplete' or 'incomplete/down' PGs, 
which was causing radosgw to stop accepting connections.  A useful blog post 
got us to the point of using ceph-objectstore-tool to determine the correct 
remaining copy and mark those PGs as complete (roughly the commands below).  
Backfill then wrote the PGs out to different OSDs, the cluster returned to 
HEALTH_OK, and radosgw started working normally.  At least, that's what I 
thought.
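
For reference, the recovery procedure was roughly the following - I'm going 
from memory, so treat the OSD number (NN) and pgid as placeholders rather 
than a copy-paste of what we actually typed:

   # the dead OSD had already been marked lost:
   #   ceph osd lost NN --yes-i-really-mean-it
   # with each surviving OSD stopped, compare last_update to find the best copy
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
       --journal-path /var/lib/ceph/osd/ceph-NN/journal \
       --pgid <pgid> --op info
   # keep a safety copy of the chosen PG before changing anything
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
       --journal-path /var/lib/ceph/osd/ceph-NN/journal \
       --pgid <pgid> --op export --file /root/<pgid>.export
   # then mark that copy complete and start the OSD again
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
       --journal-path /var/lib/ceph/osd/ceph-NN/journal \
       --pgid <pgid> --op mark-complete
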
Now, I'm seeing one OSD crash every now and then - sometimes after a few 
hours, sometimes after only 10 minutes.  It always starts itself up again, 
the queued-up backfills cancel, and the cluster returns to OK until the next 
time.  It's always the same OSD and, going through the logs just now, the 
crash always seems to happen while a scrub is running on the same PG 
(although I haven't checked every single instance to be completely sure).
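
To correlate the two I've been doing nothing more sophisticated than a grep 
along these lines (NN being the OSD number) and eyeballing which scrub lines 
sit just before each crash:

   grep -E "Caught signal \(Aborted\)|replica scrub\(pg: 30.65" /var/log/ceph/ceph-osd.NN.log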

We're running Jewel (yes, I know it's old but we can't upgrade).

Here are the last few lines from the OSD log from two different crashes.  
I've used the thread id (the hex code on the "Caught signal" line) to pick 
out the preceding events from that same thread in each instance.  Both look 
roughly the same in that it's always the same PG, but the last object shown 
in the log before the crash is different each time.
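
(I pulled these out with nothing cleverer than grepping for each crash's 
thread id, along the lines of:

   grep 7fed046e6700 /var/log/ceph/ceph-osd.NN.log

again with NN standing in for the OSD number, and then trimmed it down to the 
interesting entries.)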

   -91> 2021-04-21 02:37:32.118290 7fed046e6700  5 write_log with: dirty_to: 
0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 
false, divergent_priors: 0, writeout_from: 3110'3174946, trimmed:
   -90> 2021-04-21 02:37:32.118380 7fed046e6700  5 -- op tracker -- seq: 2219, 
time: 2021-04-21 02:37:32.118379, event: commit_queued_for_journal_write, op: 
osd_repop(client.3095191.0:19172420 30.65 
30:a78d321d:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.18432929_1149756827.ogg:head
 v 3110'3174946)
    -1> 2021-04-21 02:37:32.748831 7fed046e6700  5 -- op tracker -- seq: 2221, 
time: 2021-04-21 02:37:32.748830, event: reached_pg, op: replica scrub(pg: 
30.65,from:0'0,to:2923'3171906,epoch:3110,start:30:a63a08df:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.22382834__shadow_.cqXthfu1litKEyNZ53I_voGLwuhonVX_1:0,end:30:a63a13fd:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.26065941_1234799607.gsm:0,chunky:1,deep:1,seed:4294967295,version:6)
     0> 2021-04-21 02:37:32.797826 7fed046e6700 -1 os/filestore/FileStore.cc: 
In function 'int FileStore::lfn_find(const ghobject_t&, const Index&, 
IndexedPath*)' thread 7fed046e6700 time 2021-04-21 02:37:32.790356
2021-04-21 02:37:32.859265 7fed046e6700 -1 *** Caught signal (Aborted) **
 in thread 7fed046e6700 thread_name:tp_osd_tp
     0> 2021-04-21 02:37:32.859265 7fed046e6700 -1 *** Caught signal (Aborted) 
**
 in thread 7fed046e6700 thread_name:tp_osd_tp


    -17> 2021-04-21 03:55:09.090430 7f43382c7700  5 -- op tracker -- seq: 1596, 
time: 2021-04-21 03:55:09.090430, event: done, op: replica scrub(pg: 
30.65,from:0'0,to:2979'3174652,epoch:3122,start:30:a639eb4a:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.17157485_1132337117.ogg:0,end:30:a639f7f1:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.21768594_1188151257.ogg:0,chunky:1,deep:1,seed:4294967295,version:6)
    -5> 2021-04-21 03:55:09.777503 7f43382c7700  5 -- op tracker -- seq: 1598, 
time: 2021-04-21 03:55:09.777476, event: reached_pg, op: replica scrub(pg: 
30.65,from:0'0,to:2929'3172006,epoch:3122,start:30:a63a047c:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.18649542__shadow_.tKOWzKIibnLhX3Bu32FiiuG0FH1lIl4_1:0,end:30:a63a0ea6:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.14411965_1101425097.gsm:0,chunky:1,deep:1,seed:4294967295,version:6)
     0> 2021-04-21 03:55:10.089217 7f43382c7700 -1 os/filestore/FileStore.cc: 
In function 'int FileStore::lfn_find(const ghobject_t&, const Index&, 
IndexedPath*)' thread 7f43382c7700 time 2021-04-21 03:55:10.081373
2021-04-21 03:55:10.157208 7f43382c7700 -1 *** Caught signal (Aborted) **
 in thread 7f43382c7700 thread_name:tp_osd_tp
     0> 2021-04-21 03:55:10.157208 7f43382c7700 -1 *** Caught signal (Aborted) 
**
 in thread 7f43382c7700 thread_name:tp_osd_tp

Any ideas what to do next?

Regards,
Mark Johnson
