Querying the problem pgs gives me the following:

1.38:
{
    "state": "incomplete",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 2247,
    "up": [
        17,
        4
    ],
    "acting": [
        17,
        4
    ],

.....

                    "up": [
                        14,
                        6
                    ],
                    "acting": [
                        14,
                        6
                    ],
                    "primary": 14,
                    "up_primary": 14
                },

             .........

            "probing_osds": [
                "4",
                "17",
                "22"
            ],
            "down_osds_we_would_probe": [
                6
            ],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                {
                    "detail": "peering_blocked_by_history_les_bound"
                }
            ]

30.c1:
{
    "state": "down+incomplete",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 2247,
    "up": [
        14,
        25
    ],
    "acting": [
        14,
        25
    ],

......

            "up": [
                14,
                25
            ],
            "acting": [
                14,
                25
            ],
            "blocked_by": [
                6
            ],

.....

            "probing_osds": [
                "14",
                "25"
            ],
            "down_osds_we_would_probe": [
                6
            ],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                {
                    "detail": "peering_blocked_by_history_les_bound"
                }
            ]

Both of these list the "lost" osd 6 under down_osds_we_would_probe, and 30.c1 also shows "blocked_by": [6].
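
For reference, the above is just the output of querying each pg individually, i.e.:

# ceph pg 1.38 query
# ceph pg 30.c1 query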

I've seen this thread - 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html 
- which mentions using ceph-objectstore-tool on the primary OSD to mark the pg 
as complete. Another response said to pick a "winner" pg out of those available 
and use ceph-objectstore-tool to remove the other one, and "hopefully the winner 
you left alone will allow the pg to recover and go active". But I'm somewhat 
lost as to whether these suggestions are correct, as there's no follow-up 
response saying what (if anything) worked.
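
To be clear, I haven't run any of that yet. As I read it, the "mark as complete" 
suggestion amounts to something along these lines (just a sketch with placeholder 
paths and service names for my setup; I gather you'd want to export the pg first, 
since mark-complete can throw data away):

# stop the acting primary for the pg first (osd.17 for pg 1.38 in my case)
systemctl stop ceph-osd@17
# take an export of the pg before touching anything
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
    --journal-path /var/lib/ceph/osd/ceph-17/journal \
    --pgid 1.38 --op export --file /root/pg1.38.export
# then mark the pg complete on that OSD and bring it back up
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
    --journal-path /var/lib/ceph/osd/ceph-17/journal \
    --pgid 1.38 --op mark-complete
systemctl start ceph-osd@17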



If I try to query the other two, they just hang, which is even more 
concerning, and I have to break out, at which point I see this:

RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
"pgid": "30.7a"}']": exception 'int' object is not iterable

and

RuntimeError: "None": exception "['{"prefix": "get_command_descriptions", 
"pgid": "30.8d"}']": exception 'int' object is not iterable

I'm currently able to write some data into Ceph via rados, but it appears some 
writes are failing, presumably because they map to the problem pgs.
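
(I'm assuming the way to check whether a particular object would land on one of 
the problem pgs is something like the following, with pool and object names as 
placeholders:)

# ceph osd map <pool> <object-name>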



-----Original Message-----
From: Mark Johnson <ma...@iovox.com>
To: a...@iss-integration.com
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Can't get one OSD (out of 14) to start
Date: Sat, 17 Apr 2021 02:20:37 +0000


All the backfill operations are complete and I'm now just left with the 3 
incomplete and 1 down+incomplete pgs:


# ceph health detail
HEALTH_ERR 4 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean; 266 requests are blocked > 32 sec; 3 osds have slow requests
pg 1.38 is stuck inactive for 80654.111975, current state incomplete, last acting [17,4]
pg 30.7a is stuck inactive for 76259.649932, current state incomplete, last acting [12,9]
pg 30.8d is stuck inactive for 76201.794001, current state incomplete, last acting [0,5]
pg 30.c1 is stuck inactive for 76305.051390, current state down+incomplete, last acting [14,25]
pg 1.38 is stuck unclean for 80654.112037, current state incomplete, last acting [17,4]
pg 30.7a is stuck unclean for 76259.649989, current state incomplete, last acting [12,9]
pg 30.8d is stuck unclean for 76201.794058, current state incomplete, last acting [0,5]
pg 30.c1 is stuck unclean for 76305.051447, current state down+incomplete, last acting [14,25]
pg 30.c1 is down+incomplete, acting [14,25]
pg 30.8d is incomplete, acting [0,5]
pg 30.7a is incomplete, acting [12,9]
pg 1.38 is incomplete, acting [17,4]
50 ops are blocked > 33554.4 sec on osd.14
16 ops are blocked > 16777.2 sec on osd.14
2 ops are blocked > 67108.9 sec on osd.12
98 ops are blocked > 33554.4 sec on osd.12
100 ops are blocked > 33554.4 sec on osd.0
3 osds have slow requests
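
(If it's useful, I believe the blocked ops on those OSDs can be inspected via 
the admin socket on each host, e.g.:)

# ceph daemon osd.14 dump_ops_in_flight
# ceph daemon osd.14 dump_historic_ops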



I tried issuing a 'ceph pg repair' to one of those PGs and got the following:


# ceph pg repair 1.38

instructing pg 1.38 on osd.17 to repair


But it doesn't appear to be doing anything.  Health status still says the exact 
same thing.  No idea where to go from here.
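
About all I can think of is to keep watching the cluster log and re-querying 
the pg to see if its state changes, e.g.:

# ceph -w
# ceph pg 1.38 query | grep '"state"'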



-----Original Message-----
From: Mark Johnson <ma...@iovox.com>
To: a...@iss-integration.com
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 22:00:20 +0000



That's the exact same page I used to mark the osd as lost. Nothing in there 
seems to reference the incomplete and down+incomplete pgs that I have, however, 
so I really don't know if it helps me. I don't really understand what my 
problem is here.





-----Original Message-----
From: Alex Gorbachev <a...@iss-integration.com>
To: Mark Johnson <ma...@iovox.com>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 14:16:28 -0400



Hi Mark,



I wonder if the following will help you:


https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/





There are instructions there on how to mark unfound PGs lost and delete them.  
You will regain a healthy cluster that way, and then you can adjust replica 
counts etc. to best practice, and restore your objects.
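
From memory, the command in those docs is along these lines, run once per stuck 
pg; whether you use revert or delete depends on whether you want to roll back 
to an older version of the objects or discard them entirely:

# ceph pg 1.38 mark_unfound_lost revert
# ceph pg 1.38 mark_unfound_lost delete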



Best regards,


--


Alex Gorbachev


ISS/Storcium





On Fri, Apr 16, 2021 at 10:51 AM Mark Johnson <ma...@iovox.com> wrote:


I ran an fsck on the problem OSD and found and repaired a couple of errors.  
I remounted and started the OSD, but it crashed again shortly after, as before.  
So (possibly on bad advice) I figured I'd mark the OSD lost and let the cluster 
backfill its pgs to other OSDs, which is now in progress.  However, I'm seeing 
1 down+incomplete and 3 incomplete, and I'm expecting that these won't recover.
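
(For reference, marking it lost was just the standard command, with osd.6 being 
the one that won't stay up:)

# ceph osd lost 6 --yes-i-really-mean-it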



So, I would love to know what my options are here once all the backfilling has 
finished (or stalled).  Losing data or even entire pgs isn't a big problem, as 
this cluster is really just a replica of our main cluster, so we can restore 
lost objects manually from there.  Is there a way I can clear 
out/repair/whatever these pgs so I can get a healthy cluster again?



Yes, I know this would probably have been easier with an additional storage 
server and a pool size of 3, but that's not going to help me right now.





-----Original Message-----
From: Mark Johnson <ma...@iovox.com>
To: ceph-users@ceph.io
Subject: [ceph-users] Can't get one OSD (out of 14) to start
Date: Fri, 16 Apr 2021 12:43:33 +0000




Really not sure where to go with this one.  Firstly, a description of my 
cluster.  Yes, I know there are a lot of "not ideals" here but this is what I 
inherited.




The cluster is running Jewel and has two storage/mon nodes and an additional 
mon-only node, with a pool size of 2.  Today, we had some power issues in the 
data centre and we very ungracefully lost both storage servers at the same 
time.  Node 1 came back online before node 2, but I could see there were a few 
OSDs that were down.  When node 2 came back, I started trying to get OSDs up.  
Each node has 14 OSDs and I managed to get all OSDs up and in on node 2, but 
one of the OSDs on node 1 keeps starting and crashing and just won't stay up.  
I'm not finding the OSD log output to be much use.  Current health status looks 
like this:




# ceph health
HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are blocked > 32 sec

# ceph status
    cluster e2391bbf-15e0-405f-af12-943610cb4909
     health HEALTH_ERR
            26 pgs are stuck inactive for more than 300 seconds
            26 pgs down
            26 pgs peering
            26 pgs stuck inactive
            26 pgs stuck unclean
            5 requests are blocked > 32 sec




Any clues as to what I should be looking for or what sort of action I should be 
taking to troubleshoot this?  Unfortunately, I'm a complete novice with Ceph.




Here's a snippet from the OSD log that means little to me...




--- begin dump of recent events ---
     0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) **
 in thread 7f2e23921ac0 thread_name:ceph-osd

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f2e24330c2a]
 2: (()+0xf5d0) [0x7f2e21ee95d0]
 3: (gsignal()+0x37) [0x7f2e2049f207]
 4: (abort()+0x148) [0x7f2e204a08f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
 12: (()+0x3c8847) [0x7f2e23d07847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
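
If more detail would help, I'm assuming I can run the OSD in the foreground 
with higher debug levels and capture the output, something along the lines of 
(osd id is a placeholder):

# ceph-osd -i <osd-id> -f --debug-osd 20 --debug-filestore 20 --debug-journal 20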




Thanks in advance,



Mark




_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
