>> The other option is what you describe: create a new data pool, place the fs
>> root on this pool, and copy every file onto itself. This should also
>> do the trick. However, with this method you will not be able to get rid of
>> the broken pool. After the copy, you could, however,
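For reference, such a migration could look roughly like this; the pool name
fs-data-new, the file system name cephfs and the mount point /mnt/cephfs are
placeholders, not taken from this thread:

  # create the new data pool and attach it to the file system
  ceph osd pool create fs-data-new 128
  ceph fs add_data_pool cephfs fs-data-new
  # point the root directory layout at the new pool; only newly written files
  # go there, so existing files have to be rewritten
  setfattr -n ceph.dir.layout.pool -v fs-data-new /mnt/cephfs
  # rewrite every file in place so its objects land in the new pool
  find /mnt/cephfs -type f -exec sh -c 'cp -a "$1" "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;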
====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Michael Thomas
Sent: 22 November 2020 18:29:16
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
On 10/23/20 3:07 AM, Frank Schilder wrote:
Hi Frank,
From my understanding, with my current filesystem layout, I should be
able to remove the "broken" pool once the data has been moved off of it.
This is because the "broken"
these will work. Please post your experience here.
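If it comes to that, the removal itself would presumably look something like
this (the pool name fs-data-old is a placeholder, and note that the default
data pool of a file system cannot be removed):

  # confirm nothing is left in the old pool
  ceph df
  rados -p fs-data-old ls | head
  # detach it from the file system, then delete it
  ceph fs rm_data_pool cephfs fs-data-old
  ceph osd pool rm fs-data-old fs-data-old --yes-i-really-really-mean-it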
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
On 10/23/20 3:07 AM, Frank Schilder wrote:
Hi Michael.
> I still don't see any traffic to the pool, though I'm also unsure how much
> traffic is to be expected.
Probably not much. If ceph df shows that the pool contains some objects, I
guess that's sorted.
That osdmaptool crashes indicates that your cluster runs with corrupted
internal
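For reference, the checks being discussed can be run along these lines (the
pool id is a placeholder):

  # does the pool hold any objects/data?
  ceph df detail
  # export the current osdmap and let osdmaptool test the PG mappings
  ceph osd getmap -o /tmp/osdmap
  osdmaptool /tmp/osdmap --test-map-pgs --pool <pool-id>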
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Frank Schilder
Sent: 22 October 2020 09:32:07
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects
Sounds good. Did you re-create the pool again?
ng client session (now
defunct) has been blacklisted. I'll check back later to see if the slow
OPS get cleared from 'ceph status'.
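A sketch of the checks involved ('blacklist' is the pre-Pacific spelling of
the command):

  # list currently blacklisted client sessions
  ceph osd blacklist ls
  # watch whether the slow ops warning clears
  ceph status
  ceph health detail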
Regards,
--Mike
From: Michael Thomas
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
olves the issue (but tell the user :).
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Michael Thomas
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael,
> > Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an
> > OSD mapping?
I meant here with crush rule replicated_host_nvme. Sorry, forgot.
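Something along these lines should do (the pool name test-nvme is just an
example):

  # one-PG test pool using the replicated_host_nvme crush rule
  ceph osd pool create test-nvme 1 1 replicated replicated_host_nvme
  # check whether the single PG gets up/acting OSDs assigned
  ceph pg ls-by-pool test-nvme
  # remove the test pool again afterwards
  ceph osd pool rm test-nvme test-nvme --yes-i-really-really-mean-it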
> Yes, the OSD was still out when the previous health report was created.
Hmm, this is odd. If this is correct, then
On 10/20/20 1:18 PM, Frank Schilder wrote:
> Dear Michael,
> > Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an
> > OSD mapping?
> I meant here with crush rule replicated_host_nvme. Sorry, forgot.
Seems to have worked fine:
https://pastebin.com/PFgDE4J1
Yes, the OSD was st
____________________
From: Frank Schilder
Sent: 16 October 2020 15:41:29
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple O
trative, like peering attempts.
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Frank Schilder
Sent: 16 October 2020 15:09:20
To: Michael Thomas; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD cras
see if this has any effect.
The crush rules and crush tree look OK to me. I can't really see why the
missing OSDs are not assigned to the two PGs 1.0 and 7.39d.
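For completeness, the mappings and rules in question can be inspected with
commands along these lines:

  # current mapping of the two PGs
  ceph pg map 1.0
  ceph pg map 7.39d
  # the crush rules and tree that were reviewed
  ceph osd crush rule dump
  ceph osd crush tree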
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Dear Michael,
thanks for this initial work. I will need to look through the files you posted
in more detail. In the meantime:
Please mark OSD 41 as
0 02:08:01
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
On 10/14/20 3:49 PM, Frank Schilder wrote:
> Hi Michael,
>
> it doesn't look too bad. All degraded objects are due to the undersized PG.
> If this is an EC pool with m
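The pool's erasure-code parameters and the undersized PGs can be checked with
commands of this kind (pool and profile names are placeholders):

  ceph osd pool get <pool> erasure_code_profile
  ceph osd erasure-code-profile get <profile>
  ceph pg dump_stuck undersized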
This problem may also be related to the below unsolved issue, which
specifically mentions 'unfound' objects. Sadly, there is probably
nothing in the report which will help with your troubleshooting.
https://tracker.ceph.com/issues/44286
C.
t the incomplete PG resolved with the above, but it will
move some issues out of the way before proceeding.
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
____________
From: Michael Thomas
Sent: 14 October 2020 20:52:10
To: Andreas John; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects
Hello,
The original cause of the OSD instability has already been fixed. It
was due to user jobs (via condor) consuming too much memory and causing
the machine
_______
> From: Michael Thomas
> Sent: 09 October 2020 22:33:46
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
>
> Hi Frank,
>
> That was a good tip. I was able to move the broken file
>> n and restore the now missing data from backup. This will
>> "park" the problem of cluster health for later fixing.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>>
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Frank Schilder
Sent: 18 September 2020 15:38:51
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael,
> I disagree with the statement that trying to recover health by deleting
> data is a contradiction. In some cases (such as mine), the data in ceph
> is backed up in another location (eg tape library). Restoring a few
> files from tape is a simple and cheap operation that takes a m
Hi Frank,
On 9/18/20 2:50 AM, Frank Schilder wrote:
Dear Michael,
firstly, I'm a bit confused why you started deleting data. The objects were
unfound, but still there. That's a small issue. Now the data might be gone and
that's a real issue.
Interval:
Anyone reading this: I have seen many threads where ceph admins s
Hi Frank,
Yes, it does sound similar to your ticket.
I've tried a few things to restore the failed files:
* Locate a missing object with 'ceph pg $pgid list_unfound'
* Convert the hex oid to a decimal inode number
* Identify the affected file with 'find /ceph -inum $inode'
At this point, I
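In command form, those three steps are roughly as follows (PG 7.39d is one of
the PGs mentioned in this thread; the object name is illustrative):

  # 1. list the unfound objects in the PG
  ceph pg 7.39d list_unfound
  # 2. the object name encodes the file's inode number in hex,
  #    e.g. 10000000abc.00000000 -> convert the hex part to decimal
  printf '%d\n' 0x10000000abc
  # 3. find the file that owns that inode
  find /ceph -inum 1099511630524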
Sounds similar to this one: https://tracker.ceph.com/issues/46847
If you have or can reconstruct the crush map from before adding the OSDs, you
might be able to discover everything with the temporary reversal of the crush
map method.
Not sure if there is another method, I never got a reply to m
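The crush map round trip itself would look roughly like this; editing the
decompiled map back to its pre-expansion state is the manual part:

  # dump and decompile the current crush map
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt to reflect the state before the new OSDs were added,
  # then recompile and inject it temporarily
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new
  # once recovery has found the objects, restore the original map
  ceph osd setcrushmap -i crush.bin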