[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-09 Thread Michael Thomas
er health for later fixing. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 18 September 2020 15:38:51 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Dear Micha

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-14 Thread Michael Thomas
uble-shooting guide? I suspect that the > removal has left something in an inconsistent state that requires manual > clean up for recovery to proceed. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > _

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-14 Thread Michael Thomas
eleted snapshots in one of the copies. I used > ceph-objectstore-tool to remove the "wrong" part. Did you check your OSD > logs? Do the OSDs go down with an obscure stacktrace (and maybe they are > restarted by systemd ...) > > rgds, > > j. > > > > On

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-15 Thread Michael Thomas
t the incomplete PG resolved with the above, but it will move some issues out of the way before proceeding. Best regards, ========= Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 14 October 2020 20:52:10 To: Andreas

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-19 Thread Michael Thomas
trative, like peering attempts. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 16 October 2020 15:09:20 To: Michael Thomas; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD cras

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-19 Thread Michael Thomas
lly see why the missing OSDs are not assigned to the two PGs 1.0 and 7.39d. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 16 October 2020 15:41:29 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple O

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-20 Thread Michael Thomas
On 10/20/20 1:18 PM, Frank Schilder wrote: Dear Michael, Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? I meant here with crush rule replicated_host_nvme. Sorry, forgot. Seems to have worked fine: https://pastebin.com/PFgDE4J1 Yes, the OSD was st
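
As an aside (not from the thread), a one-PG test pool like the one discussed can be created and checked with the standard CLI; the pool name below is a placeholder, replicated_host_nvme is the rule named in the thread:

  ceph osd pool create test-nvme 1 1 replicated replicated_host_nvme   # single-PG pool using the nvme host rule
  ceph pg ls-by-pool test-nvme                                         # check whether the PG gets an up/acting OSD set
  ceph osd pool rm test-nvme test-nvme --yes-i-really-really-mean-it   # clean up afterwards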

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-21 Thread Michael Thomas
w defunct) has been blacklisted. I'll check back later to see if the slow OPS get cleared from 'ceph status'. Regards, --Mike ________ From: Michael Thomas Sent: 20 October 2020 23:48:36 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re:
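
A hedged aside: the blacklist entries for the defunct client can be inspected from the CLI (Octopus still uses the "blacklist" spelling); the address is a placeholder:

  ceph osd blacklist ls           # list current client blacklist entries
  ceph osd blacklist rm <addr>    # optional: drop an entry early instead of waiting for it to expire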

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Michael Thomas
I find time today to look at the incomplete PG. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________ From: Michael Thomas Sent: 21 October 2020 22:58:47 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Michael Thomas
rds, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 22 October 2020 09:32:07 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Sounds good. Did you re-create the pool again?

[ceph-users] safest way to re-crush a pool

2020-11-10 Thread Michael Thomas
I'm setting up a radosgw for my ceph Octopus cluster. As soon as I started the radosgw service, I notice that it created a handful of new pools. These pools were assigned the 'replicated_data' crush rule automatically. I have a mixed hdd/ssd/nvme cluster, and this 'replicated_data' crush ru
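
An aside, not from the thread: pointing an existing pool at a different crush rule is a single pool setting; pool and rule names are placeholders, and changing the rule triggers data movement to the new placement.

  ceph osd pool get <pool> crush_rule           # show the rule currently assigned
  ceph osd pool set <pool> crush_rule <rule>    # switch rules; PGs remap to match the new rule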

[ceph-users] Re: safest way to re-crush a pool

2020-11-10 Thread Michael Thomas
dhils...@performair.com www.PerformAir.com -Original Message- From: Michael Thomas [mailto:w...@caltech.edu] Sent: Tuesday, November 10, 2020 1:32 PM To: ceph-users@ceph.io Subject: [ceph-users] safest way to re-crush a pool I'm setting up a radosgw for my ceph Octopus cluster. As soon as

[ceph-users] Re: multiple OSD crash, unfound objects

2020-11-22 Thread Michael Thomas
On 10/23/20 3:07 AM, Frank Schilder wrote: Hi Michael. I still don't see any traffic to the pool, though I'm also unsure how much traffic is to be expected. Probably not much. If ceph df shows that the pool contains some objects, I guess that's sorted. That osdmaptool crashes indicates tha
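
For reference, a minimal osdmaptool invocation along the lines discussed (the pool id is a placeholder):

  ceph osd getmap -o osdmap.bin                      # export the current osdmap
  osdmaptool osdmap.bin --test-map-pgs --pool <id>   # test PG-to-OSD mappings for one pool offline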

[ceph-users] Re: multiple OSD crash, unfound objects

2020-11-22 Thread Michael Thomas
one and the broken PG(s) might get deleted cleanly. Then you still have a surplus pool, but at least all PGs are clean. I hope one of these will work. Please post your experience here. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___

[ceph-users] Prometheus monitoring

2020-11-24 Thread Michael Thomas
I am gathering prometheus metrics from my (unhealthy) Octopus (15.2.4) cluster and notice a discrepancy (or misunderstanding) with the ceph dashboard. In the dashboard, and with ceph -s, it reports 807 million objects: pgs: 169747/807333195 objects degraded (0.021%)
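
A hedged sketch for cross-checking the dashboard numbers against the mgr prometheus module; the Prometheus address and the metric name are assumptions, not taken from the thread:

  # query Prometheus for the per-pool object counts exported by the mgr module
  curl -sG 'http://localhost:9090/api/v1/query' \
       --data-urlencode 'query=sum(ceph_pool_objects)'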

[ceph-users] Re: Whether removing device_health_metrics pool is ok or not

2020-12-03 Thread Michael Thomas
On 12/3/20 6:47 PM, Satoru Takeuchi wrote: Hi, Could you tell me whether it's ok to remove device_health_metrics pool after disabling device monitoring feature? I don't use device monitoring feature because I capture hardware information from other way. However, after disabling this feature, de
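
A hedged sketch of the mechanics being asked about (not a recommendation from the thread); pool deletion must be explicitly allowed first:

  ceph device monitoring off                       # disable the device health monitoring feature
  ceph config set mon mon_allow_pool_delete true   # temporarily allow pool deletion
  ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it
  ceph config set mon mon_allow_pool_delete false  # re-protect pools afterwards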

[ceph-users] Re: multiple OSD crash, unfound objects

2020-12-15 Thread Michael Thomas
rank Schilder wrote: Dear Michael, yes, your plan will work if the temporary space requirement can be addressed. Good luck! Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 22 November 2

[ceph-users] Removing secondary data pool from mds

2020-12-21 Thread Michael Thomas
I have a cephfs secondary (non-root) data pool with unfound and degraded objects that I have not been able to recover[1]. I created an additional data pool and used 'setfattr -n ceph.dir.layout.pool' and a very long rsync to move the files off of the degraded pool and onto the new pool. This
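
A minimal sketch of that migration pattern, with hypothetical filesystem, pool, and path names; the layout attribute only affects newly written files, which is why the rsync pass is needed:

  ceph fs add_data_pool cephfs cephfs_data_new                          # make the new pool usable by the filesystem
  setfattr -n ceph.dir.layout.pool -v cephfs_data_new /mnt/cephfs/data  # new files under this dir go to the new pool
  rsync -a /mnt/cephfs/data/ /mnt/cephfs/data.migrated/                 # rewrite existing files so they land in the new pool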

[ceph-users] Re: Removing secondary data pool from mds

2021-02-12 Thread Michael Thomas
1 shard per object and ordinary recovery could fix it. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Michael Thomas Sent: 21 December 2020 23:12:09 To: ceph-users@ceph.io Subject: [ceph-users] Removing secon

[ceph-users] Re: Removing secondary data pool from mds

2021-03-12 Thread Michael Thomas
gh to find out where such an object count comes from. However, ceph df is known to be imperfect. Maybe it's just an accounting bug there. I think there were a couple of cases where people deleted all objects in a pool and ceph df would still report non-zero usage. Best regards, = F

[ceph-users] Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?

2021-04-08 Thread Michael Thomas
Hi Joshua, I have had a similar issue three different times on one of my cephfs pools (15.2.10). The first time this happened I had lost some OSDs. In all cases I ended up with degraded PGs with unfound objects that could not be recovered. Here's how I recovered from the situation. Note th
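
For context, the last-resort commands usually involved in this kind of recovery (a hedged sketch, not the exact procedure from the message; the pgid is a placeholder, and "delete" accepts permanent loss of those objects):

  ceph pg <pgid> list_unfound               # inspect which objects are unfound
  ceph pg <pgid> mark_unfound_lost revert   # roll back to a previous version, where supported
  ceph pg <pgid> mark_unfound_lost delete   # otherwise forget the objects entirely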

[ceph-users] Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?

2021-04-09 Thread Michael Thomas
this up under the assumption that the data is lost? ~Joshua Joshua West President 403-456-0072 CAYK.ca On Thu, Apr 8, 2021 at 6:15 PM Michael Thomas wrote: Hi Joshua, I have had a similar issue three different times on one of my cephfs pools (15.2.10). The first time this happened I had lost

[ceph-users] Re: HEALTH_WARN - Recovery Stuck?

2021-04-12 Thread Michael Thomas
I recently had a similar issue when reducing the number of PGs on a pool. A few OSDs became backfillfull even though there was enough space; the OSDs were just not balanced well. To fix, I reweighted the most-full OSDs: ceph osd reweight-by-utilization 120 After it finished (~1 hour), I had f
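
A hedged sketch of the same workflow, including the dry-run variant:

  ceph osd df                                 # check per-OSD utilisation spread
  ceph osd test-reweight-by-utilization 120   # dry run: show what would be reweighted
  ceph osd reweight-by-utilization 120        # apply, as done above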

[ceph-users] cephfs auditing

2021-05-27 Thread Michael Thomas
Is there a way to log or track which cephfs files are being accessed? This would help us in planning where to place certain datasets based on popularity, e.g. on an EC HDD pool or a replicated SSD pool. I know I can run inotify on the ceph clients, but I was hoping that the MDS would have a way t

[ceph-users] ceph-objectstore-tool core dump

2021-10-03 Thread Michael Thomas
I recently started getting inconsistent PGs in my Octopus (15.2.14) ceph cluster. I was able to determine that they are all coming from the same OSD: osd.143. This host recently suffered from an unplanned power loss, so I'm not surprised that there may be some corruption. This PG is part of
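
For reference, the usual commands for locating and repairing inconsistencies (a sketch; pool name and pgid are placeholders):

  rados list-inconsistent-pg <pool>                        # which PGs in the pool are inconsistent
  rados list-inconsistent-obj <pgid> --format=json-pretty  # which objects/shards within a PG
  ceph pg repair <pgid>                                    # ask the primary to repair the PG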

[ceph-users] Re: ceph-objectstore-tool core dump

2021-10-03 Thread Michael Thomas
On 10/3/21 12:08, 胡 玮文 wrote: On 2021-10-04 at 00:53, Michael Thomas wrote: I recently started getting inconsistent PGs in my Octopus (15.2.14) ceph cluster. I was able to determine that they are all coming from the same OSD: osd.143. This host recently suffered from an unplanned power loss, so

[ceph-users] Re: [External Email] Re: ceph-objectstore-tool core dump

2021-10-04 Thread Michael Thomas
On 10/4/21 11:57 AM, Dave Hall wrote: > I also had a delay on the start of the repair scrub when I was dealing with > this issue. I ultimately increased the number of simultaneous scrubs, but > I think you could also temporarily disable scrubs and then re-issue the 'pg > repair'. (But I'm not one
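
A hedged sketch of the two options Dave describes (values and ordering are illustrative only):

  ceph config set osd osd_max_scrubs 2   # allow more concurrent scrubs so the repair is scheduled sooner
  # or pause routine scrubbing while the repair runs:
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph pg repair <pgid>
  ceph osd unset noscrub                 # re-enable scrubbing afterwards
  ceph osd unset nodeep-scrub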

[ceph-users] Invalid crush class

2022-10-08 Thread Michael Thomas
In 15.2.7, how can I remove an invalid crush class? I'm surprised that I was able to create it in the first place: [root@ceph1 bin]# ceph osd crush class ls [ "ssd", "JBOD.hdd", "nvme", "hdd" ] [root@ceph1 bin]# ceph osd crush class ls-osd JBOD.hdd Invalid command: invalid cha
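
If the CLI keeps rejecting the class name, one hedged workaround (not confirmed by the thread) is to edit the class out of a decompiled crush map:

  ceph osd getcrushmap -o crushmap.bin        # export the binary crush map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile to text; remove the 'JBOD.hdd' class entries by hand
  crushtool -c crushmap.txt -o crushmap.new   # recompile
  ceph osd setcrushmap -i crushmap.new        # inject the cleaned map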

[ceph-users] managed block storage stopped working

2022-01-07 Thread Michael Thomas
...sorta. I have an ovirt-4.4.2 system installed a couple of years ago and set up managed block storage using ceph Octopus[1]. This has been working well since it was originally set up. In late November we had some network issues on one of our ovirt hosts, as well as a separate network issue tha

[ceph-users] Re: managed block storage stopped working

2022-02-09 Thread Michael Thomas
On 1/7/22 16:49, Marc wrote: Where else can I look to find out why the managed block storage isn't accessible anymore? ceph -s ? I guess it is not showing any errors, and there is probably nothing with ceph, you can do an rbdmap and see if you can just map an image. Then try mapping an im
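
A minimal sketch of the mapping test suggested above; pool, image, and client names are placeholders:

  rbd ls <pool>                          # confirm the images are visible
  rbd map <pool>/<image> --id <client>   # try mapping one image by hand
  rbd unmap <pool>/<image>               # clean up after the test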

[ceph-users] Re: Rebalance after draining - why?

2022-05-28 Thread Michael Thomas
Try this: ceph osd crush reweight osd.XX 0 --Mike On 5/28/22 15:02, Nico Schottelius wrote: Good evening dear fellow Ceph'ers, when removing OSDs from a cluster, we sometimes use ceph osd reweight osd.XX 0 and wait until the OSD's content has been redistributed. However, when then fin
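
For completeness, a hedged sketch of the drain-and-remove sequence built around that command (osd.XX as in the thread):

  ceph osd crush reweight osd.XX 0   # drain via crush weight, so later removal causes no second rebalance
  # once the OSD is empty:
  ceph osd out osd.XX
  ceph osd purge osd.XX --yes-i-really-mean-it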

[ceph-users] pg stuck in unknown state

2020-08-10 Thread Michael Thomas
On my relatively new Octopus cluster, I have one PG that has been perpetually stuck in the 'unknown' state. It appears to belong to the device_health_metrics pool, which was created automatically by the mgr daemon(?). The OSDs that the PG maps to are all online and serving other PGs. But wh
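
The usual first diagnostics for a PG stuck like this (a sketch; the pgid is a placeholder):

  ceph pg ls-by-pool device_health_metrics   # find the pgid and its reported state
  ceph pg map <pgid>                         # which OSDs it should map to
  ceph pg <pgid> query                       # detailed peering state, if the PG responds at all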

[ceph-users] Re: pg stuck in unknown state

2020-08-11 Thread Michael Thomas
On 8/11/20 2:52 AM, Wido den Hollander wrote: On 11/08/2020 00:40, Michael Thomas wrote: On my relatively new Octopus cluster, I have one PG that has been perpetually stuck in the 'unknown' state.  It appears to belong to the device_health_metrics pool, which was created automatica

[ceph-users] Re: pg stuck in unknown state

2020-08-21 Thread Michael Thomas
On 8/11/20 8:35 AM, Michael Thomas wrote: On 8/11/20 2:52 AM, Wido den Hollander wrote: On 11/08/2020 00:40, Michael Thomas wrote: On my relatively new Octopus cluster, I have one PG that has been perpetually stuck in the 'unknown' state.  It appears to belong to the device_heal

[ceph-users] multiple OSD crash, unfound objects

2020-09-15 Thread Michael Thomas
Over the weekend I had multiple OSD servers in my Octopus cluster (15.2.4) crash and reboot at nearly the same time. The OSDs are part of an erasure coded pool. At the time the cluster had been busy with a long-running (~week) remapping of a large number of PGs after I incrementally added mor
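
For readers following along, the standard way to enumerate the damage described here (a sketch; the pgid is a placeholder):

  ceph health detail     # which PGs are degraded and report unfound objects
  ceph pg <pgid> query   # peering, recovery, and missing-object state for one PG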

[ceph-users] Re: multiple OSD crash, unfound objects

2020-09-17 Thread Michael Thomas
there is another method, I never got a reply to my question in the tracker. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 16 September 2020 01:27:19 To: ceph-users@ceph.io Subject: [ceph-user

[ceph-users] Re: multiple OSD crash, unfound objects

2020-09-18 Thread Michael Thomas
weekend so that hopefully the deep scrubs can catch up and possibly locate any missing objects. --Mike Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 17 September 2020 22:27:47 To: Frank Schilder;