[ceph-users] Re: RBD Image Returning 'Unknown Filesystem LVM2_member' On Mount - Help Please

2024-02-04 Thread Cedric
Hello,

Data on a volume should be the same regardless of how it is being
accessed.

I would think the volume was previously initialized with an LVM layer. Does
"lvs" show any logical volume on the system?

On Sun, Feb 4, 2024, 08:56 duluxoz  wrote:

> Hi All,
>
> All of this is using the latest version of RL and Ceph Reef
>
> I've got an existing RBD Image (with data on it - not "critical" as I've
> got a back up, but its rather large so I was hoping to avoid the restore
> scenario).
>
> The RBD Image used to be served out via a (Ceph) iSCSI Gateway, but we
> are now looking to use the plain old kernel module.
>
> The RBD Image has been RBD Mapped to the client's /dev/rbd0 location.
>
> So now I'm trying a straight `mount /dev/rbd0 /mount/old_image/` as a test
>
> What I'm getting back is `mount: /mount/old_image/: unknown filesystem
> type 'LVM2_member'.`
>
> All my Google Foo is telling me that to solve this issue I need to
> reformat the image with a new file system - which would mean "losing"
> the data.
>
> So my question is: How can I get to this data using rbd kernel modules
> (the iSCSI Gateway is no longer available, so not an option), or am I
> stuck with the restore option?
>
> Or is there something I'm missing (which would not surprise me in the
> least)?  :-)
>
> Thanks in advance (as always, you guys and gals are really, really helpful)
>
> Cheers
>
>
> Dulux-Oz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-19 Thread Cedric
Hello,

Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
encountered an issue with a cache pool becoming completely stuck;
relevant messages below:

pg xx.x has invalid (post-split) stats; must scrub before tier agent
can activate

In the OSD logs, scrubs keep starting in a loop without ever succeeding for all
PGs of this pool.

What we have already tried so far, without luck (rough command forms are sketched below the list):

- shutdown / restart OSD
- rebalance pg between OSD
- raise the memory on OSD
- repeer PG
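
For reference, the commands behind those steps were roughly the following (a
sketch; OSD ids, weights and pg ids are placeholders):

    systemctl restart ceph-osd@<id>                          # restart an OSD
    ceph osd reweight <id> <weight>                          # move PGs between OSDs
    ceph config set osd.<id> osd_memory_target 8589934592    # raise the OSD memory target (bytes)
    ceph pg repeer <pgid>                                    # re-peer a PG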

Any idea what is causing this? Any help will be greatly appreciated.

Thanks

Cédric
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-21 Thread Cedric
Update: we have run fsck and re-shard on all BlueStore volumes; it seems sharding 
had not been applied.

Unfortunately, scrubs and deep-scrubs are still stuck on the PGs of the pool that is 
suffering the issue, while other PGs scrub fine.

The next step will be to remove the cache tier as suggested, but that is not 
possible yet, as the PGs need to be scrubbed before the tier agent can be 
activated.

As we are struggling to make this cluster work again, any help would be 
greatly appreciated.
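
For reference, the per-OSD procedure we used was roughly the following (a sketch;
the OSD id is a placeholder, the path assumes a non-containerized OSD, and the
sharding spec should be the default one documented for your release):

    systemctl stop ceph-osd@<id>
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>
    ceph-bluestore-tool reshard --path /var/lib/ceph/osd/ceph-<id> --sharding "<sharding-spec>"
    systemctl start ceph-osd@<id>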

Cédric

> On 20 Feb 2024, at 20:22, Cedric  wrote:
> 
> Thanks Eugen, sorry about the missed reply to all.
> 
> The reason we still have the cache tier is because we were not able to flush 
> all dirty entry to remove it (as per the procedure), so the cluster as been 
> migrated from HDD/SSD to NVME a while ago but tiering remains, unfortunately.
> 
> So actually we are trying to understand the root cause
> 
> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block  wrote:
>> 
>> Please don't drop the list from your response.
>> 
>> The first question coming to mind is, why do you have a cache-tier if 
>> all your pools are on nvme decices anyway? I don't see any benefit here.
>> Did you try the suggested workaround and disable the cache-tier?
>> 
>> Zitat von Cedric :
>> 
>>> Thanks Eugen, see attached infos.
>>> 
>>> Some more details:
>>> 
>>> - commands that actually hangs: ceph balancer status ; rbd -p vms ls ;
>>> rados -p vms_cache cache-flush-evict-all
>>> - all scrub running on vms_caches pgs are stall / start in a loop
>>> without actually doing anything
>>> - all io are 0 both from ceph status or iostat on nodes
>>> 
>>> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> some more details would be helpful, for example what's the pool size
>>>> of the cache pool? Did you issue a PG split before or during the
>>>> upgrade? This thread [1] deals with the same problem, the described
>>>> workaround was to set hit_set_count to 0 and disable the cache layer
>>>> until that is resolved. Afterwards you could enable the cache layer
>>>> again. But keep in mind that the code for cache tier is entirely
>>>> removed in Reef (IIRC).
>>>> 
>>>> Regards,
>>>> Eugen
>>>> 
>>>> [1]
>>>> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd
>>>> 
>>>> Zitat von Cedric :
>>>> 
>>>>> Hello,
>>>>> 
>>>>> Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
>>>>> encounter an issue with a cache pool becoming completely stuck,
>>>>> relevant messages below:
>>>>> 
>>>>> pg xx.x has invalid (post-split) stats; must scrub before tier agent
>>>>> can activate
>>>>> 
>>>>> In OSD logs, scrubs are starting in a loop without succeeding for all
>>>>> pg of this pool.
>>>>> 
>>>>> What we already tried without luck so far:
>>>>> 
>>>>> - shutdown / restart OSD
>>>>> - rebalance pg between OSD
>>>>> - raise the memory on OSD
>>>>> - repeer PG
>>>>> 
>>>>> Any idea what is causing this? any help will be greatly appreciated
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Cédric
>>>>> ___
>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>> 
>>>> 
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> 
>> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-22 Thread Cedric
Thanks Eugen for the suggestion; yes, we have tried that, and also repeered the
concerned PGs, still the same issue.

Looking at the code, it seems the post-split message is triggered when the PG
has "stats_invalid": true; here is the result of a query:

"stats_invalid": true,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,

I am also providing again the cluster information that was lost in the previous
missed reply-all. Don't hesitate to ask for more if needed; I would be
glad to provide it.

Cédric


On Thu, Feb 22, 2024 at 11:04 AM Eugen Block  wrote:
>
> Hm, I wonder if setting (and unsetting after a while) noscrub and
> nodeep-scrub has any effect. Have you tried that?
>
> Zitat von Cedric :
>
> > Update: we have run fsck and re-shard on all bluestore volume, seems
> > sharding were not applied.
> >
> > Unfortunately scrubs and deep-scrubs are still stuck on PGs of the
> > pool that is suffering the issue, but other PGs scrubs well.
> >
> > The next step will be to remove the cache tier as suggested, but its
> > not available yet as PGs needs to be scrubbed in order for the cache
> > tier can be activated.
> >
> > As we are struggling to make this cluster works again, any help
> > would be greatly appreciated.
> >
> > Cédric
> >
> >> On 20 Feb 2024, at 20:22, Cedric  wrote:
> >>
> >> Thanks Eugen, sorry about the missed reply to all.
> >>
> >> The reason we still have the cache tier is because we were not able
> >> to flush all dirty entry to remove it (as per the procedure), so
> >> the cluster as been migrated from HDD/SSD to NVME a while ago but
> >> tiering remains, unfortunately.
> >>
> >> So actually we are trying to understand the root cause
> >>
> >> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block  wrote:
> >>>
> >>> Please don't drop the list from your response.
> >>>
> >>> The first question coming to mind is, why do you have a cache-tier if
> >>> all your pools are on nvme decices anyway? I don't see any benefit here.
> >>> Did you try the suggested workaround and disable the cache-tier?
> >>>
> >>> Zitat von Cedric :
> >>>
> >>>> Thanks Eugen, see attached infos.
> >>>>
> >>>> Some more details:
> >>>>
> >>>> - commands that actually hangs: ceph balancer status ; rbd -p vms ls ;
> >>>> rados -p vms_cache cache-flush-evict-all
> >>>> - all scrub running on vms_caches pgs are stall / start in a loop
> >>>> without actually doing anything
> >>>> - all io are 0 both from ceph status or iostat on nodes
> >>>>
> >>>> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block  wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> some more details would be helpful, for example what's the pool size
> >>>>> of the cache pool? Did you issue a PG split before or during the
> >>>>> upgrade? This thread [1] deals with the same problem, the described
> >>>>> workaround was to set hit_set_count to 0 and disable the cache layer
> >>>>> until that is resolved. Afterwards you could enable the cache layer
> >>>>> again. But keep in mind that the code for cache tier is entirely
> >>>>> removed in Reef (IIRC).
> >>>>>
> >>>>> Regards,
> >>>>> Eugen
> >>>>>
> >>>>> [1]
> >>>>> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-adding-a-cache-osd
> >>>>>
> >>>>> Zitat von Cedric :
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
> >>>>>> encounter an issue with a cache pool becoming completely stuck,
> >>>>>> relevant messages below:
> >>>>>>
> >>>>>> pg xx.x has invalid (post-split) stats; must scrub before tier agent
> >>>>>> can activate
> >>>>>>
> >>>>>> In OSD logs, scrubs are starting in a loop without

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-22 Thread Cedric
Yes, osd_scrub_invalid_stats is set to true.

We are thinking about using the "ceph pg <pgid> mark_unfound_lost revert"
action, but we wonder whether there is a risk of data loss.

On Thu, Feb 22, 2024 at 11:50 AM Eugen Block  wrote:
>
> I found a config to force scrub invalid PGs, what is your current
> setting on that?
>
> ceph config get osd osd_scrub_invalid_stats
> true
>
> The config reference states:
>
> > Forces extra scrub to fix stats marked as invalid.
>
> But the default seems to be true, so I'd expect it's true in your case
> as well?
>
> Zitat von Cedric :
>
> > Thanks Eugen for the suggestion, yes we have tried, also repeering
> > concerned PGs, still the same issue.
> >
> > Looking at the code it seems the split-mode message is triggered when
> > the PG as ""stats_invalid": true,", here is the result of a query:
> >
> > "stats_invalid": true,
> > "dirty_stats_invalid": false,
> > "omap_stats_invalid": false,
> > "hitset_stats_invalid": false,
> > "hitset_bytes_stats_invalid": false,
> > "pin_stats_invalid": false,
> > "manifest_stats_invalid": false,
> >
> > I also provide again cluster informations that was lost in previous
> > missed reply all. Don't hesitate to ask more if needed I would be
> > glade to provide them.
> >
> > Cédric
> >
> >
> > On Thu, Feb 22, 2024 at 11:04 AM Eugen Block  wrote:
> >>
> >> Hm, I wonder if setting (and unsetting after a while) noscrub and
> >> nodeep-scrub has any effect. Have you tried that?
> >>
> >> Zitat von Cedric :
> >>
> >> > Update: we have run fsck and re-shard on all bluestore volume, seems
> >> > sharding were not applied.
> >> >
> >> > Unfortunately scrubs and deep-scrubs are still stuck on PGs of the
> >> > pool that is suffering the issue, but other PGs scrubs well.
> >> >
> >> > The next step will be to remove the cache tier as suggested, but its
> >> > not available yet as PGs needs to be scrubbed in order for the cache
> >> > tier can be activated.
> >> >
> >> > As we are struggling to make this cluster works again, any help
> >> > would be greatly appreciated.
> >> >
> >> > Cédric
> >> >
> >> >> On 20 Feb 2024, at 20:22, Cedric  wrote:
> >> >>
> >> >> Thanks Eugen, sorry about the missed reply to all.
> >> >>
> >> >> The reason we still have the cache tier is because we were not able
> >> >> to flush all dirty entry to remove it (as per the procedure), so
> >> >> the cluster as been migrated from HDD/SSD to NVME a while ago but
> >> >> tiering remains, unfortunately.
> >> >>
> >> >> So actually we are trying to understand the root cause
> >> >>
> >> >> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block  wrote:
> >> >>>
> >> >>> Please don't drop the list from your response.
> >> >>>
> >> >>> The first question coming to mind is, why do you have a cache-tier if
> >> >>> all your pools are on nvme decices anyway? I don't see any benefit 
> >> >>> here.
> >> >>> Did you try the suggested workaround and disable the cache-tier?
> >> >>>
> >> >>> Zitat von Cedric :
> >> >>>
> >> >>>> Thanks Eugen, see attached infos.
> >> >>>>
> >> >>>> Some more details:
> >> >>>>
> >> >>>> - commands that actually hangs: ceph balancer status ; rbd -p vms ls ;
> >> >>>> rados -p vms_cache cache-flush-evict-all
> >> >>>> - all scrub running on vms_caches pgs are stall / start in a loop
> >> >>>> without actually doing anything
> >> >>>> - all io are 0 both from ceph status or iostat on nodes
> >> >>>>
> >> >>>> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block  wrote:
> >> >>>>>
> >> >>>>> Hi,
> >> >>>>>
> >> >>>>> some more details would be helpful, for example what's the pool size
> >> >>>>> of the cache pool? Did you issue a PG split before or duri

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-22 Thread Cedric
On Thu, Feb 22, 2024 at 12:37 PM Eugen Block  wrote:
> You haven't told yet if you changed the hit_set_count to 0.

Not yet, we will give it a try ASAP
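
For reference, what we plan to run is roughly the following (a sketch, assuming
the cache pool is named vms_cache as in the earlier outputs):

    ceph osd pool get vms_cache hit_set_count     # record the current value first
    ceph osd pool set vms_cache hit_set_count 0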

> Have you already tried to set the primary PG out and wait for the
> backfill to finish?

No, we will try that as well

> And another question, are all services running pacific already and on
> the same version (ceph versions)?

Yes, all daemons run 16.2.13

>
> Zitat von Cedric :
>
> > Yes the osd_scrub_invalid_stats is set to true.
> >
> > We are thinking about the use of "ceph pg_mark_unfound_lost revert"
> > action, but we wonder if there is a risk of data loss.
> >
> > On Thu, Feb 22, 2024 at 11:50 AM Eugen Block  wrote:
> >>
> >> I found a config to force scrub invalid PGs, what is your current
> >> setting on that?
> >>
> >> ceph config get osd osd_scrub_invalid_stats
> >> true
> >>
> >> The config reference states:
> >>
> >> > Forces extra scrub to fix stats marked as invalid.
> >>
> >> But the default seems to be true, so I'd expect it's true in your case
> >> as well?
> >>
> >> Zitat von Cedric :
> >>
> >> > Thanks Eugen for the suggestion, yes we have tried, also repeering
> >> > concerned PGs, still the same issue.
> >> >
> >> > Looking at the code it seems the split-mode message is triggered when
> >> > the PG as ""stats_invalid": true,", here is the result of a query:
> >> >
> >> > "stats_invalid": true,
> >> > "dirty_stats_invalid": false,
> >> > "omap_stats_invalid": false,
> >> > "hitset_stats_invalid": false,
> >> > "hitset_bytes_stats_invalid": false,
> >> > "pin_stats_invalid": false,
> >> > "manifest_stats_invalid": false,
> >> >
> >> > I also provide again cluster informations that was lost in previous
> >> > missed reply all. Don't hesitate to ask more if needed I would be
> >> > glade to provide them.
> >> >
> >> > Cédric
> >> >
> >> >
> >> > On Thu, Feb 22, 2024 at 11:04 AM Eugen Block  wrote:
> >> >>
> >> >> Hm, I wonder if setting (and unsetting after a while) noscrub and
> >> >> nodeep-scrub has any effect. Have you tried that?
> >> >>
> >> >> Zitat von Cedric :
> >> >>
> >> >> > Update: we have run fsck and re-shard on all bluestore volume, seems
> >> >> > sharding were not applied.
> >> >> >
> >> >> > Unfortunately scrubs and deep-scrubs are still stuck on PGs of the
> >> >> > pool that is suffering the issue, but other PGs scrubs well.
> >> >> >
> >> >> > The next step will be to remove the cache tier as suggested, but its
> >> >> > not available yet as PGs needs to be scrubbed in order for the cache
> >> >> > tier can be activated.
> >> >> >
> >> >> > As we are struggling to make this cluster works again, any help
> >> >> > would be greatly appreciated.
> >> >> >
> >> >> > Cédric
> >> >> >
> >> >> >> On 20 Feb 2024, at 20:22, Cedric  wrote:
> >> >> >>
> >> >> >> Thanks Eugen, sorry about the missed reply to all.
> >> >> >>
> >> >> >> The reason we still have the cache tier is because we were not able
> >> >> >> to flush all dirty entry to remove it (as per the procedure), so
> >> >> >> the cluster as been migrated from HDD/SSD to NVME a while ago but
> >> >> >> tiering remains, unfortunately.
> >> >> >>
> >> >> >> So actually we are trying to understand the root cause
> >> >> >>
> >> >> >> On Tue, Feb 20, 2024 at 1:43 PM Eugen Block  wrote:
> >> >> >>>
> >> >> >>> Please don't drop the list from your response.
> >> >> >>>
> >> >> >>> The first question coming to mind is, why do you have a cache-tier 
> >> >> >>> if
> >> >> >>> all your pools are on nvme decices anyway? I don't see any
> >>

[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-28 Thread Cedric
Hello,

Sorry for the late reply. Yes, we finally found a solution, which was to split the 
cache pool off onto dedicated OSDs. That cleared the slow ops and allowed the 
cluster to serve clients again after 5 days of lockdown. Fortunately the majority 
of the VMs resumed fine, thanks to the virtio driver, which does not seem to have 
any timeout.

It seems that at least one of the main culprits was storing both the cold and the 
hot data pools on the same OSDs (which in the end makes total sense); maybe some of 
the other actions taken also had an effect. We are still trying to troubleshoot the 
root cause of the slow ops. Oddly, this was the 5th cluster we upgraded and all have 
almost the same configuration, but this one handles 5x more workload.

In the hope it could help.
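
For anyone in the same situation, the general shape of such a change could look
like the following (a sketch only; the OSD ids, the "cache" device class and the
rule name are placeholders, vms_cache is our cache pool):

    ceph osd crush rm-device-class osd.70 osd.71
    ceph osd crush set-device-class cache osd.70 osd.71
    ceph osd crush rule create-replicated cache_rule default host cache
    ceph osd pool set vms_cache crush_rule cache_rule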

Cédric

> On 26 Feb 2024, at 10:57, Eugen Block  wrote:
> 
> Hi,
> 
> thanks for the context. Was there any progress over the weekend? The hanging 
> commands seem to be MGR related, and there's only one in your cluster 
> according to your output. Can you deploy a second one manually, then adopt it 
> with cephadm? Can you add 'ceph versions' as well?
> 
> 
> Zitat von florian.le...@socgen.com:
> 
>> Hi,
>> A bit of history might help to understand why we have the cache tier.
>> 
>> We run openstack on top ceph since many years now (started with mimic, then 
>> an upgrade to nautilus (years 2 ago) and today and upgrade to pacific). At 
>> the beginning of the setup, we used to have a mix of hdd+ssd devices in HCI 
>> mode for openstack nova. After the upgrade to nautilus, we made a hardware 
>> refresh with brand new NVME devices. And transitionned from mixed devices to 
>> nvme. But we were never able to evict all the data from the vms_cache pools 
>> (even with being aggressive with the eviction; the last resort would have 
>> been to stop all the virtual instances, and that was not an option for our 
>> customers), so we decided to move on and set cache-mode proxy and serve data 
>> with only nvme since then. And it's been like this for 1 years and a half.
>> 
>> But today, after the upgrade, the situation is that we cannot query any 
>> stats (with ceph pg x.x query), rados query hangs, scrub hangs even though 
>> all PGs are "active+clean". and there is no client activity reported by the 
>> cluster. Recovery, and rebalance. Also some other commands hangs, ie: "ceph 
>> balancer status".
>> 
>> --
>> bash-4.2$ ceph -s
>>  cluster:
>>id: 
>>health: HEALTH_WARN
>>mon is allowing insecure global_id reclaim
>>noscrub,nodeep-scrub,nosnaptrim flag(s) set
>>18432 slow ops, oldest one blocked for 7626 sec, daemons 
>> [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]...
>>  have slow ops.
>> 
>>  services:
>>mon: 3 daemons, quorum mon1,mon2,mon3(age 36m)
>>mgr: bm9612541(active, since 39m)
>>osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
>> flags noscrub,nodeep-scrub,nosnaptrim
>> 
>>  data:
>>pools:   8 pools, 2409 pgs
>>objects: 14.64M objects, 92 TiB
>>usage:   276 TiB used, 143 TiB / 419 TiB avail
>>pgs: 2409 active+clean
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-03-01 Thread Cedric
Not really, as unfortunately the cache eviction fails for some RBD
objects that still have a "lock" on them. Right now we need to understand
why the eviction fails on these objects and find a solution to get
the cache eviction fully working. I will provide more information
later on.
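
A sketch of what we are looking at for one offending object (object and image
names are placeholders; vms/vms_cache are the pools from the earlier outputs):

    rados -p vms_cache listwatchers rbd_header.<image-id>
    rbd lock ls vms/<image-name>
    rbd status vms/<image-name>      # lists active watchers/clients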

If you have any pointers, well they will be greatly appreciated.

Cheers

On Wed, Feb 28, 2024 at 9:50 PM Eugen Block  wrote:
>
> Hi,
>
> great that you found a solution. Maybe that also helps to get rid of
> the cache-tier entirely?
>
> Zitat von Cedric :
>
> > Hello,
> >
> > Sorry for the late reply, so yes we finally find a solution, which
> > was to split apart the cache_pool on dedicated OSD. It had the
> > effect to clear off slow ops and allow the cluster to serves clients
> > again, after 5 days of lock down, hopefully the majority of VM
> > resume well, thanks to the virtio driver that does not seems to have
> > any timeout.
> >
> > It seems that at least one of the main culprit was to store both
> > cold and hot data pool on same OSD (which in the end totally make
> > sens), maybe some others actions engaged also had an effect, we are
> > still trying to trouble shoot the root of slow ops, weirdly it was
> > the 5th cluster upgraded and all as almost the same configuration,
> > but this one handles 5x time more workload.
> >
> > In the hope it could help.
> >
> > Cédric
> >
> >> On 26 Feb 2024, at 10:57, Eugen Block  wrote:
> >>
> >> Hi,
> >>
> >> thanks for the context. Was there any progress over the weekend?
> >> The hanging commands seem to be MGR related, and there's only one
> >> in your cluster according to your output. Can you deploy a second
> >> one manually, then adopt it with cephadm? Can you add 'ceph
> >> versions' as well?
> >>
> >>
> >> Zitat von florian.le...@socgen.com:
> >>
> >>> Hi,
> >>> A bit of history might help to understand why we have the cache tier.
> >>>
> >>> We run openstack on top ceph since many years now (started with
> >>> mimic, then an upgrade to nautilus (years 2 ago) and today and
> >>> upgrade to pacific). At the beginning of the setup, we used to
> >>> have a mix of hdd+ssd devices in HCI mode for openstack nova.
> >>> After the upgrade to nautilus, we made a hardware refresh with
> >>> brand new NVME devices. And transitionned from mixed devices to
> >>> nvme. But we were never able to evict all the data from the
> >>> vms_cache pools (even with being aggressive with the eviction; the
> >>> last resort would have been to stop all the virtual instances, and
> >>> that was not an option for our customers), so we decided to move
> >>> on and set cache-mode proxy and serve data with only nvme since
> >>> then. And it's been like this for 1 years and a half.
> >>>
> >>> But today, after the upgrade, the situation is that we cannot
> >>> query any stats (with ceph pg x.x query), rados query hangs, scrub
> >>> hangs even though all PGs are "active+clean". and there is no
> >>> client activity reported by the cluster. Recovery, and rebalance.
> >>> Also some other commands hangs, ie: "ceph balancer status".
> >>>
> >>> --
> >>> bash-4.2$ ceph -s
> >>>  cluster:
> >>>id: 
> >>>health: HEALTH_WARN
> >>>mon is allowing insecure global_id reclaim
> >>>noscrub,nodeep-scrub,nosnaptrim flag(s) set
> >>>18432 slow ops, oldest one blocked for 7626 sec,
> >>> daemons
> >>> [osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]...
> >>>  have slow
> >>> ops.
> >>>
> >>>  services:
> >>>mon: 3 daemons, quorum mon1,mon2,mon3(age 36m)
> >>>mgr: bm9612541(active, since 39m)
> >>>osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
> >>> flags noscrub,nodeep-scrub,nosnaptrim
> >>>
> >>>  data:
> >>>pools:   8 pools, 2409 pgs
> >>>objects: 14.64M objects, 92 TiB
> >>>usage:   276 TiB used, 143 TiB / 419 TiB avail
> >>>pgs: 2409 active+clean
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs not balanced

2024-03-04 Thread Cedric
Does the balancer have any enabled pools? "ceph balancer pool ls"

Actually, I am wondering whether the balancer does anything when no pools have
been added.
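
A minimal sketch of checking and adding a pool (the pool name is a placeholder):

    ceph balancer status
    ceph balancer pool ls
    ceph balancer pool add <pool-name>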



On Mon, Mar 4, 2024, 11:30 Ml Ml  wrote:

> Hello,
>
> i wonder why my autobalancer is not working here:
>
> root@ceph01:~# ceph -s
>   cluster:
> id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
> health: HEALTH_ERR
> 1 backfillfull osd(s)
> 1 full osd(s)
> 1 nearfull osd(s)
> 4 pool(s) full
>
> => osd.17 was too full (92% or something like that)
>
> root@ceph01:~# ceph osd df tree
> ID   CLASS  WEIGHT REWEIGHT  SIZE ... %USE  ... PGS TYPE NAME
> -25        209.50084        -  213 TiB  ... 69.56 ...    -  datacenter xxx-dc-root
> -19         84.59369        -   86 TiB  ... 56.97 ...    -      rack RZ1.Reihe4.R10
>  -3         35.49313        -   37 TiB  ... 57.88 ...    -          host ceph02
>   2   hdd    1.7        1.0    1.7 TiB  ... 58.77 ...   44              osd.2
>   3   hdd    1.0        1.0    2.7 TiB  ... 22.14 ...   25              osd.3
>   7   hdd    2.5        1.0    2.7 TiB  ... 58.84 ...   70              osd.7
>   9   hdd    9.5        1.0    9.5 TiB  ... 63.07 ...  268              osd.9
>  13   hdd    2.67029    1.0    2.7 TiB  ... 53.59 ...   65              osd.13
>  16   hdd    2.8        1.0    2.7 TiB  ... 59.35 ...   71              osd.16
>  19   hdd    1.7        1.0    1.7 TiB  ... 48.98 ...   37              osd.19
>  23   hdd    2.38419    1.0    2.4 TiB  ... 59.33 ...   64              osd.23
>  24   hdd    1.3        1.0    1.7 TiB  ... 51.23 ...   39              osd.24
>  28   hdd    3.63869    1.0    3.6 TiB  ... 64.17 ...  104              osd.28
>  31   hdd    2.7        1.0    2.7 TiB  ... 64.73 ...   76              osd.31
>  32   hdd    3.3        1.0    3.3 TiB  ... 67.28 ...  101              osd.32
>  -9         22.88817        -   23 TiB  ... 56.96 ...    -          host ceph06
>  35   hdd    7.15259    1.0    7.2 TiB  ... 55.71 ...  182              osd.35
>  36   hdd    5.24519    1.0    5.2 TiB  ... 53.75 ...  128              osd.36
>  45   hdd    5.24519    1.0    5.2 TiB  ... 60.91 ...  144              osd.45
>  48   hdd    5.24519    1.0    5.2 TiB  ... 57.94 ...  139              osd.48
> -17         26.21239        -   26 TiB  ... 55.67 ...    -          host ceph08
>  37   hdd    6.67569    1.0    6.7 TiB  ... 58.17 ...  174              osd.37
>  40   hdd    9.53670    1.0    9.5 TiB  ... 58.54 ...  250              osd.40
>  46   hdd    5.0        1.0    5.0 TiB  ... 52.39 ...  116              osd.46
>  47   hdd    5.0        1.0    5.0 TiB  ... 50.05 ...  112              osd.47
> -20         59.11053        -   60 TiB  ... 82.47 ...    -      rack RZ1.Reihe4.R9
>  -4         23.09996        -   24 TiB  ... 79.92 ...    -          host ceph03
>   5   hdd    1.7        0.75006  1.7 TiB  ... 87.24 ...   66              osd.5
>   6   hdd    1.7        0.44998  1.7 TiB  ... 47.30 ...   36              osd.6
>  10   hdd    2.7        0.85004  2.7 TiB  ... 83.23 ...  100              osd.10
>  15   hdd    2.7        0.75006  2.7 TiB  ... 74.26 ...   88              osd.15
>  17   hdd    0.5        0.85004  1.6 TiB  ... 91.44 ...   67              osd.17
>  20   hdd    2.0        0.85004  1.7 TiB  ... 88.41 ...   68              osd.20
>  21   hdd    2.7        0.75006  2.7 TiB  ... 77.25 ...   91              osd.21
>  25   hdd    1.7        0.90002  1.7 TiB  ... 78.31 ...   60              osd.25
>  26   hdd    2.7        1.0      2.7 TiB  ... 82.75 ...   99              osd.26
>  27   hdd    2.7        0.90002  2.7 TiB  ... 84.26 ...  101              osd.27
>  63   hdd    1.8        0.90002  1.7 TiB  ... 84.15 ...   65              osd.63
> -13         36.01057        -   36 TiB  ... 84.12 ...    -          host ceph05
>  11   hdd    7.15259    0.90002  7.2 TiB  ... 85.45 ...  273              osd.11
>  39   hdd    7.2        0.85004  7.2 TiB  ... 80.90 ...  257              osd.39
>  41   hdd    7.2        0.75006  7.2 TiB  ... 74.95 ...  239              osd.41
>  42   hdd    9.0        1.0      9.5 TiB  ... 92.00 ...  392              osd.42
>  43   hdd    5.45799    1.0      5.5 TiB  ... 84.84 ...  207              osd.43
> -21         65.79662        -   66 TiB  ... 74.29 ...    -      rack RZ3.Reihe3.R10
>  -2         28.49664        -   29 TiB  ... 74.79 ...    -          host ceph01
>   0   hdd    2.7        1.0      2.7 TiB  ... 73.82 ...   88              osd.0
>   1   hdd    3.63869    1.0      3.6 TiB  ... 73.47 ...  121              osd.1
>   4   hdd    2.7        1.0      2.7 TiB  ... 74.63 ...   89              osd.4
>   8   hdd    2.7        1.0      2.7 TiB  ... 77.10 ...   92              osd.8
>  12   hdd    2.7        1.0      2.7 TiB  ... 78.76 ...   94              osd.12
>  14   hdd    5.45799    1.0      5.5 TiB  ... 78.86 ...  193              osd.14
>  18   hdd    1.8        1.0      2.7 TiB  ... 63.79 ...   76              osd.18
>  22   hdd    1.7        1.0      1.7 TiB  ... 74.85 ...   57              osd.22
>  30   hdd    1.7        1.0      1.7 TiB  ... 76.34 ...   59              osd.30
>  64   hdd    3.2        1.0      3.3 TiB  ... 73.48 ...  110              osd.64
> -11         12.3            -   12 TiB  ... 73.40 ...    -          host ceph04
>  34   hdd    5.2        1.0      5.2 TiB  ... 72.81 ...  171              osd.34
>  44   hdd    7.2        1.0

[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-26 Thread Cedric
What about drive IOPS? HDDs top out at an average of around 150 IOPS; you can use iostat
-xmt to get these values (the last column also shows disk utilization, which is
very useful)
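
For example (5-second interval; the %util column on the far right shows how busy
each disk is):

    iostat -xmt 5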

On Sun, May 26, 2024, 09:37 Mazzystr  wrote:

> I can't explain the problem.  I have to recover three discs that are hdds.
> I figured on just replacing one to give the full recovery capacity of the
> cluster to that one disc.  I was never able to achieve a higher recovery
> rate than about 22 MiB/sec so I just added the other two discs.  Recovery
> bounced up to 129 MiB/sec for a while.  Then things settled at 50 MiB/sec.
> I kept tinkering to try to get back to 120 and now things are back to 23
> MiB/Sec again.  This is very irritating.
>
> Cpu usage is minimal in the single digit %.  Mem is right on target per
> target setting in ceph.conf.  Disc's and network appear to be 20%
> utilized.
>
> I'm not a normal Ceph user.  I don't care about client access at all.  The
> mclock assumptions are wrong for me.  I want my data to be replicated
> correctly as fast as possible.
>
> How do I open up the floodgates for maximum recovery performance?
>
>
>
>
> On Sat, May 25, 2024 at 8:13 PM Zakhar Kirpichenko 
> wrote:
>
> > Hi!
> >
> > Could you please elaborate what you meant by "adding another disc to the
> > recovery process"?
> >
> > /Z
> >
> >
> > On Sat, 25 May 2024, 22:49 Mazzystr,  wrote:
> >
> >> Well this was an interesting journey through the bowels of Ceph.  I have
> >> about 6 hours into tweaking every setting imaginable just to circle back
> >> to
> >> my basic configuration and 2G memory target per osd.  I was never able
> to
> >> exceed 22 Mib/Sec recovery time during that journey.
> >>
> >> I did end up fixing the issue and now I see the following -
> >>
> >>   io:
> >> recovery: 129 MiB/s, 33 objects/s
> >>
> >> This is normal for my measly cluster.  I like micro ceph clusters.  I
> have
> >> a lot of them. :)
> >>
> >> What was the fix?  Adding another disc to the recovery process!  I was
> >> recovering to one disc now I'm recovering to two.  I have three total
> that
> >> need to be recovered.  Somehow that one disc was completely swamped.  I
> >> was
> >> unable to see it in htop, atop, iostat.  Disc business was 6% max.
> >>
> >> My config is back to mclock scheduler, profile high_recovery_ops, and
> >> backfills of 256.
> >>
> >> Thank you everyone that took the time to review and contribute.
> Hopefully
> >> this provides some modern information for the next person that has slow
> >> recovery.
> >>
> >> /Chris C
> >>
> >>
> >>
> >>
> >>
> >> On Fri, May 24, 2024 at 1:43 PM Kai Stian Olstad 
> >> wrote:
> >>
> >> > On 24.05.2024 21:07, Mazzystr wrote:
> >> > > I did the obnoxious task of updating ceph.conf and restarting all my
> >> > > osds.
> >> > >
> >> > > ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get
> >> > > osd_op_queue
> >> > > {
> >> > > "osd_op_queue": "wpq"
> >> > > }
> >> > >
> >> > > I have some spare memory on my target host/osd and increased the
> >> target
> >> > > memory of that OSD to 10 Gb and restarted.  No effect observed.  In
> >> > > fact
> >> > > mem usage on the host is stable so I don't think the change took
> >> effect
> >> > > even with updating ceph.conf, restart and a direct asok config set.
> >> > > target
> >> > > memory value is confirmed to be set via asok config get
> >> > >
> >> > > Nothing has helped.  I still cannot break the 21 MiB/s barrier.
> >> > >
> >> > > Does anyone have any more ideas?
> >> >
> >> > For recovery you can adjust the following.
> >> >
> >> > osd_max_backfills default is 1, in my system I get the best
> performance
> >> > with 3 and wpq.
> >> >
> >> > The following I have not adjusted myself, but you can try.
> >> > osd_recovery_max_active is default to 3.
> >> > osd_recovery_op_priority is default to 3, a lower number increases the
> >> > priority for recovery.
> >> >
> >> > All of them can be runtime adjusted.
> >> >
> >> >
> >> > --
> >> > Kai Stian Olstad
> >> >
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-26 Thread Cedric
Also, osd_max_backfills and osd_recovery_max_active can play a role, but I
wonder whether they still have an effect with the new mClock scheduler.
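
In case it helps, a sketch of how these interact with mclock as far as I
understand it (option names taken from the Reef docs; please verify on your
release):

    ceph config set osd osd_mclock_profile high_recovery_ops
    # with the mclock scheduler, backfill/recovery limits are only honoured
    # once overrides are explicitly allowed
    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_max_backfills 3
    ceph config set osd osd_recovery_max_active 6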

On Sun, May 26, 2024, 09:37 Mazzystr  wrote:

> I can't explain the problem.  I have to recover three discs that are hdds.
> I figured on just replacing one to give the full recovery capacity of the
> cluster to that one disc.  I was never able to achieve a higher recovery
> rate than about 22 MiB/sec so I just added the other two discs.  Recovery
> bounced up to 129 MiB/sec for a while.  Then things settled at 50 MiB/sec.
> I kept tinkering to try to get back to 120 and now things are back to 23
> MiB/Sec again.  This is very irritating.
>
> Cpu usage is minimal in the single digit %.  Mem is right on target per
> target setting in ceph.conf.  Disc's and network appear to be 20%
> utilized.
>
> I'm not a normal Ceph user.  I don't care about client access at all.  The
> mclock assumptions are wrong for me.  I want my data to be replicated
> correctly as fast as possible.
>
> How do I open up the floodgates for maximum recovery performance?
>
>
>
>
> On Sat, May 25, 2024 at 8:13 PM Zakhar Kirpichenko 
> wrote:
>
> > Hi!
> >
> > Could you please elaborate what you meant by "adding another disc to the
> > recovery process"?
> >
> > /Z
> >
> >
> > On Sat, 25 May 2024, 22:49 Mazzystr,  wrote:
> >
> >> Well this was an interesting journey through the bowels of Ceph.  I have
> >> about 6 hours into tweaking every setting imaginable just to circle back
> >> to
> >> my basic configuration and 2G memory target per osd.  I was never able
> to
> >> exceed 22 Mib/Sec recovery time during that journey.
> >>
> >> I did end up fixing the issue and now I see the following -
> >>
> >>   io:
> >> recovery: 129 MiB/s, 33 objects/s
> >>
> >> This is normal for my measly cluster.  I like micro ceph clusters.  I
> have
> >> a lot of them. :)
> >>
> >> What was the fix?  Adding another disc to the recovery process!  I was
> >> recovering to one disc now I'm recovering to two.  I have three total
> that
> >> need to be recovered.  Somehow that one disc was completely swamped.  I
> >> was
> >> unable to see it in htop, atop, iostat.  Disc business was 6% max.
> >>
> >> My config is back to mclock scheduler, profile high_recovery_ops, and
> >> backfills of 256.
> >>
> >> Thank you everyone that took the time to review and contribute.
> Hopefully
> >> this provides some modern information for the next person that has slow
> >> recovery.
> >>
> >> /Chris C
> >>
> >>
> >>
> >>
> >>
> >> On Fri, May 24, 2024 at 1:43 PM Kai Stian Olstad 
> >> wrote:
> >>
> >> > On 24.05.2024 21:07, Mazzystr wrote:
> >> > > I did the obnoxious task of updating ceph.conf and restarting all my
> >> > > osds.
> >> > >
> >> > > ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get
> >> > > osd_op_queue
> >> > > {
> >> > > "osd_op_queue": "wpq"
> >> > > }
> >> > >
> >> > > I have some spare memory on my target host/osd and increased the
> >> target
> >> > > memory of that OSD to 10 Gb and restarted.  No effect observed.  In
> >> > > fact
> >> > > mem usage on the host is stable so I don't think the change took
> >> effect
> >> > > even with updating ceph.conf, restart and a direct asok config set.
> >> > > target
> >> > > memory value is confirmed to be set via asok config get
> >> > >
> >> > > Nothing has helped.  I still cannot break the 21 MiB/s barrier.
> >> > >
> >> > > Does anyone have any more ideas?
> >> >
> >> > For recovery you can adjust the following.
> >> >
> >> > osd_max_backfills default is 1, in my system I get the best
> performance
> >> > with 3 and wpq.
> >> >
> >> > The following I have not adjusted myself, but you can try.
> >> > osd_recovery_max_active is default to 3.
> >> > osd_recovery_op_priority is default to 3, a lower number increases the
> >> > priority for recovery.
> >> >
> >> > All of them can be runtime adjusted.
> >> >
> >> >
> >> > --
> >> > Kai Stian Olstad
> >> >
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph orch issue: lsblk: /dev/vg_osd/lvm_osd: not a block device

2024-05-26 Thread Cedric
Not sure you need to (or should) prepare the block device manually; Ceph
can handle these tasks. Did you try to clean up and retry by providing
/dev/sda6 to ceph orch daemon add?
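
Something along these lines, assuming nothing else uses that VG (names taken from
your commands; the zap step is destructive, so double-check first):

    lvremove /dev/vg_osd/lv_osd
    vgremove vg_osd
    pvremove /dev/sda6
    ceph orch device zap ceph1 /dev/sda6 --force
    ceph orch daemon add osd ceph1:/dev/sda6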

On Sun, May 26, 2024, 10:50 duluxoz  wrote:

> Hi All,
>
> Is the following a bug or some other problem (I can't tell)  :-)
>
> Brand new Ceph (Reef v18.2.3) install on Rocky Linux v9.4 - basically,
> its a brand new box.
>
> Ran the following commands (in order; no issues until final command):
>
>  1. pvcreate /dev/sda6
>  2. vgcreate vg_osd /dev/sda6
>  3. lvcreate -l 100%VG -n lv_osd vg_osd
>  4. cephadmbootstrap--mon-ip192.168.0.20
>  5. ceph orch daemon add osd ceph1:/dev/vg_osd/lvm_osd
>
> Received a whole bunch of error info on the console; the two relevant
> lines (as far as I can tell) are:
>
>   * /usr/bin/podman: stderr  stderr: lsblk: /dev/vg_osd/lvm_osd: not a
> block device
>   * RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
> --privileged --group-add=disk --init -e
> CONTAINER_IMAGE=
> quay.io/ceph/ceph@sha256:257b3f5140c11b51fd710ffdad6213ed53d74146f464a51717262d156daef553
> -e NODE_NAME=ceph1 -e CEPH_USE_RANDOM_NONCE=1 -e
> CEPH_VOLUME_OSDSPEC_AFFINITY=None -e CEPH_VOLUME_SKIP_RESTORECON=yes
> -e CEPH_VOLUME_DEBUG=1 -v
> /var/run/ceph/477045f4-1b34-11ef-9a30-0800274c7359:/var/run/ceph:z
> -v
> /var/log/ceph/477045f4-1b34-11ef-9a30-0800274c7359:/var/log/ceph:z
> -v
>
> /var/lib/ceph/477045f4-1b34-11ef-9a30-0800274c7359/crash:/var/lib/ceph/crash:z
> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v
> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
> /run/lock/lvm:/run/lock/lvm -v
>
> /var/lib/ceph/477045f4-1b34-11ef-9a30-0800274c7359/selinux:/sys/fs/selinux:ro
> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v
> /tmp/ceph-tmpe_krhtt8:/etc/ceph/ceph.conf:z -v
> /tmp/ceph-tmp_47jsxdp:/var/lib/ceph/bootstrap-osd/ceph.keyring:z
>
> quay.io/ceph/ceph@sha256:257b3f5140c11b51fd710ffdad6213ed53d74146f464a51717262d156daef553
> lvm batch --no-auto /dev/vg_osd/lvm_osd --yes --no-systemd
>
> I had a look around the Net and couldn't find anything relevant. This
> post (https://github.com/rook/rook/issues/4967) talks about a similar
> issue using Rook, but I'm not using Rook but cephadm.
>
> Any help in resolving this (or confirming it is a bug) would be greatly
> appreciated - thanks in advance.
>
> Cheers
>
> Dulux-Oz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to list pg-upmap-items

2024-12-12 Thread Cedric
FYI, you can also set the balancer mode to crush-compat; this way, even if
the balancer is re-enabled for any reason, the error messages will not occur.

https://docs.ceph.com/en/pacific/rados/operations/balancer/
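
For example:

    ceph balancer mode crush-compat
    ceph balancer status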

On Thu, Dec 12, 2024, 15:28 Janne Johansson  wrote:

> I have clusters that have been upgraded into "upmap"-capable releases,
> but in those cases, it was never in upmap mode, since these clusters
> would also have jewel-clients as lowest possible, so if you tried to
> enable balancer in upmap mode it would tell me to first bump clients
> to luminous at least, then allow upmap mode on the balancer.
>
> Den tors 12 dec. 2024 kl 14:37 skrev Matt Vandermeulen <
> stor...@reenigne.net>:
> >
> > As you discovered, it looks like there are no upmap items in your
> > cluster right now. The `ceph osd dump` command will list them, in JSON
> > as you show, or you can `grep ^pg_upmap` without JSON as well (same
> > output, different format).
> >
> > I think the balancer would have been enabled by default in Nautilus, I'm
> > surprised this hit you now. You can make sure it's off with `ceph
> > balancer off` so that it won't do anything in the future, and check its
> > status with `ceph balancer status`.
> >
> > Thanks,
> > Matt
> >
> >
> > On 2024-12-12 08:37, Frank Schilder wrote:
> > > Dear all,
> > >
> > > during our upgrade from octopus to pacific the MGR suddenly started
> > > logging messages like this one to audit.log:
> > >
> > > 2024-12-10T10:30:01.105524+0100 mon.ceph-03 (mon.2) 3004 : audit [INF]
> > > from='mgr.424622547 192.168.32.67:0/63' entity='mgr.ceph-03'
> > > cmd=[{"prefix": "osd pg-upmap-items", "format": "json", "pgid": "1.60",
> > > "id": [1054, 1125]}]: dispatch
> > >
> > > Apparently, the balancer got enabled and tried to do something.
> > > However, we never enabled pg-upmap on our cluster, because we still
> > > have jewel clients from the museum connected. Therefore, I'm pretty
> > > certain that all of these upmap requests either failed or are scheduled
> > > and pending.
> > >
> > > To be sure, I would like to confirm that nothing happened. How can I
> > > list upmap items and scheduled+pending upmap operations? If there are
> > > any, how do I delete these? I really would like to avoid that these
> > > requests start hurting in a few years from now. I looked at the
> > > documentation. Unfortunately, its the usual disease[1], commands for
> > > setting all sorts of stuff are documented, but commands to query
> > > anything seem to be missing.
> > >
> > > This workaround
> > >
> > > [root@gnosis osdmaps]# ceph osd dump -f json-pretty | grep upmap
> > > "pg_upmap": [],
> > > "pg_upmap_items": [],
> > >
> > > indicates nothing is screwed up yet. However, I would really like to
> > > know what happened to the MGR commands and where they are now. How do I
> > > confirm they went to digital heaven?
> > >
> > > [1] There are "ceph osd pg-upmap-items : set upmap items" and "ceph osd
> > > rm-pg-upmap-items : clear upmap items" commands. Why would anyone ever
> > > need a "ceph osd ls-pg-upmap-items"?? I found out that I can write it
> > > myself
> > > (
> https://ceph-users.ceph.narkive.com/h7y24SDg/stale-pg-upmap-items-entries-after-pg-increase
> > > and
> > >
> https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py#L102
> ).
> > > However, a good API is always symmetric to make it *easy* for users to
> > > check and fix screw-ups.
> > >
> > > Thanks and best regards,
> > > =
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Error ENOENT: Module not found

2025-01-25 Thread Cedric
I encountered this issue recently; restarting the MGRs did the trick.
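
A sketch of the usual ways to do that (assuming a cephadm deployment; the fsid and
mgr name are placeholders):

    ceph mgr fail
    # or, on the host running the active mgr:
    systemctl restart ceph-<fsid>@mgr.<hostname>.<id>.service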

Cheers

On Sat, Jan 25, 2025, 06:26 Devender Singh  wrote:

> Thanks for you reply… but those command not working as its an always
> module..but strange still showing error,
>
> # ceph  mgr module enable orchestrator
> module 'orchestrator' is already enabled (always-on)
>
> # ceph orch set backend  — returns successfully…
>
> # # ceph orch ls
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
>
> Its revolving between same error..
>
> Root Cause: I removed a hosts and its odd’s and after some time above
> error started automatically.
>
> Earlier in the had 5  nodes  but now 4.. Cluster is showing  unclean pg
> but not doing anything..
>
> But big error is Error ENOENT:
>
>
> Regards
> Dev
>
> > On Jan 24, 2025, at 4:59 PM, Fnu Virender Kumar 
> wrote:
> >
> > Did you try
> >
> > Ceph mgr module enable orchestrator
> > Ceph orch set backend
> > Ceph orch ls
> >
> > Check the mgr service daemon as well
> > Ceph -s
> >
> >
> > Regards
> > Virender
> > From: Devender Singh 
> > Sent: Friday, January 24, 2025 6:34:43 PM
> > To: ceph-users 
> > Subject: [ceph-users] Error ENOENT: Module not found
> >
> >
> > Hello all
> >
> > Any quick fix for …
> >
> > root@sea-devnode1:~# ceph orch ls
> > Error ENOENT: Module not found
> >
> >
> > Regards
> > Dev
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Automatic OSD activation after host reinstall

2025-02-14 Thread Cedric
Could it be related to automatic OSD deployment?

https://docs.ceph.com/en/reef/cephadm/services/#disabling-automatic-deployment-of-daemons
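
If so, a sketch of how to check and disable it (per the page linked above):

    ceph orch ls osd --export                                       # look for an all-available-devices or drive-group spec
    ceph orch apply osd --all-available-devices --unmanaged=true    # stop cephadm from automatically creating OSDs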



On Fri, Feb 14, 2025, 08:40 Eugen Block  wrote:

> Good morning,
>
> this week I observed something new, I think. At least I can't recall
> having seen that yet. Last week I upgraded a customer cluster to
> 18.2.4 (everything worked fine except RGWs keep crashing [0]), this
> week I reinstalled the OS on one of the hosts. And after a successful
> registry login, the OSDs were automatically activated without me
> having to run 'ceph cephadm osd activate ' as documented in [1].
> Zac improved the docs just last week, is that obsolete now? Or is that
> version specific? A few weeks ago we reinstalled our own Ceph servers
> as well, but we still run Pacific 16.2.15, and there I had to issue
> the OSD activation manually. Can anyone confirm this?
>
> Thanks!
> Eugen
>
> [0] https://tracker.ceph.com/issues/69885
> [1]
>
> https://docs.ceph.com/en/latest/cephadm/services/osd/#activate-existing-osds
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io