[ceph-users] Re: Old MDS container version when: Ceph orch apply mds

2024-08-02 Thread Eugen Block

Hi,

it sounds like the mds container_image is not configured properly. You
can set it via:


ceph config set mds container_image quay.io/ceph/ceph:v18.2.2

or just set it globally for all ceph daemons:

ceph config set global container_image quay.io/ceph/ceph:v18.2.2

If you bootstrap a fresh cluster, the image is set globally for you,
but that doesn't happen during an upgrade/adoption from a non-cephadm
cluster, which requires redeploying the MDS daemons.
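
A rough sketch of the full sequence (untested, assuming the service
name is mds.datafs as in your output below):

ceph config set mds container_image quay.io/ceph/ceph:v18.2.2
# recreate the daemons from the new image
ceph orch redeploy mds.datafs
# verify afterwards
ceph orch ps
ceph versions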


Regards,
Eugen


Zitat von opositor...@mail.com:


Hi All,
I migrated my Ceph 18.2.2 cluster from a non-cephadm configuration.
Everything went fine, except that the MDS service was deployed with an
old version: 17.0.0. I'm trying to deploy the MDS daemons using ceph
orch, but Ceph always downloads an old MDS image from Docker.


How can I deploy the MDS service at the same 18.2.2 version as the
rest of the services?


[root@master1 ~]# ceph orch apply mds datafs --placement="2 master1 master2"

[root@master1 ~]# ceph orch ps
NAME                       HOST     PORTS  STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION                IMAGE ID      CONTAINER ID
mds.datafs.master1.gcpovr  master1         running (36m)  6m ago     36m  37.2M    -        17.0.0-7183-g54142666  75e3d7089cea  96682779c7ad
mds.datafs.master2.oqaxuy  master2         running (36m)  6m ago     36m  33.1M    -        17.0.0-7183-g54142666  75e3d7089cea  a9a647f87c83
mgr.master                 master1         running (16h)  6m ago     17h  448M     -        18.2.2                 3c937764e6f5  70f06fa05b70
mgr.master2                master2         running (16h)  6m ago     17h  524M     -        18.2.2                 3c937764e6f5  2d0d5376d8b3
mon.master                 master1         running (16h)  6m ago     17h  384M     2048M    18.2.2                 3c937764e6f5  66a65017ce29
mon.master2                master2         running (16h)  6m ago     17h  380M     2048M    18.2.2                 3c937764e6f5  51d783a9e36c
osd.0                      osd00           running (16h)  3m ago     17h  432M     4096M    18.2.2                 3c937764e6f5  fedff66f5ed2
osd.1                      osd00           running (16h)  3m ago     17h  475M     4096M    18.2.2                 3c937764e6f5  24e24a1a22e6
osd.2                      osd00           running (16h)  3m ago     17h  516M     4096M    18.2.2                 3c937764e6f5  ccd05451b739
osd.3                      osd00           running (16h)  3m ago     17h  454M     4096M    18.2.2                 3c937764e6f5  f6d8f13c8aaf
osd.4                      master1         running (16h)  6m ago     17h  525M     4096M    18.2.2                 3c937764e6f5  a2dcf9f1a9b7
osd.5                      master2         running (16h)  6m ago     17h  331M     4096M    18.2.2                 3c937764e6f5  b0011e8561a4


[root@master1 ~]# ceph orch ls
NAME        PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
mds.datafs         2/2      6m ago     46s  master1;master2;count:2
mgr                2/0      6m ago     -
mon                2/0      6m ago     -
osd                6        6m ago     -

[root@master1 ~]# ceph versions
{
    "mon": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 2
    },
    "mgr": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 2
    },
    "osd": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 6
    },
    "mds": {
        "ceph version 17.0.0-7183-g54142666 (54142666e5705ced88e3e2d91ddc0ff29867a362) quincy (dev)": 2
    },
    "overall": {
        "ceph version 17.0.0-7183-g54142666 (54142666e5705ced88e3e2d91ddc0ff29867a362) quincy (dev)": 2,
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 10
    }
}

[root@master1 ~]# podman images
REPOSITORY                        TAG                  IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph                 v18.2.2              3c937764e6f5  7 weeks ago    1.28 GB
quay.io/ceph/ceph                 v18                  3c937764e6f5  7 weeks ago    1.28 GB
registry.access.redhat.com/ubi8   latest               c70d72aaebb4  3 months ago   212 MB
quay.io/ceph/ceph                 v16                  0d668911f040  23 months ago  1.27 GB
quay.io/ceph/ceph-grafana         8.3.5                dad864ee21e9  2 years ago    571 MB
quay.io/prometheus/prometheus     v2.33.4              514e6a882f6e  2 years ago    205 MB
quay.io/prometheus/node-exporter  v1.3.1               1dbe0e931976  2 years ago    22.3 MB
quay.io/prometheus/alertmanager   v0.23.0              ba2b418f427c  2 years ago    58.9 MB
docker.io/ceph/daemon-base        latest-master-devel  75e3d7089cea  2 years ago    1.29 GB

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Help with osd spec needed

2024-08-02 Thread Eugen Block

Hi,

if you assigned the SSD to be used for block.db, it won't be available
as a data device from the orchestrator's point of view. What you could
try is to manually create a partition or LV on the remaining SSD space
and then point the service spec to that partition/LV via a path spec. I
haven't tried that myself, though, so I have no clue whether that'll work.
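
A rough sketch of what I mean (untested; the VG/LV names are
placeholders, and the LV would need to exist under the same name on
every matched host):

# on each node, find the VG that ceph-volume created on the SSD and
# carve an LV out of the remaining free space
vgs
lvcreate -l 100%FREE -n osd-ssd-data <ssd_vg_name>

service_type: osd
service_id: osd_spec_ssd_path
placement:
  host_pattern: '*'
spec:
  data_devices:
    paths:
      - /dev/<ssd_vg_name>/osd-ssd-data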


Regards,
Eugen

Zitat von Kristaps Cudars :


3 nodes, each with:
3 hdd – 21G
1 ssd – 80G

Create an OSD containing block_data with a 15G block_db located on the
ssd – this part works.
Create a block_data OSD on the remaining 35G of the ssd – this part is
not working.

ceph orch apply osd -i /path/to/osd_spec.yml

service_type: osd
service_id: osd_spec_hdd
placement:
  host_pattern: '*'
spec:
  block_db_size: 15G
  db_slots: 3
  data_devices:
rotational: 1
  db_devices:
rotational: 0
---
service_type: osd
service_id: odd_spec_ssd
placement:
  host_pattern: '*'
spec:
  data_devices:
rotational: 0
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd-mirror keeps crashing

2024-08-02 Thread Eugen Block

Hi,

can you verify whether all images are readable? Maybe there's a corrupt
journal for an image and rbd-mirror fails to read it? That's just a wild
guess, I can't really interpret the stack trace. Or are there some
images without journaling enabled? Are there any logs available, maybe
even debug logs, where you could find the responsible image? Is this the
first run of rbd-mirror, or has it worked before?
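
A few quick checks that might narrow it down (just a sketch; pool/image
names and the rbd-mirror client id are placeholders):

rbd mirror pool status <pool> --verbose             # per-image mirroring state
rbd info <pool>/<image> | grep features             # journaling should be listed
rbd journal info --pool <pool> --image <image>      # does the journal open cleanly?
ceph config set client.rbd-mirror.<id> debug_rbd_mirror 15   # more verbose daemon logs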


Regards,
Eugen

Zitat von arm...@armsby.dk:


Hi everyone,

I've been running rbd-mirror between my old Ceph system (16.2.10)  
and my new system (18.2.2). I'm using journaling mode on a pool that  
contains 7,500 images. Everything was running perfectly until it  
processed about 5,608 images. Now, it keeps crashing with the  
following message:


2024-07-19T05:49:32.425+ 7f582b3fd6c0  0 set uid:gid to 167:167  
(ceph:ceph)
2024-07-19T05:49:32.425+ 7f582b3fd6c0  0 ceph version 16.2.10  
(45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), process  
rbd-mirror, pid 7
2024-07-19T05:49:32.429+ 7f582b3fd6c0  1 mgrc  
service_daemon_register rbd-mirror.3606956688 metadata {arch=x86_64,  
ceph_release=pacific, ceph_version=ceph version 16.2.10  
(45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable),  
ceph_version_short=16.2.10, container_hostname=mon-001,  
container_image=quay.io/ceph/ceph@sha256:2b68483bcd050472a18e73389c0e1f3f70d34bb7abf733f692e88c935ea0a6bd, cpu=Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz, distro=centos, distro_description=CentOS Stream 8, distro_version=8, hostname=mon-001, id=mon-001.lcqrti, instance_id=3606956688, kernel_description=#1 SMP Mon Jul 18 17:42:52 UTC 2022, kernel_version=4.18.0-408.el8.x86_64, mem_swap_kb=4194300, mem_total_kb=131393360,  
os=Linux}
2024-07-19T05:50:28.305+ 7f5812582700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/common/Thread.cc: In function 'void Thread::create(const char*, size_t)' thread 7f5812582700 time 2024-07-19T05:50:28.303536+ /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/common/Thread.cc: 165: FAILED ceph_assert(ret == 0)


ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f58218b6de8]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x277002) [0x7f58218b7002]
 3: /usr/lib64/ceph/libceph-common.so.2(+0x362fd7) [0x7f58219a2fd7]
 4: (CommonSafeTimer::init()+0x1fe) [0x7f58219a963e]
 5: (journal::Journaler::Threads::Threads(ceph::common::CephContext*)+0x2fc) [0x55c9b33c6ddc]
 6: (journal::Journaler::Journaler(librados::v14_2_0::IoCtx&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, journal::Settings const&, journal::CacheManagerHandler*)+0x50) [0x55c9b33c6f10]
 7: (librbd::Journal::get_tag_owner(librados::v14_2_0::IoCtx&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, librbd::asio::ContextWQ*, Context*)+0x19f) [0x55c9b2fa65af]
 8: (librbd::mirror::GetInfoRequest::get_journal_tag_owner()+0x210) [0x55c9b31869f0]
 9: (librbd::mirror::GetInfoRequest::handle_get_mirror_image(int)+0x8c8) [0x55c9b3189d78]
 10: /lib64/librados.so.2(+0xa8546) [0x7f582aedb546]
 11: /lib64/librados.so.2(+0xc17e5) [0x7f582aef47e5]
 12: /lib64/librados.so.2(+0xc3742) [0x7f582aef6742]
 13: /lib64/librados.so.2(+0xc914a) [0x7f582aefc14a]
 14: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f581fb03ba3]
 15: /lib64/libpthread.so.0(+0x81ca) [0x7f5820cec1ca]
 16: clone()


Has anyone encountered a similar problem or have any insight into  
what might be causing this crash?


Thanks in advance for your help.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Difficulty importing bluestore OSDs from the old cluster (bad fsid) - OSD does not start

2024-08-02 Thread Eugen Block

Hi,

have you tried updating the label and the fsid in the osd's data directory?

ceph-bluestore-tool set-label-key --path /var/lib/ceph/osd/ceph-0 -k ceph_fsid -v <new_cluster_fsid>


And then you'll also need to change /var/lib/ceph/osd/ceph-0/ceph_fsid
to reflect the desired fsid. It's been a while since I had to fiddle
with bluestore-tool, so I'm not sure whether this will be sufficient,
but it might work.
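
Roughly like this (just a sketch; <new_cluster_fsid> is a placeholder,
and I'd try it on one non-critical OSD first):

ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block    # check the current ceph_fsid in the label
ceph-bluestore-tool set-label-key --path /var/lib/ceph/osd/ceph-0 -k ceph_fsid -v <new_cluster_fsid>
echo <new_cluster_fsid> > /var/lib/ceph/osd/ceph-0/ceph_fsid
systemctl restart ceph-osd@0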


Another way could be to bootstrap a new cluster; there you can specify
the fsid:


# cephadm bootstrap -h | grep fsid
 [--mgr-id MGR_ID] [--fsid FSID]
  --fsid FSID   cluster FSID

But I have no idea how to do that in proxmox.

Regards,
Eugen

Zitat von Vinícius Barreto :


Hello everybody! I am fascinated by Ceph, but right now I'm living through
moments of terror and despair. I'm using Ceph 18.2.2 (reef), and at the moment
we need to import 4 OSDs from an old cluster (which was removed by accident).
In short, we suspect that the cause of and solution to this case lie in the
information in the OSD log below (all OSD daemons repeat the same
information).
After importing, I receive the following information in the "journalctl -u
ceph-osd@2" log:
---
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  in thread 702342c006c0
thread_name:ms_dispatch
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  ceph version 18.2.2
(e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  1:
/lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x70235785b050]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  2:
/lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x7023578a9e2c]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  3: gsignal()
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  4: abort()
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  5: (ceph::__ceph_abort(char const*,
int, char const*, std::__cxx11::basic_string,
std::allocator > const&)+0x18a) [0x62edaeab47>
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  6:
(OSD::handle_osd_map(MOSDMap*)+0x384a) [0x62edaec0aeca]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  7:
(OSD::ms_dispatch(Message*)+0x62) [0x62edaec0b332]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  8:
(Messenger::ms_deliver_dispatch(boost::intrusive_ptr const&)+0xc1)
[0x62edaf5eea51]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  9: (DispatchQueue::entry()+0x6cf)
[0x62edaf5ed53f]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  10:
(DispatchQueue::DispatchThread::entry()+0xd) [0x62edaf40fd5d]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  11:
/lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x7023578a8134]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  12:
/lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x7023579287dc]
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  NOTE: a copy of the executable, or
`objdump -rdS ` is needed to interpret this.
Jul 17 20:09:55 pxm3 ceph-osd[647313]:   -142> 2024-07-17T20:09:54.845-0300
702356d616c0 -1 osd.2 71 log_to_monitors true
Jul 17 20:09:55 pxm3 ceph-osd[647313]: -2> 2024-07-17T20:09:55.362-0300
702342c006c0 -1 osd.2 71 ERROR: bad fsid?  i have
5514a69a-46ba-4a44-bb56-8d3109c6c9e0 and inc has
f4466e33-b57d-4d68-9909-346>
Jul 17 20:09:55 pxm3 ceph-osd[647313]: -1> 2024-07-17T20:09:55.366-0300
702342c006c0 -1 ./src/osd/OSD.cc: In function 'void
OSD::handle_osd_map(MOSDMap*)' thread 702342c006c0 time 2024-07-17T20:09:5>
Jul 17 20:09:55 pxm3 ceph-osd[647313]: ./src/osd/OSD.cc: 8098:
ceph_abort_msg("bad fsid")
Jul 17 20:09:55 pxm3 ceph-osd[647313]:  ceph version 18.2.2
(e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
---

Above in the OSD daemon log we see the "bad fsid" error, which indicates
that the OSDs are trying to connect to a different Ceph cluster than the
one they were initially configured for. Each Ceph cluster has a unique
identifier, the fsid, and if the OSDs detect a different fsid, they cannot
connect correctly.
Note: I believe this is the root cause of the problem and points to the solution.
Problem: the OSDs carry the fsid of the old cluster.
Would the solution be to change the ceph_fsid (fsid) of the OSDs to the fsid
of the new cluster, or to recreate the cluster using the fsid recorded in the
OSD metadata, i.e. the fsid of the old cluster? Would that be possible?

I also opened a topic for this case on the proxmox forum and inserted a lot
of data about this scenario there:
https://forum.proxmox.com/threads/ceph-cluster-rebuild-import-bluestore-osds-from-old-cluster-bad-fsid-osd-dont-start-he-only-stays-in-down-state.151349/

-
~# cat /etc/ceph/ceph.conf

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 192.168.0.2/24
fsid = f4466e33-b57d-4d68-9909-3468afd9e5c2
mon_allow_pool_delete = true
mon_host = 192.168.0.2 192.168.0.3 192.168.0.1
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 192.168.0.0/24

[client]
 #   keyring = /etc/pve/priv/$cluster.$name.keyring
  keyring = /etc/pve/priv/ceph.client.admin.keyring

#[client.crash]

[ceph-users] Re: ceph pg stuck active+remapped+backfilling

2024-08-02 Thread Eugen Block
First, I would restart the active mgr, because the current status might
be outdated; I've seen that lots of times. If the PG is still in a
remapped state, you'll need to provide a lot more information about your
cluster: the current osd tree, ceph status, the applied crush rule etc.
One possible root cause is that crush gives up too soon; this can be
adjusted in the rule (increase set_choose_tries).
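
For reference, roughly like this (a sketch; the rule to edit and the
value are up to you):

ceph mgr fail                       # fail over the active mgr to refresh the reported state
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# in the affected rule, add (or raise) e.g.:   step set_choose_tries 100
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new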



Zitat von Jorge Garcia :


We were having an OSD reporting lots of errors, so I tried to remove
it by doing:

  ceph orch osd rm 139 --zap

It started moving all the data. Eventually, we got to the point that
there's only 1 pg backfilling, but that seems to be stuck now. I think
it may be because, in the process, another OSD (103) started reporting
errors, too. The pool is erasure k:5 m:2, so it should still be OK. I
don't see any progress happening on the backfill, and ceph -s has been
reporting "69797/925734949 objects misplaced (0.008%)" for days now.
How can I get it to finish the backfilling, or at least find out why
it's not working?

ceph version: quincy

Thanks!

Jorge
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph orchestrator upgrade quincy to reef, missing ceph-exporter

2024-08-02 Thread Frank de Bot (lists)

Hi,

When upgrading a cephadm-deployed Quincy cluster to Reef, no
ceph-exporter service is launched.


ceph-exporter is new in Reef (from the release notes: "ceph-exporter: Now
the performance metrics for Ceph daemons are exported by ceph-exporter,
which deploys on each daemon rather than using prometheus exporter. This
will reduce performance bottlenecks."), so metrics will be missing when
the ceph-exporter service is not applied after the upgrade.


Is it intentional that it's not added after an upgrade by the
orchestrator? A ceph orch apply ceph-exporter solves it, but I think it
could be included in the upgrade, since it may not be clear at first why
metrics in Grafana are missing.
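
For reference, the workaround on my side was simply (sketch):

ceph orch apply ceph-exporter
ceph orch ls ceph-exporter      # should eventually show one daemon per host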


Regards,

Frank de Bot
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Error When Replacing OSD - Please Help

2024-08-02 Thread Eugen Block

Hi,

is your cluster managed by cephadm? Because you refer to the manual  
procedure in the docs, but they are probably referring to pre-cephadm  
times when you had to use ceph-volume directly. If your cluster is  
managed by cephadm I wouldn't intervene manually when the orchestrator  
can help with these tasks. See [2] for more details.


[2] https://docs.ceph.com/en/reef/cephadm/services/osd/#remove-an-osd
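
With cephadm, the replacement flow would look roughly like this (a
sketch, assuming OSD id 0 as in your steps):

ceph orch osd rm 0 --replace --zap     # drain, zap and mark the OSD destroyed, keeping the id reserved
ceph orch osd rm status                # watch the progress
# after swapping the disk, a matching OSD service spec will pick it up
# automatically, or you can add it explicitly:
ceph orch daemon add osd <host>:<device>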

Zitat von duluxoz :


Hi All,

I'm trying to replace an OSD in our cluster.

This is on Reef 18.2.2 on Rocky 9.4.

I performed the following steps (from this page of the Ceph Doco:  
https://docs.ceph.com/en/reef/rados/operations/add-or-rm-osds/):


1. Make sure that it is safe to destroy the OSD:
   `while ! ceph osd safe-to-destroy osd.{id}; do sleep 10; done`
2. Destroy the OSD: `ceph osd destroy 0 --yes-i-really-mean-it`
3. Replaced the HDD
4. Prepare the disk for replacement by using the ID of the OSD that was
   destroyed in previous steps:
   `ceph-volume lvm prepare --osd-id 0 --data /dev/sd3`

However, at this point I get the following errors:

~~~

Running command: /usr/bin/ceph --cluster ceph --name  
client.bootstrap-osd --keyring  
/var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
 stderr: 2024-08-02T13:52:58.812+1000 7ff26e904640 -1 auth: unable  
to find a keyring on  
/etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or  
directory
 stderr: 2024-08-02T13:52:58.812+1000 7ff26e904640 -1  
AuthRegistry(0x7ff268063e88) no keyring found at  
/etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling  
cephx
 stderr: 2024-08-02T13:52:58.817+1000 7ff26e904640 -1 auth: unable  
to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2)  
No such file or directory
 stderr: 2024-08-02T13:52:58.817+1000 7ff26e904640 -1  
AuthRegistry(0x7ff268063e88) no keyring found at  
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-08-02T13:52:58.819+1000 7ff26e904640 -1 auth: unable  
to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2)  
No such file or directory
 stderr: 2024-08-02T13:52:58.819+1000 7ff26e904640 -1  
AuthRegistry(0x7ff268065b00) no keyring found at  
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
 stderr: 2024-08-02T13:52:58.821+1000 7ff26e904640 -1 auth: unable  
to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2)  
No such file or directory
 stderr: 2024-08-02T13:52:58.821+1000 7ff26e904640 -1  
AuthRegistry(0x7ff26e9030c0) no keyring found at  
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx

 stderr: [errno 2] RADOS object not found (error connecting to the cluster)
-->  RuntimeError: Unable check if OSD id exists: 0

~~~

I can see a `client.bootstrap-osd` user in the Ceph Dashboard Ceph
User List, so I'm not sure what's going on.


Any help is greatly appreciated - thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS failing because of corrupted dentries in lost+found after update from 17.2.7 to 18.2.0

2024-08-02 Thread Dhairya Parmar
Hi Justin,

You should be able to delete inodes from the lost+found dir simply with
`sudo rm -rf lost+found/`

What do you get when you try to delete? Do you get `EROFS`?

On Fri, Aug 2, 2024 at 8:42 AM Justin Lee  wrote:

> After we updated our ceph cluster from 17.2.7 to 18.2.0 the MDS kept being
> marked as damaged and stuck in up:standby with these errors in the log.
>
> debug-12> 2024-07-14T21:22:19.962+ 7f020cf3a700  1
> mds.0.cache.den(0x4 1000b3bcfea) loaded already corrupt dentry:
> [dentry #0x1/lost+found/1000b3bcfea [head,head] rep@0.0 NULL (dversion
> lock) pv=0 v=2 ino=(nil) state=0 0x558ca63b6500]
> debug-11> 2024-07-14T21:22:19.962+ 7f020cf3a700 10
> mds.0.cache.dir(0x4) go_bad_dentry 1000b3bcfea
>
> these log lines are repeated a bunch of times in our MDS logs, all on
> dentries that are within the lost+found directory. After reading this
> mailing
> list post , we
> tried setting ceph config set mds mds_go_bad_corrupt_dentry false. This
> seemed to successfully circumvent the issue, however, after a few seconds
> our MDS crashes. Our 3 MDS are now stuck in a cycle of active -> crash ->
> standby -> back to active. Because of this our actual ceph fs is extremely
> laggy.
>
> We read here  that
> reef now makes it possible to delete the lost+found directory, which might
> solve our problem, but it is inaccessible, to cd, ls, rm, etc.
>
> Has anyone seen this type of issue or know how to solve it? Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm Offline Bootstrapping Issue

2024-08-02 Thread Eugen Block

Hi,

I haven't seen that one yet. Can you show the output from these commands?

ceph orch client-keyring ls
ceph orch client-keyring set client.admin label:_admin

Is there anything helpful in the mgr log?

Zitat von "Alex Hussein-Kershaw (HE/HIM)" :


Hi,

I'm hitting an issue doing an offline install of Ceph 18.2.2 using cephadm.

Long output below... any advice is appreciated.

Looks like we didn't manage to add the admin labels (trying with
--skip-admin also results in a similar health warning).


Subsequently trying to add an OSD fails quietly, I assume because  
cephadm is unhappy.


Thanks,
Alex

$  sudo  cephadm --image "ceph/ceph:v18.2.2" --docker bootstrap   
--mon-ip `hostname -I` --skip-pull --ssh-user qs-admin  
--ssh-private-key /home/qs-admin/.ssh/id_rsa --ssh-public-key  
/home/qs-admin/.ssh/id_rsa.pub  --skip-dashboard

Verifying ssh connectivity using standard pubkey authentication ...
Adding key to qs-admin@localhost authorized_keys...
key already in qs-admin@localhost authorized_keys...
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: 65bee110-3ae6-11ef-a1de-005056013d88
Verifying IP 10.235.22.8 port 3300 ...
Verifying IP 10.235.22.8 port 6789 ...
Mon IP `10.235.22.8` is in CIDR network `10.235.16.0/20`
Mon IP `10.235.22.8` is in CIDR network `10.235.16.0/20`
Internal network (--cluster-network) has not been provided, OSD  
replication will default to the public_network
Ceph version: ceph version 18.2.2  
(531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting public_network to 10.235.16.0/20 in mon config section
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 0.0.0.0:9283 ...
Verifying port 0.0.0.0:8765 ...
Verifying port 0.0.0.0:8443 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr not available, waiting (4/15)...
mgr not available, waiting (5/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Using provided ssh keys...
Adding key to qs-admin@localhost authorized_keys...
key already in qs-admin@localhost authorized_keys...
Adding host starlight-1...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Deploying ceph-exporter service with default placement...
Deploying prometheus service with default placement...
Deploying grafana service with default placement...
Deploying node-exporter service with default placement...
Deploying alertmanager service with default placement...
Enabling client.admin keyring and conf on hosts with "admin" label
Non-zero exit code 5 from /usr/bin/docker run --rm --ipc=host  
--stop-signal=SIGTERM --ulimit nofile=1048576 --net=host  
--entrypoint /usr/bin/ceph --init -e  
CONTAINER_IMAGE=ceph/ceph:v18.2.2 -e NODE_NAME=starlight-1 -e  
CEPH_USE_RANDOM_NONCE=1 -v  
/var/log/ceph/65bee110-3ae6-11ef-a1de-005056013d88:/var/log/ceph:z  
-v /tmp/ceph-tmpxbngx708:/etc/ceph/ceph.client.admin.keyring:z -v  
/tmp/ceph-tmp94g7iyn2:/etc/ceph/ceph.conf:z ceph/ceph:v18.2.2 orch  
client-keyring set client.admin label:_admin
/usr/bin/ceph: stderr Error EIO: Module 'cephadm' has experienced an  
error and cannot handle commands:  
ContainerInspectInfo(image_id='3c937764e6f5de1131b469dc69f0db09f8bd55cf6c983482cde518596d3dd0e5', ceph_version='ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)',  
repo_digests=[''])

Unable to set up "admin" label; assuming older version of Ceph
Saving cluster configuration to  
/var/lib/ceph/65bee110-3ae6-11ef-a1de-005056013d88/config directory

Enabling autotune for osd_memory_target
You can access the Ceph CLI as following in case of multi-cluster or  
non-default config:


sudo /usr/sbin/cephadm shell --fsid  
65bee110-3ae6-11ef-a1de-005056013d88 -c /etc/ceph/ceph.conf -k  
/etc/ceph/ceph.client.admin.keyring


Or, if you are only running a single cluster on this host:

sudo /usr/sbin/cephadm shell

Please consider enabling telemetry to help improve Ceph:

ceph telemetry on

For more information see:

https://docs.ceph.com/en/latest/mgr/telemetry/

Bootstrap complete.


]$ sudo docker exec  

[ceph-users] Re: Can you return orphaned objects to a bucket?

2024-08-02 Thread Frédéric Nass
Hello,

Not sure this exactly matches your case, but you could try to reindex those 
orphan objects with 'radosgw-admin object reindex --bucket {bucket_name}'. See 
[1] for command arguments like realm, zonegroup, zone, etc.
This command scans the data pool for objects that belong to a given bucket and 
adds those objects back to the bucket index.

Same logic as the rgw-restore-bucket-index script [1][2], which has proven 
successful in recovering bucket indexes destroyed by resharding [3] in the past.
'radosgw-admin object reindex --bucket {bucket_name}' may be faster than 
rgw-restore-bucket-index, if I'm not mistaken.
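
Something along these lines (a sketch; the realm/zonegroup/zone
arguments are only needed in multisite setups):

radosgw-admin object reindex --bucket {bucket_name} \
    --rgw-realm {realm} --rgw-zonegroup {zonegroup} --rgw-zone {zone}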

Regards,
Frédéric.

[1] https://docs.ceph.com/en/latest/man/8/rgw-restore-bucket-index/
[2] https://github.com/ceph/ceph/blob/main/src/rgw/rgw-restore-bucket-index
[3] https://github.com/ceph/ceph/pull/50329

- Le 10 Juil 24, à 15:24,  motahare...@gmail.com a écrit :

> Hello everyone,
> Is there any way to return orphaned objects to a bucket? A large bucket was
> accidentally corrupted and emptied, but its objects do exist in rados. bi list
> and bucket check return an empty list, and bucket fix doesn't do anything, as
> expected. How can we index the associated rados objects back into the bucket?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS failing because of corrupted dentries in lost+found after update from 17.2.7 to 18.2.0

2024-08-02 Thread Dhairya Parmar
So the mount hung? Can you see anything suspicious in the logs?

On Fri, Aug 2, 2024 at 7:17 PM Justin Lee  wrote:

> Hi Dhairya,
>
> Thanks for the response! We tried removing it as you suggested with `rm
> -rf` but the command just hangs indefinitely with no output. We are also
> unable to `ls lost_found`, or otherwise interact with the directory's
> contents.
>
> Best,
> Justin lee
>
> On Fri, Aug 2, 2024 at 8:24 AM Dhairya Parmar  wrote:
>
>> Hi Justin,
>>
> >> You should be able to delete inodes from the lost+found dirs simply with
> >> `sudo rm -rf lost+found/`
>>
>> What do you get when you try to delete? Do you get `EROFS`?
>>
>> On Fri, Aug 2, 2024 at 8:42 AM Justin Lee 
>> wrote:
>>
>>> After we updated our ceph cluster from 17.2.7 to 18.2.0 the MDS kept
>>> being
>>> marked as damaged and stuck in up:standby with these errors in the log.
>>>
>>> debug-12> 2024-07-14T21:22:19.962+ 7f020cf3a700  1
>>> mds.0.cache.den(0x4 1000b3bcfea) loaded already corrupt dentry:
>>> [dentry #0x1/lost+found/1000b3bcfea [head,head] rep@0.0 NULL (dversion
>>> lock) pv=0 v=2 ino=(nil) state=0 0x558ca63b6500]
>>> debug-11> 2024-07-14T21:22:19.962+ 7f020cf3a700 10
>>> mds.0.cache.dir(0x4) go_bad_dentry 1000b3bcfea
>>>
>>> these log lines are repeated a bunch of times in our MDS logs, all on
>>> dentries that are within the lost+found directory. After reading this
>>> mailing
>>> list post , we
>>> tried setting ceph config set mds mds_go_bad_corrupt_dentry false. This
>>> seemed to successfully circumvent the issue, however, after a few seconds
>>> our MDS crashes. Our 3 MDS are now stuck in a cycle of active -> crash ->
>>> standby -> back to active. Because of this our actual ceph fs is
>>> extremely
>>> laggy.
>>>
>>> We read here 
>>> that
>>> reef now makes it possible to delete the lost+found directory, which
>>> might
>>> solve our problem, but it is inaccessible, to cd, ls, rm, etc.
>>>
>>> Has anyone seen this type of issue or know how to solve it? Thanks!
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm Offline Bootstrapping Issue

2024-08-02 Thread Tim Holloway
You might want to try my "bringing up an OSD really, really fast"
package (https://gogs.mousetech.com/mtsinc7/instant_osd).

It's actually for spinning up a VM with an OSD in it, although you can
skip the VM setup script if you're on a bare OS and just run the
Ansible part.

Apologies to anyone who tried to pull it last week, as lightning
destroyed the cable to my ISP and it took them 5 days to get me back on
the Internet. So much for having a business account.

One quirk. You may need to manually copy in a copy of your ceph osd-
bootstrap key to get the operation to complete. I'm not sure why, since
I'd expect cephadm to have dealt with that, and the key has to be
located in /etc/ceph, not in /etc/ceph/. It may further
be impacted by the filesystem differences internal and external to the
cephadm shell. Which is mildly annoying, but not too bad. Someday I
hope to get that part locked down.

   Tim

On Fri, 2024-08-02 at 13:24 +, Eugen Block wrote:
> Hi,
> 
> I haven't seen that one yet. Can you show the output from these
> commands?
> 
> ceph orch client-keyring ls
> ceph orch client-keyring set client.admin label:_admin
> 
> Is there anything helpful in the mgr log?
> 
> Zitat von "Alex Hussein-Kershaw (HE/HIM)" :
> 
> > Hi,
> > 
> > I'm hitting an issue doing an offline install of Ceph 18.2.2 using
> > cephadm.
> > 
> > Long output below... any advice is appreciated.
> > 
> > Looks like we don't managed to add admin labels (but also trying  
> > with --skip-admin results in a similar health warning).
> > 
> > Subsequently trying to add an OSD fails quietly, I assume because  
> > cephadm is unhappy.
> > 
> > Thanks,
> > Alex
> > 
> > $  sudo  cephadm --image "ceph/ceph:v18.2.2" --docker bootstrap   
> > --mon-ip `hostname -I` --skip-pull --ssh-user qs-admin  
> > --ssh-private-key /home/qs-admin/.ssh/id_rsa --ssh-public-key  
> > /home/qs-admin/.ssh/id_rsa.pub  --skip-dashboard
> > Verifying ssh connectivity using standard pubkey authentication ...
> > Adding key to qs-admin@localhost authorized_keys...
> > key already in qs-admin@localhost authorized_keys...
> > Verifying podman|docker is present...
> > Verifying lvm2 is present...
> > Verifying time synchronization is in place...
> > Unit chronyd.service is enabled and running
> > Repeating the final host check...
> > docker (/usr/bin/docker) is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > Cluster fsid: 65bee110-3ae6-11ef-a1de-005056013d88
> > Verifying IP 10.235.22.8 port 3300 ...
> > Verifying IP 10.235.22.8 port 6789 ...
> > Mon IP `10.235.22.8` is in CIDR network `10.235.16.0/20`
> > Mon IP `10.235.22.8` is in CIDR network `10.235.16.0/20`
> > Internal network (--cluster-network) has not been provided, OSD  
> > replication will default to the public_network
> > Ceph version: ceph version 18.2.2  
> > (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
> > Extracting ceph user uid/gid from container image...
> > Creating initial keys...
> > Creating initial monmap...
> > Creating mon...
> > Waiting for mon to start...
> > Waiting for mon...
> > mon is available
> > Assimilating anything we can from ceph.conf...
> > Generating new minimal ceph.conf...
> > Restarting the monitor...
> > Setting public_network to 10.235.16.0/20 in mon config section
> > Wrote config to /etc/ceph/ceph.conf
> > Wrote keyring to /etc/ceph/ceph.client.admin.keyring
> > Creating mgr...
> > Verifying port 0.0.0.0:9283 ...
> > Verifying port 0.0.0.0:8765 ...
> > Verifying port 0.0.0.0:8443 ...
> > Waiting for mgr to start...
> > Waiting for mgr...
> > mgr not available, waiting (1/15)...
> > mgr not available, waiting (2/15)...
> > mgr not available, waiting (3/15)...
> > mgr not available, waiting (4/15)...
> > mgr not available, waiting (5/15)...
> > mgr is available
> > Enabling cephadm module...
> > Waiting for the mgr to restart...
> > Waiting for mgr epoch 5...
> > mgr epoch 5 is available
> > Setting orchestrator backend to cephadm...
> > Using provided ssh keys...
> > Adding key to qs-admin@localhost authorized_keys...
> > key already in qs-admin@localhost authorized_keys...
> > Adding host starlight-1...
> > Deploying mon service with default placement...
> > Deploying mgr service with default placement...
> > Deploying crash service with default placement...
> > Deploying ceph-exporter service with default placement...
> > Deploying prometheus service with default placement...
> > Deploying grafana service with default placement...
> > Deploying node-exporter service with default placement...
> > Deploying alertmanager service with default placement...
> > Enabling client.admin keyring and conf on hosts with "admin" label
> > Non-zero exit code 5 from /usr/bin/docker run --rm --ipc=host  
> > --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host  
> > --entrypoint /usr/bin/ceph --init -e  
> > CONTAINER_IMAGE=ceph/ceph:v18.2.2 -e NODE_NAME=starlight-1 -e  
> > CEPH_USE

[ceph-users] Re: Cephadm Offline Bootstrapping Issue

2024-08-02 Thread Adam King
The thing that stands out to me from that output was that the image has no
repo_digests. It's possible cephadm is expecting there to be digests and is
crashing out trying to grab them for this image. I think it's worth a try
to set mgr/cephadm/use_repo_digest to false, and then restart the mgr. FWIW
turning off that setting has resolved other issues related to disconnected
installs as well. It just means you should avoid using floating tags.
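
Concretely, something like this (sketch):

ceph config set mgr mgr/cephadm/use_repo_digest false
ceph mgr fail     # restart the active mgr so the cephadm module picks up the change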

On Thu, Aug 1, 2024 at 11:19 PM Alex Hussein-Kershaw (HE/HIM) <
alex...@microsoft.com> wrote:

> Hi,
>
> I'm hitting an issue doing an offline install of Ceph 18.2.2 using cephadm.
>
> Long output below... any advice is appreciated.
>
> Looks like we don't managed to add admin labels (but also trying with
> --skip-admin results in a similar health warning).
>
> Subsequently trying to add an OSD fails quietly, I assume because cephadm
> is unhappy.
>
> Thanks,
> Alex
>
> $  sudo  cephadm --image "ceph/ceph:v18.2.2" --docker bootstrap  --mon-ip
> `hostname -I` --skip-pull --ssh-user qs-admin --ssh-private-key
> /home/qs-admin/.ssh/id_rsa --ssh-public-key /home/qs-admin/.ssh/id_rsa.pub
> --skip-dashboard
> Verifying ssh connectivity using standard pubkey authentication ...
> Adding key to qs-admin@localhost authorized_keys...
> key already in qs-admin@localhost authorized_keys...
> Verifying podman|docker is present...
> Verifying lvm2 is present...
> Verifying time synchronization is in place...
> Unit chronyd.service is enabled and running
> Repeating the final host check...
> docker (/usr/bin/docker) is present
> systemctl is present
> lvcreate is present
> Unit chronyd.service is enabled and running
> Host looks OK
> Cluster fsid: 65bee110-3ae6-11ef-a1de-005056013d88
> Verifying IP 10.235.22.8 port 3300 ...
> Verifying IP 10.235.22.8 port 6789 ...
> Mon IP `10.235.22.8` is in CIDR network `10.235.16.0/20`
> 
> Mon IP `10.235.22.8` is in CIDR network `10.235.16.0/20`
> 
> Internal network (--cluster-network) has not been provided, OSD
> replication will default to the public_network
> Ceph version: ceph version 18.2.2
> (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
> Extracting ceph user uid/gid from container image...
> Creating initial keys...
> Creating initial monmap...
> Creating mon...
> Waiting for mon to start...
> Waiting for mon...
> mon is available
> Assimilating anything we can from ceph.conf...
> Generating new minimal ceph.conf...
> Restarting the monitor...
> Setting public_network to 10.235.16.0/20 in mon config section
> Wrote config to /etc/ceph/ceph.conf
> Wrote keyring to /etc/ceph/ceph.client.admin.keyring
> Creating mgr...
> Verifying port 0.0.0.0:9283 ...
> Verifying port 0.0.0.0:8765 ...
> Verifying port 0.0.0.0:8443 ...
> Waiting for mgr to start...
> Waiting for mgr...
> mgr not available, waiting (1/15)...
> mgr not available, waiting (2/15)...
> mgr not available, waiting (3/15)...
> mgr not available, waiting (4/15)...
> mgr not available, waiting (5/15)...
> mgr is available
> Enabling cephadm module...
> Waiting for the mgr to restart...
> Waiting for mgr epoch 5...
> mgr epoch 5 is available
> Setting orchestrator backend to cephadm...
> Using provided ssh keys...
> Adding key to qs-admin@localhost authorized_keys...
> key already in qs-admin@localhost authorized_keys...
> Adding host starlight-1...
> Deploying mon service with default placement...
> Deploying mgr service with default placement...
> Deploying crash service with default placement...
> Deploying ceph-exporter service with default placement...
> Deploying prometheus service with default placement...
> Deploying grafana service with default placement...
> Deploying node-exporter service with default placement...
> Deploying alertmanager service with default placement...
> Enabling client.admin keyring and conf on hosts with "admin" label
> Non-zero exit code 5 from /usr/bin/docker run --rm --ipc=host
> --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint
> /usr/bin/ceph --init -e CONTAINER_IMAGE=ceph/ceph:v18.2.2 -e
> NODE_NAME=starlight-1 -e CEPH_USE_RANDOM_NONCE=1 -v
> /var/log/ceph/65bee110-3ae6-11ef-a1de-005056013d88:/var/log/ceph:z -v
> /tmp/ceph-tmpxbngx708:/etc/ceph/ceph.client.admin.keyring:z -v
> /tmp/ceph-tmp94g7iyn2:/etc/ceph/ceph.conf:z ceph/ceph:v18.2.2 orch
> client-keyring set client.admin label:_admin
> /usr/bin/ceph: stderr Error EIO: Module 'cephadm' has experienced an error
> and cannot handle commands:
> ContainerInspectInfo(image_id='3c937764e6f5de1131b469dc69f0db09f8bd55cf6c983482cde518596d3dd0e5',
> ceph_version='ceph version 18.2.2
> (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)',
> repo_digests=[''])
> Unable to set up "admin" label; assuming older version of Ceph
> Saving cluster configuration to
> /var/lib/ceph/65bee110-3ae6-11ef-a1de-005056013d88/config directory
> Enabling autotune for osd_memory_target
> You can access the Ceph CLI as following in case of mult

[ceph-users] Re: ceph orchestrator upgrade quincy to reef, missing ceph-exporter

2024-08-02 Thread Adam King
ceph-exporter should get deployed by default with new installations on
recent versions, but as a general principle we've avoided adding/removing
services from the cluster during an upgrade. There is perhaps a case for
this service in particular if the user also has the rest of the monitoring
stack deployed, but it would be a behavior change to upgrades I'd be
cautious about adding in.

On Fri, Aug 2, 2024 at 7:45 AM Frank de Bot (lists) 
wrote:

> Hi,
>
> When upgrading a cephadm deployed quincy cluster to reef, there will be
> no ceph-exporter service launched.
>
> Being new in reef (from release notes: ceph-exporter: Now the
> performance metrics for Ceph daemons are exported by ceph-exporter,
> which deploys on each daemon rather than using prometheus exporter. This
> will reduce performance bottlenecks. ), metrics will be missing when the
> ceph-exporter service is not applied after the upgrade.
>
> Is it intentional it's not added after an upgrade by orchestrator? A
> ceph orch apply ceph-exporter solves it, but I think it could be
> included in the upgraded since it may not be clear at first why metrics
> in grafana are missing.
>
> Regards,
>
> Frank de Bot
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io