[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread Eugen Block
You still haven't provided any details (logs) of what happened. The  
short excerpt from yesterday isn't useful as it only shows the startup  
of the daemon.


Zitat von nguyenvand...@baoviet.com.vn:

Could you please help me understand the volume status "recovering"?
What is it, and do we need to wait for the volume recovery to finish?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm purge cluster does not work

2024-02-23 Thread Vahideh Alinouri
Hi Guys,

I faced an issue: when I wanted to purge the cluster, it was not purged
using the commands below:

ceph mgr module disable cephadm
cephadm rm-cluster --force --zap-osds --fsid 

The OSDs remain. There should be a cleanup method for the whole
cluster, not just the MON nodes. Is there anything related to this?

Regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread nguyenvandiep
https://drive.google.com/file/d/1OIN5O2Vj0iWfEMJ2fyHN_xV6fpknBmym/view?usp=sharing

Please check my MDS log, which was generated by the command:

cephadm logs --name mds.cephfs.cephgw02.qqsavr --fsid 
258af72a-cff3-11eb-a261-d4f5ef25154c
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread Eugen Block

This seems to be the relevant stack trace:

---snip---
Feb 23 15:18:39 cephgw02 conmon[2158052]: debug -1>  
2024-02-23T08:18:39.609+ 7fccc03c0700 -1  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h: In function 'std::ostream& operator<<(std::ostream&, const ClientMetricType&)' thread 7fccc03c0700 time  
2024-02-23T08:18:39.609581+
Feb 23 15:18:39 cephgw02 conmon[2158052]:  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h: 56: ceph_abort_msg("abort()  
called")

Feb 23 15:18:39 cephgw02 conmon[2158052]:
Feb 23 15:18:39 cephgw02 conmon[2158052]:  ceph version 16.2.4  
(3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
Feb 23 15:18:39 cephgw02 conmon[2158052]:  1: (ceph::__ceph_abort(char  
const*, int, char const*, std::__cxx11::basic_stringstd::char_traits, std::allocator > const&)+0xe5)  
[0x7fccc9021cdc]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  2:  
(operator<<(std::ostream&, ClientMetricType const&)+0x10e)  
[0x7fccc92a642e]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  3:  
(MClientMetrics::print(std::ostream&) const+0x1a1) [0x7fccc92a6601]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  4:  
(DispatchQueue::pre_dispatch(boost::intrusive_ptr  
const&)+0x710) [0x7fccc9259c30]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  5:  
(DispatchQueue::entry()+0xdeb) [0x7fccc925b69b]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  6:  
(DispatchQueue::DispatchThread::entry()+0x11) [0x7fccc930bb71]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  7:  
/lib64/libpthread.so.0(+0x814a) [0x7fccc7dc314a]

Feb 23 15:18:39 cephgw02 conmon[2158052]:  8: clone()
Feb 23 15:18:39 cephgw02 conmon[2158052]:
Feb 23 15:18:39 cephgw02 conmon[2158052]: debug  0>  
2024-02-23T08:18:39.610+ 7fccc03c0700 -1 *** Caught signal  
(Aborted) **
Feb 23 15:18:39 cephgw02 conmon[2158052]:  in thread 7fccc03c0700  
thread_name:ms_dispatch

Feb 23 15:18:39 cephgw02 conmon[2158052]:
Feb 23 15:18:39 cephgw02 conmon[2158052]:  ceph version 16.2.4  
(3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
Feb 23 15:18:39 cephgw02 conmon[2158052]:  1:  
/lib64/libpthread.so.0(+0x12b20) [0x7fccc7dcdb20]

Feb 23 15:18:39 cephgw02 conmon[2158052]:  2: gsignal()
Feb 23 15:18:39 cephgw02 conmon[2158052]:  3: abort()
Feb 23 15:18:39 cephgw02 conmon[2158052]:  4: (ceph::__ceph_abort(char  
const*, int, char const*, std::__cxx11::basic_stringstd::char_traits, std::allocator > const&)+0x1b6)  
[0x7fccc9021dad]

Feb 23 15:18:39 cephgw02 conmon[2158052]:  5: (opera
Feb 23 15:18:39 cephgw02 conmon[2158052]: tor<<(std::ostream&,  
ClientMetricType const&)+0x10e) [0x7fccc92a642e]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  6:  
(MClientMetrics::print(std::ostream&) const+0x1a1) [0x7fccc92a6601]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  7:  
(DispatchQueue::pre_dispatch(boost::intrusive_ptr  
const&)+0x710) [0x7fccc9259c30]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  8:  
(DispatchQueue::entry()+0xdeb) [0x7fccc925b69b]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  9:  
(DispatchQueue::DispatchThread::entry()+0x11) [0x7fccc930bb71]
Feb 23 15:18:39 cephgw02 conmon[2158052]:  10:  
/lib64/libpthread.so.0(+0x814a) [0x7fccc7dc314a]

Feb 23 15:18:39 cephgw02 conmon[2158052]:  11: clone()
---snip---

But I can't really help here, hopefully someone else can chime in and  
interpret it.



Zitat von nguyenvand...@baoviet.com.vn:


https://drive.google.com/file/d/1OIN5O2Vj0iWfEMJ2fyHN_xV6fpknBmym/view?usp=sharing

Pls check my mds log which generate by command

cephadm logs --name mds.cephfs.cephgw02.qqsavr --fsid  
258af72a-cff3-11eb-a261-d4f5ef25154c

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm purge cluster does not work

2024-02-23 Thread Eugen Block
Which ceph version is this? In a small Reef test cluster this works as  
expected:


# cephadm rm-cluster --fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a  
--zap-osds --force
Using recent ceph image  
registry.cloud.hh.nde.ag/ebl/ceph-upstream@sha256:057e08bf8d2d20742173a571bc28b65674b055bebe5f4c6cd488c1a6fd51f685

Zapping /dev/sdb...
Zapping /dev/sdc...
Zapping /dev/sdd...

and lsblk shows empty drives.

Zitat von Vahideh Alinouri :


Hi Guys,

I faced an issue. When I wanted to purge, the cluster was not purged
using the below command:

ceph mgr module disable cephadm
cephadm rm-cluster --force --zap-osds --fsid 

The OSDs will remain. There should be some cleanup methods for the
whole cluster, not just MON nodes. Is there anything related to this?

Regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg repair doesn't fix "got incorrect hash on read" / "candidate had an ec hash mismatch"

2024-02-23 Thread Kai Stian Olstad

Hi,

Does no one have any comment at all?
I'm not picky, so any speculation, guesses, "I would", "I wouldn't", "should
work" and so on would be highly appreciated.



Since 4 out of 6 shards in the EC 4+2 pool are OK and ceph pg repair doesn't
solve it, I think the following might work.


pg 404.bc acting [223,297,269,276,136,197]

- Use pgremapper to move all PGs on OSD 223 and 269, except 404.bc, to
other OSDs.
- Set min_size to 4: ceph osd pool set default.rgw.buckets.data min_size 4

- Stop osd.223 and osd.269

What I hope will happen is that Ceph then recreates the 404.bc shards
s0 (osd.223) and s2 (osd.269), since they are now down, from the remaining
shards

s1(osd.297), s3(osd.276), s4(osd.136) and s5(osd.197)
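
For what it's worth, a rough sketch of the last two steps, assuming a
non-containerized deployment (with cephadm the systemd unit names differ) and
that min_size is currently 5; this is untested speculation, not a tested
procedure:

# allow 404.bc to stay active with only 4 of 6 shards while the OSDs are down
ceph osd pool set default.rgw.buckets.data min_size 4

# stop the OSDs holding the suspect shards s0 and s2
systemctl stop ceph-osd@223 ceph-osd@269

# watch the PG go undersized/degraded and get its shards rebuilt elsewhere
ceph pg map 404.bc
ceph -w

# once 404.bc is active+clean again, restart the OSDs and restore min_size
systemctl start ceph-osd@223 ceph-osd@269
ceph osd pool set default.rgw.buckets.data min_size 5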


_Any_ comment is highly appreciated.

-
Kai Stian Olstad


On 21.02.2024 13:27, Kai Stian Olstad wrote:

Hi,

Short summary

PG 404.bc is an EC 4+2 where s0 and s2 report hash mismatch for 698
objects.
Ceph pg repair doesn't fix it, because if you run deep-scrub on the PG
after the repair has finished, it still reports scrub errors.


Why can't ceph pg repair fix this? It has 4 out of 6 shards and should be able
to reconstruct the corrupted ones.
Is there a way to fix this, like deleting the s0 and s2 shards so they are
forced to be recreated?



Long detailed summary

A short backstory.
* This is the aftermath of problems with mclock, post "17.2.7: Backfilling
deadlock / stall / stuck / standstill" [1].

  - 4 OSDs had a few bad sectors; all 4 were set out and the cluster stopped.
  - The solution was to swap from mclock to wpq and restart all OSDs.
  - When all backfilling was finished, all 4 OSDs were replaced.
  - osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.


PG / pool 404 is EC 4+2 default.rgw.buckets.data

Nine days after osd.223 and osd.269 were replaced, deep-scrub was run and
reported errors:

ceph status
---
HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg 
inconsistent

[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+inconsistent, acting 
[223,297,269,276,136,197]


I then run repair
ceph pg repair 404.bc

And ceph status showed this
ceph status
---
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired

But osd.223 and osd.269 are new disks, and the disks have no SMART errors
or any I/O errors in the OS logs.

So I tried to run deep-scrub again on the PG.
ceph pg deep-scrub 404.bc

And got this result.

ceph status
---
HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs; 
Possible data damage: 1 pg inconsistent

[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair, 
acting [223,297,269,276,136,197]


698 + 698 = 1396 so the same amount of errors.

Run repair again on 404.bc and ceph status is

HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 1396 reads repaired
osd.269 had 1396 reads repaired

So even when the repair finishes it doesn't fix the problem, since the errors
reappear after a deep-scrub.


The logs for osd.223 and osd.269 contain "got incorrect hash on read"
and "candidate had an ec hash mismatch" for 698 unique objects.
I only show the log for 1 of the 698 objects; it is the same for the
other 697 objects.


osd.223 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)

---
Feb 20 10:31:00 ceph-hd-003 ceph-osd[3665432]: osd.223 pg_epoch: 
231235 pg[404.bcs0( v 231235'1636919 (231078'1632435,231235'1636919] 
local-lis/les=226263/226264 n=296580 ec=36041/27862 lis/c=226263/226263 
les/c/f=226264/230954/0 sis=226263) [223,297,269,276,136,197]p223(0) 
r=0 lpr=226263 crt=231235'1636919 lcod 231235'1636918 mlcod 
231235'1636918 active+clean+scrubbing+deep+inconsistent+repair [ 
404.bcs0:  REQ_SCRUB ]  MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned 
REQ_SCRUB] _scan_list  
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head 
got incorrect hash on read 0xc5d1dd1b !=  expected 0x7c2f86d7
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]: log_channel(cluster) 
log [ERR] : 404.bc shard 223(0) soid 
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head 
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]: log_channel(cluster) 
log [ERR] : 404.bc shard 269(2) soid 
404:3d001f95:::1f244892-a2e7-4

[ceph-users] MDS in ReadOnly and 2 MDS behind on trimming

2024-02-23 Thread Edouard FAZENDA
Dear Ceph Community,

 

I am having an issue with my Ceph cluster: several OSDs crashed, but they are
now active again and recovery has finished. However, the CephFS filesystem
cannot be accessed by clients in RW (K8S workload), as 1 MDS is in read-only
mode and 2 are behind on trimming.

The CephFS volume itself seems OK.

The trimming process does not seem to be making progress; maybe it is stuck?

 

We are running 3 hosts using ceph Pacific version 16.2.1

 

Here some logs on the situation :

 

ceph versions

{

"mon": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 3

},

"mgr": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 3

},

"osd": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 18

},

"mds": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 3

},

"rgw": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 6

},

"overall": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 33

}

}

 

ceph orch ps

NAME  HOST   STATUS REFRESHED  AGE
PORTS  VERSION  IMAGE ID  CONTAINER ID

crash.rke-sh1-1   rke-sh1-1  running (21h)  36s ago21h  -
16.2.1   c757e4a3636b  e8652edb2b49

crash.rke-sh1-2   rke-sh1-2  running (21h)  3m ago 20M  -
16.2.1   c757e4a3636b  a1249a605ee0

crash.rke-sh1-3   rke-sh1-3  running (17h)  36s ago17h  -
16.2.1   c757e4a3636b  026667bc1776

mds.cephfs.rke-sh1-1.ojmpnk   rke-sh1-1  running (18h)  36s ago4M   -
16.2.1   c757e4a3636b  9b4c2b08b759

mds.cephfs.rke-sh1-2.isqjza   rke-sh1-2  running (18h)  3m ago 23M  -
16.2.1   c757e4a3636b  71681a5f34d3

mds.cephfs.rke-sh1-3.vdicdn   rke-sh1-3  running (17h)  36s ago3M   -
16.2.1   c757e4a3636b  e89946ad6b7e

mgr.rke-sh1-1.qskoyj  rke-sh1-1  running (21h)  36s ago2y
*:8082 *:9283  16.2.1   c757e4a3636b  7ce7cfbb3e55

mgr.rke-sh1-2.lxmguj  rke-sh1-2  running (21h)  3m ago 22M
*:8082 *:9283  16.2.1   c757e4a3636b  5a0025adfd46

mgr.rke-sh1-3.ckunvo  rke-sh1-3  running (17h)  36s ago6M
*:8082 *:9283  16.2.1   c757e4a3636b  2fcaf18f3218

mon.rke-sh1-1 rke-sh1-1  running (20h)  36s ago20h  -
16.2.1   c757e4a3636b  c0a90103cabc

mon.rke-sh1-2 rke-sh1-2  running (21h)  3m ago 3M   -
16.2.1   c757e4a3636b  f4b32ba4466b

mon.rke-sh1-3 rke-sh1-3  running (17h)  36s ago17h  -
16.2.1   c757e4a3636b  d5e44c245998

osd.0 rke-sh1-2  running (20h)  3m ago 2y   -
16.2.1   c757e4a3636b  7b0e69942c15

osd.1 rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  4451654d9a2d

osd.10rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  3f9d5f95e284

osd.11rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  db1cc6d2e37f

osd.12rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  de416c1ef766

osd.13rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  25a281cc5a9b

osd.14rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  62f25ba61667

osd.15rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  d3514d823c45

osd.16rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  bba857759bfe

osd.17rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  59281d4bb3d0

osd.2 rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  418041b5e60d

osd.3 rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  04a0e29d5623

osd.4 rke-sh1-1  running (20h)  36s ago2y   -
16.2.1   c757e4a3636b  1cc78a5153d3

osd.5 rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  39a4b11e31fb

osd.6 rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  2f218ffb566e

osd.7 rke-sh1-1  running (20h)  36s ago2y   -
16.2.1   c757e4a3636b  cf761fbe4d5f

osd.8 rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  f9f85480e800

osd.9 rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  664c54ff46d2

rgw.default.rke-sh1-1.dgucwl  rke-sh1-1  running (21h)  36s ago22M
*:8000 16.2.1   c757e4a3636b  f03212b955a7

rgw.default.rke-sh1-1.vylchc  rke-sh1-1  running (21h)  36s ago22M
*:8001 16.2.1   c757e4a3636b  da486ce43fe5

rgw.default.rke-sh1-2.dfhhfw  rke-sh1-2  runn

[ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

2024-02-23 Thread Eugen Block

Hi,

the mds log should contain information why it goes into read-only  
mode. Just a few weeks ago I helped a user with a broken CephFS (MDS  
went into read-only mode because of missing objects in the journal).  
Can you check the journal status:


# cephfs-journal-tool --rank=cephfs:0 --journal=mdlog journal inspect

# cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal inspect

and also share the logs.

Thanks,
Eugen

Zitat von Edouard FAZENDA :


Dear Ceph Community,



I am having an issue with my Ceph Cluster , there were several osd crashing
but now active and recovery finished and now the CephFS filesystem cannot be
access by clients in RW (K8S worklod) as the 1 MDS is in Read-Only and 2 are
being on trimming



The cephfs seems to have volume OK



The trimming process seems not going further, maybe stuck ?



We are running 3 hosts using ceph Pacific version 16.2.1



Here some logs on the situation :



ceph versions

{

"mon": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 3

},

"mgr": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 3

},

"osd": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 18

},

"mds": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 3

},

"rgw": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 6

},

"overall": {

"ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10)
pacific (stable)": 33

}

}



ceph orch ps

NAME  HOST   STATUS REFRESHED  AGE
PORTS  VERSION  IMAGE ID  CONTAINER ID

crash.rke-sh1-1   rke-sh1-1  running (21h)  36s ago21h  -
16.2.1   c757e4a3636b  e8652edb2b49

crash.rke-sh1-2   rke-sh1-2  running (21h)  3m ago 20M  -
16.2.1   c757e4a3636b  a1249a605ee0

crash.rke-sh1-3   rke-sh1-3  running (17h)  36s ago17h  -
16.2.1   c757e4a3636b  026667bc1776

mds.cephfs.rke-sh1-1.ojmpnk   rke-sh1-1  running (18h)  36s ago4M   -
16.2.1   c757e4a3636b  9b4c2b08b759

mds.cephfs.rke-sh1-2.isqjza   rke-sh1-2  running (18h)  3m ago 23M  -
16.2.1   c757e4a3636b  71681a5f34d3

mds.cephfs.rke-sh1-3.vdicdn   rke-sh1-3  running (17h)  36s ago3M   -
16.2.1   c757e4a3636b  e89946ad6b7e

mgr.rke-sh1-1.qskoyj  rke-sh1-1  running (21h)  36s ago2y
*:8082 *:9283  16.2.1   c757e4a3636b  7ce7cfbb3e55

mgr.rke-sh1-2.lxmguj  rke-sh1-2  running (21h)  3m ago 22M
*:8082 *:9283  16.2.1   c757e4a3636b  5a0025adfd46

mgr.rke-sh1-3.ckunvo  rke-sh1-3  running (17h)  36s ago6M
*:8082 *:9283  16.2.1   c757e4a3636b  2fcaf18f3218

mon.rke-sh1-1 rke-sh1-1  running (20h)  36s ago20h  -
16.2.1   c757e4a3636b  c0a90103cabc

mon.rke-sh1-2 rke-sh1-2  running (21h)  3m ago 3M   -
16.2.1   c757e4a3636b  f4b32ba4466b

mon.rke-sh1-3 rke-sh1-3  running (17h)  36s ago17h  -
16.2.1   c757e4a3636b  d5e44c245998

osd.0 rke-sh1-2  running (20h)  3m ago 2y   -
16.2.1   c757e4a3636b  7b0e69942c15

osd.1 rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  4451654d9a2d

osd.10rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  3f9d5f95e284

osd.11rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  db1cc6d2e37f

osd.12rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  de416c1ef766

osd.13rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  25a281cc5a9b

osd.14rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  62f25ba61667

osd.15rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  d3514d823c45

osd.16rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  bba857759bfe

osd.17rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  59281d4bb3d0

osd.2 rke-sh1-1  running (21h)  36s ago2y   -
16.2.1   c757e4a3636b  418041b5e60d

osd.3 rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  04a0e29d5623

osd.4 rke-sh1-1  running (20h)  36s ago2y   -
16.2.1   c757e4a3636b  1cc78a5153d3

osd.5 rke-sh1-3  running (17h)  36s ago2y   -
16.2.1   c757e4a3636b  39a4b11e31fb

osd.6 rke-sh1-2  running (21h)  3m ago 2y   -
16.2.1   c757e4a3636b  2f218ffb566e

osd.7 rke-sh1-1  running (20h)  36s ago2y   -
16.2.1   c757e4a3636b  cf761fbe4d5f

osd.8 rke-sh1-3  running (17

[ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

2024-02-23 Thread Edouard FAZENDA
Hi Eugen,

Thanks for the reply, really appreciate

The first command just hangs with no output:
# cephfs-journal-tool --rank=cephfs:0 --journal=mdlog journal inspect

The second command:

# cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal inspect
Overall journal integrity: OK

root@rke-sh1-2:~# cephadm logs --fsid fcb373ce-7aaa-11eb-984f-e7c6e0038e87 
--name mds.cephfs.rke-sh1-2.isqjza
-- Logs begin at Fri 2024-02-23 04:49:32 UTC, end at Fri 2024-02-23 13:08:22 
UTC. --
Feb 23 07:46:46 rke-sh1-2 bash[1058012]: ignoring --setuser ceph since I am not 
root
Feb 23 07:46:46 rke-sh1-2 bash[1058012]: ignoring --setgroup ceph since I am 
not root
Feb 23 07:46:46 rke-sh1-2 bash[1058012]: starting mds.cephfs.rke-sh1-2.isqjza at
Feb 23 08:15:06 rke-sh1-2 bash[1058012]: debug 2024-02-23T08:15:06.371+ 
7fbc17dd9700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
Feb 23 08:15:13 rke-sh1-2 bash[1058012]: debug 2024-02-23T08:15:13.155+ 
7fbc145d2700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 
object, errno -22
Feb 23 08:15:13 rke-sh1-2 bash[1058012]: debug 2024-02-23T08:15:13.155+ 
7fbc145d2700 -1 mds.0.12487 unhandled write error (22) Invalid argument, force 
readonly...
Feb 23 10:20:36 rke-sh1-2 bash[1058012]: debug 2024-02-23T10:20:36.309+ 
7fbc17dd9700 -1 mds.pinger is_rank_lagging: rank=1 was never sent ping request.

root@rke-sh1-3:~# cephadm logs --fsid fcb373ce-7aaa-11eb-984f-e7c6e0038e87 
--name mds.cephfs.rke-sh1-3.vdicdn
-- Logs begin at Fri 2024-02-23 06:59:48 UTC, end at Fri 2024-02-23 13:09:18 
UTC. --
Feb 23 07:46:46 rke-sh1-3 bash[2901]: ignoring --setuser ceph since I am not 
root
Feb 23 07:46:46 rke-sh1-3 bash[2901]: ignoring --setgroup ceph since I am not 
root
Feb 23 07:46:46 rke-sh1-3 bash[2901]: starting mds.cephfs.rke-sh1-3.vdicdn at
Feb 23 10:25:51 rke-sh1-3 bash[2901]: ignoring --setuser ceph since I am not 
root
Feb 23 10:25:51 rke-sh1-3 bash[2901]: ignoring --setgroup ceph since I am not 
root
Feb 23 10:25:51 rke-sh1-3 bash[2901]: starting mds.cephfs.rke-sh1-3.vdicdn at

debug2: channel 0: request window-change confirm 0
debug3: send packet: type 98
-- Logs begin at Fri 2024-02-23 00:24:42 UTC, end at Fri 2024-02-23 13:09:55 
UTC. --
Feb 23 09:29:10 rke-sh1-1 bash[786820]: tcmalloc: large alloc 1073750016 bytes 
== 0x5598512de000 @  0x7fb426636760 0x7fb426657c64 0x5597c1ccaaba 
0x7fb41bc04218 0x7fb41bc0ed5b 0x7fb41bbfeda4 0x7fb41da6>
Feb 23 09:29:19 rke-sh1-1 bash[786820]: tcmalloc: large alloc 2147491840 bytes 
== 0x559891ae @  0x7fb426636760 0x7fb426657c64 0x5597c1ccaaba 
0x7fb41bc04218 0x7fb41bc0ed5b 0x7fb41bbfeda4 0x7fb41db3>
Feb 23 09:29:26 rke-sh1-1 bash[786820]: tcmalloc: large alloc 2147491840 bytes 
== 0x559951ae4000 @  0x7fb426636760 0x7fb426657c64 0x5597c1ccaaba 
0x7fb41bc04218 0x7fb41bc0ed5b 0x7fb41bbfeda4 0x7fb41da6>
Feb 23 09:29:27 rke-sh1-1 bash[786820]: debug 2024-02-23T09:29:27.928+ 
7fb416d63700 -1 asok(0x5597c3904000) AdminSocket: error writing response length 
(32) Broken pipe
Feb 23 12:35:53 rke-sh1-1 bash[786820]: ignoring --setuser ceph since I am not 
root
Feb 23 12:35:53 rke-sh1-1 bash[786820]: ignoring --setgroup ceph since I am not 
root
Feb 23 12:35:53 rke-sh1-1 bash[786820]: starting mds.cephfs.rke-sh1-1.ojmpnk at


The MDS logs are at verbosity 20; do you want me to provide them as an
archive?

Is there a way to compact all the logs?

Best Regards, 

Edouard FAZENDA
Technical Support
 


Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40
 
www.csti.ch

-Original Message-
From: Eugen Block  
Sent: vendredi, 23 février 2024 12:50
To: ceph-users@ceph.io
Subject: [ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

Hi,

the mds log should contain information why it goes into read-only mode. Just a 
few weeks ago I helped a user with a broken CephFS (MDS went into read-only 
mode because of missing objects in the journal).  
Can you check the journal status:

# cephfs-journal-tool --rank=cephfs:0 --journal=mdlog journal inspect

# cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal inspect

and also share the logs.

Thanks,
Eugen

Zitat von Edouard FAZENDA :

> Dear Ceph Community,
>
>
>
> I am having an issue with my Ceph Cluster , there were several osd 
> crashing but now active and recovery finished and now the CephFS 
> filesystem cannot be access by clients in RW (K8S worklod) as the 1 
> MDS is in Read-Only and 2 are being on trimming
>
>
>
> The cephfs seems to have volume OK
>
>
>
> The trimming process seems not going further, maybe stuck ?
>
>
>
> We are running 3 hosts using ceph Pacific version 16.2.1
>
>
>
> Here some logs on the situation :
>
>
>
> ceph versions
>
> {
>
> "mon": {
>
> "ceph version 16.2.1 
> (afb9061ab4117f798c858c741efa6390e48ccf10)
> pacific (stable)": 3
>
> },
>
> "mgr": {
>
> "ceph version 16.2.1 
> (afb9061ab4117f798c858c741efa6390e48ccf10)
> pacific (stable)": 3
>
> },
>
> 

[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread David C.
Hi,
The problem seems to come from the clients (reconnect).

Test by disabling metrics on all clients:
echo Y > /sys/module/ceph/parameters/disable_send_metrics
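
If that helps, one way to keep it persistent across reboots on each client
node is roughly the following (kernel CephFS client assumed; the modprobe.d
file name is just an example):

echo "options ceph disable_send_metrics=Y" > /etc/modprobe.d/ceph-disable-metrics.conf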



Cordialement,

*David CASIER*





Le ven. 23 févr. 2024 à 10:20, Eugen Block  a écrit :

> This seems to be the relevant stack trace:
>
> ---snip---
> Feb 23 15:18:39 cephgw02 conmon[2158052]: debug -1>
> 2024-02-23T08:18:39.609+ 7fccc03c0700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h:
> In function 'std::ostream& operator<<(std::ostream&, const
> ClientMetricType&)' thread 7fccc03c0700 time
> 2024-02-23T08:18:39.609581+
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h:
> 56: ceph_abort_msg("abort()
> called")
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  ceph version 16.2.4
> (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  1: (ceph::__ceph_abort(char
> const*, int, char const*, std::__cxx11::basic_string std::char_traits, std::allocator > const&)+0xe5)
> [0x7fccc9021cdc]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  2:
> (operator<<(std::ostream&, ClientMetricType const&)+0x10e)
> [0x7fccc92a642e]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  3:
> (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7fccc92a6601]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  4:
> (DispatchQueue::pre_dispatch(boost::intrusive_ptr
> const&)+0x710) [0x7fccc9259c30]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  5:
> (DispatchQueue::entry()+0xdeb) [0x7fccc925b69b]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  6:
> (DispatchQueue::DispatchThread::entry()+0x11) [0x7fccc930bb71]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  7:
> /lib64/libpthread.so.0(+0x814a) [0x7fccc7dc314a]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  8: clone()
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> Feb 23 15:18:39 cephgw02 conmon[2158052]: debug  0>
> 2024-02-23T08:18:39.610+ 7fccc03c0700 -1 *** Caught signal
> (Aborted) **
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  in thread 7fccc03c0700
> thread_name:ms_dispatch
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  ceph version 16.2.4
> (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  1:
> /lib64/libpthread.so.0(+0x12b20) [0x7fccc7dcdb20]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  2: gsignal()
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  3: abort()
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  4: (ceph::__ceph_abort(char
> const*, int, char const*, std::__cxx11::basic_string std::char_traits, std::allocator > const&)+0x1b6)
> [0x7fccc9021dad]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  5: (opera
> Feb 23 15:18:39 cephgw02 conmon[2158052]: tor<<(std::ostream&,
> ClientMetricType const&)+0x10e) [0x7fccc92a642e]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  6:
> (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7fccc92a6601]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  7:
> (DispatchQueue::pre_dispatch(boost::intrusive_ptr
> const&)+0x710) [0x7fccc9259c30]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  8:
> (DispatchQueue::entry()+0xdeb) [0x7fccc925b69b]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  9:
> (DispatchQueue::DispatchThread::entry()+0x11) [0x7fccc930bb71]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  10:
> /lib64/libpthread.so.0(+0x814a) [0x7fccc7dc314a]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  11: clone()
> ---snip---
>
> But I can't really help here, hopefully someone else can chime in and
> interpret it.
>
>
> Zitat von nguyenvand...@baoviet.com.vn:
>
> >
> https://drive.google.com/file/d/1OIN5O2Vj0iWfEMJ2fyHN_xV6fpknBmym/view?usp=sharing
> >
> > Pls check my mds log which generate by command
> >
> > cephadm logs --name mds.cephfs.cephgw02.qqsavr --fsid
> > 258af72a-cff3-11eb-a261-d4f5ef25154c
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

2024-02-23 Thread Eugen Block
2024-02-23T08:15:13.155+ 7fbc145d2700 -1 log_channel(cluster)  
log [ERR] : failed to commit dir 0x1 object, errno -22
2024-02-23T08:15:13.155+ 7fbc145d2700 -1 mds.0.12487 unhandled  
write error (22) Invalid argument, force readonly...


Was your cephfs metadata pool full? This tracker  
(https://tracker.ceph.com/issues/52260) sounds very similar but I  
don't see a solution for it.
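
A quick way to check, if you haven't already (standard commands, nothing
specific to your setup):

# fill level and quota of the metadata pool
ceph df detail
ceph osd pool get-quota cephfs_metadata

# any full/nearfull conditions on the OSDs behind it
ceph health detail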



Zitat von Edouard FAZENDA :


Hi Eugen,

Thanks for the reply, really appreciate

The first command , just hang with no output
# cephfs-journal-tool --rank=cephfs:0 --journal=mdlog journal inspect

The second command

# cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal inspect
Overall journal integrity: OK

root@rke-sh1-2:~# cephadm logs --fsid  
fcb373ce-7aaa-11eb-984f-e7c6e0038e87 --name  
mds.cephfs.rke-sh1-2.isqjza
-- Logs begin at Fri 2024-02-23 04:49:32 UTC, end at Fri 2024-02-23  
13:08:22 UTC. --
Feb 23 07:46:46 rke-sh1-2 bash[1058012]: ignoring --setuser ceph  
since I am not root
Feb 23 07:46:46 rke-sh1-2 bash[1058012]: ignoring --setgroup ceph  
since I am not root
Feb 23 07:46:46 rke-sh1-2 bash[1058012]: starting  
mds.cephfs.rke-sh1-2.isqjza at
Feb 23 08:15:06 rke-sh1-2 bash[1058012]: debug  
2024-02-23T08:15:06.371+ 7fbc17dd9700 -1 mds.pinger  
is_rank_lagging: rank=0 was never sent ping request.
Feb 23 08:15:13 rke-sh1-2 bash[1058012]: debug  
2024-02-23T08:15:13.155+ 7fbc145d2700 -1 log_channel(cluster)  
log [ERR] : failed to commit dir 0x1 object, errno -22
Feb 23 08:15:13 rke-sh1-2 bash[1058012]: debug  
2024-02-23T08:15:13.155+ 7fbc145d2700 -1 mds.0.12487 unhandled  
write error (22) Invalid argument, force readonly...
Feb 23 10:20:36 rke-sh1-2 bash[1058012]: debug  
2024-02-23T10:20:36.309+ 7fbc17dd9700 -1 mds.pinger  
is_rank_lagging: rank=1 was never sent ping request.


root@rke-sh1-3:~# cephadm logs --fsid  
fcb373ce-7aaa-11eb-984f-e7c6e0038e87 --name  
mds.cephfs.rke-sh1-3.vdicdn
-- Logs begin at Fri 2024-02-23 06:59:48 UTC, end at Fri 2024-02-23  
13:09:18 UTC. --
Feb 23 07:46:46 rke-sh1-3 bash[2901]: ignoring --setuser ceph since  
I am not root
Feb 23 07:46:46 rke-sh1-3 bash[2901]: ignoring --setgroup ceph since  
I am not root

Feb 23 07:46:46 rke-sh1-3 bash[2901]: starting mds.cephfs.rke-sh1-3.vdicdn at
Feb 23 10:25:51 rke-sh1-3 bash[2901]: ignoring --setuser ceph since  
I am not root
Feb 23 10:25:51 rke-sh1-3 bash[2901]: ignoring --setgroup ceph since  
I am not root

Feb 23 10:25:51 rke-sh1-3 bash[2901]: starting mds.cephfs.rke-sh1-3.vdicdn at

debug2: channel 0: request window-change confirm 0
debug3: send packet: type 98
-- Logs begin at Fri 2024-02-23 00:24:42 UTC, end at Fri 2024-02-23  
13:09:55 UTC. --
Feb 23 09:29:10 rke-sh1-1 bash[786820]: tcmalloc: large alloc  
1073750016 bytes == 0x5598512de000 @  0x7fb426636760 0x7fb426657c64  
0x5597c1ccaaba 0x7fb41bc04218 0x7fb41bc0ed5b 0x7fb41bbfeda4  
0x7fb41da6>
Feb 23 09:29:19 rke-sh1-1 bash[786820]: tcmalloc: large alloc  
2147491840 bytes == 0x559891ae @  0x7fb426636760 0x7fb426657c64  
0x5597c1ccaaba 0x7fb41bc04218 0x7fb41bc0ed5b 0x7fb41bbfeda4  
0x7fb41db3>
Feb 23 09:29:26 rke-sh1-1 bash[786820]: tcmalloc: large alloc  
2147491840 bytes == 0x559951ae4000 @  0x7fb426636760 0x7fb426657c64  
0x5597c1ccaaba 0x7fb41bc04218 0x7fb41bc0ed5b 0x7fb41bbfeda4  
0x7fb41da6>
Feb 23 09:29:27 rke-sh1-1 bash[786820]: debug  
2024-02-23T09:29:27.928+ 7fb416d63700 -1 asok(0x5597c3904000)  
AdminSocket: error writing response length (32) Broken pipe
Feb 23 12:35:53 rke-sh1-1 bash[786820]: ignoring --setuser ceph  
since I am not root
Feb 23 12:35:53 rke-sh1-1 bash[786820]: ignoring --setgroup ceph  
since I am not root
Feb 23 12:35:53 rke-sh1-1 bash[786820]: starting  
mds.cephfs.rke-sh1-1.ojmpnk at



The logs of the MDS are in verbose 20 , do you want me to provide on  
a archive ?


Is there a way to compact all the logs ?

Best Regards,

Edouard FAZENDA
Technical Support



Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40

www.csti.ch

-Original Message-
From: Eugen Block 
Sent: vendredi, 23 février 2024 12:50
To: ceph-users@ceph.io
Subject: [ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

Hi,

the mds log should contain information why it goes into read-only  
mode. Just a few weeks ago I helped a user with a broken CephFS (MDS  
went into read-only mode because of missing objects in the journal).

Can you check the journal status:

# cephfs-journal-tool --rank=cephfs:0 --journal=mdlog journal inspect

# cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal inspect

and also share the logs.

Thanks,
Eugen

Zitat von Edouard FAZENDA :


Dear Ceph Community,



I am having an issue with my Ceph Cluster , there were several osd
crashing but now active and recovery finished and now the CephFS
filesystem cannot be access by clients in RW (K8S worklod) as the 1
MDS is in Read-Only and 2 are being on trimming



The cephfs seems to have volume OK



The trimming

[ceph-users] Re: cephadm purge cluster does not work

2024-02-23 Thread Eugen Block
It works for me on 17.2.6 as well. Could you be more specific what  
doesn't work for you? Running that command only removes the cluster  
configs etc. on that host, it does not orchestrate a removal on all  
hosts, not sure if you're aware of that.
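
In other words, something along these lines has to be run on every host that
had daemons of the old cluster (sketch only, device name is an example):

cephadm rm-cluster --fsid <your-fsid> --zap-osds --force

# if OSD devices are still left over afterwards, zap them manually
ceph-volume lvm zap --destroy /dev/sdX     # or via 'cephadm ceph-volume'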


Zitat von Vahideh Alinouri :


The version that has been installed is 17.2.5. But this method does not
work at all.

On Fri, Feb 23, 2024, 10:23 AM Eugen Block  wrote:


Which ceph version is this? In a small Reef test cluster this works as
expected:

# cephadm rm-cluster --fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
--zap-osds --force
Using recent ceph image

registry.cloud.hh.nde.ag/ebl/ceph-upstream@sha256:057e08bf8d2d20742173a571bc28b65674b055bebe5f4c6cd488c1a6fd51f685
Zapping /dev/sdb...
Zapping /dev/sdc...
Zapping /dev/sdd...

and lsblk shows empty drives.

Zitat von Vahideh Alinouri :

> Hi Guys,
>
> I faced an issue. When I wanted to purge, the cluster was not purged
> using the below command:
>
> ceph mgr module disable cephadm
> cephadm rm-cluster --force --zap-osds --fsid 
>
> The OSDs will remain. There should be some cleanup methods for the
> whole cluster, not just MON nodes. Is there anything related to this?
>
> Regards
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Issue with Setting Public/Private Permissions for Bucket

2024-02-23 Thread asad . siddiqui
Hi Team,

I'm currently working with Ceph object storage and would like to understand how 
to set permissions to private or public on buckets/objects.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ambigous mds behind on trimming and slowops (ceph 17.2.5 and rook operator 1.10.8)

2024-02-23 Thread a . warkhade98
Team,

We were facing a CephFS volume mount issue, and ceph status was showing:
 MDS slow requests
 MDS behind on trimming

After restarting the MDS pods it was resolved, but I wanted to know the root
cause. It started about 2 hours after one of the active MDS daemons crashed.
So can an active MDS crash cause this issue?

Please provide your inputs, anyone.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2024-02-23 Thread coupon405
Hi Reza,
I know this is an old thread, but I am running into a similar issue with the 
same error messages. Were you able to get around the upgrade issue? If so, 
what helped resolve it?


Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue with Setting Public/Private Permissions for Bucket

2024-02-23 Thread Matthew Leonard (BLOOMBERG/ 120 PARK)
https://docs.aws.amazon.com/AmazonS3/latest/userguide/acl-overview.html
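
For a concrete starting point, a minimal example of toggling a bucket between
public-read and private through RGW's S3 API (endpoint URL and bucket/object
names are placeholders, aws CLI configured with the owner's keys):

# make a bucket and one object publicly readable
aws --endpoint-url http://rgw.example.com:8080 s3api put-bucket-acl --bucket mybucket --acl public-read
aws --endpoint-url http://rgw.example.com:8080 s3api put-object-acl --bucket mybucket --key myobject --acl public-read

# switch the bucket back to private
aws --endpoint-url http://rgw.example.com:8080 s3api put-bucket-acl --bucket mybucket --acl private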

From: asad.siddi...@rapidcompute.com At: 02/23/24 09:42:29 UTC-5:00To:  
ceph-users@ceph.io
Subject: [ceph-users] Issue with Setting Public/Private Permissions for Bucket

Hi Team,

I'm currently working with Ceph object storage and would like to understand how 
to set permissions to private or public on buckets/objects in Ceph object 
storage.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Query Regarding Calculating Ingress/Egress Traffic for Buckets via API

2024-02-23 Thread asad . siddiqui
Hi,

I am currently working on Ceph object storage and would like to inquire about 
how we can calculate the ingress and egress traffic for buckets/tenant via API.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Setting Alerts/Notifications for Full Buckets in Ceph Object Storage

2024-02-23 Thread asad . siddiqui
Hi 

I'm currently working with Ceph object storage (version Reef), and I'd like to 
know how we can set up alerts/notifications for buckets when they become full.
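
For context, the closest I have found so far is a per-bucket quota plus
polling with radosgw-admin (user and bucket names below are placeholders),
but I am not sure this is the intended way:

radosgw-admin quota set --uid=myuser --quota-scope=bucket --max-size=100G --max-objects=1000000
radosgw-admin quota enable --uid=myuser --quota-scope=bucket

# current per-bucket usage (size_actual / num_objects)
radosgw-admin bucket stats --bucket=mybucket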
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CONFIGURE THE CEPH OBJECT GATEWAY

2024-02-23 Thread ashar . khan
I have configured Ceph S3 encryption, and the configuration was created 
successfully. However, when I try to upload a file to the bucket, the request 
fails. Could you please guide me on how to configure it properly?

I followed this link:
https://docs.ceph.com/en/quincy/radosgw/vault/#:~:text=aes256%2Dgcm96-,CONFIGURE%20THE%20CEPH%20OBJECT%20GATEWAY,-%EF%83%81
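
For reference, my understanding of the relevant pieces from that document is
roughly the following (values here are examples; token auth and the transit
secret engine are assumed). Corrections welcome:

rgw crypt s3 kms backend = vault
rgw crypt vault auth = token
rgw crypt vault token file = /etc/ceph/vault.token
rgw crypt vault addr = http://vault.example.com:8200
rgw crypt vault secret engine = transit

# the upload must request SSE-KMS and name an existing Vault key
aws --endpoint-url http://rgw.example.com:8080 s3api put-object \
    --bucket mybucket --key myfile --body ./myfile \
    --server-side-encryption aws:kms --ssekms-key-id mykey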
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: concept of ceph and 2 datacenters

2024-02-23 Thread ronny . lippold

hi vladimir,
thanks for answering ... of course, we will build a 3-DC (tiebreaker or full
server) setup.


i'm not sure what to do about "disaster recovery".
is it realistic that a ceph cluster can be completely broken?

kind regards,
ronny

--
Ronny Lippold
System Administrator

--
Spark 5 GmbH
Rheinstr. 97
64295 Darmstadt
Germany
--
Fon: +49-6151-8508-050
Fax: +49-6151-8508-111
Mail: ronny.lipp...@spark5.de
Web: https://www.spark5.de
--
Geschäftsführer: Henning Munte, Michael Mylius
Amtsgericht Darmstadt, HRB 7809
--

Am 2024-02-14 06:59, schrieb Vladimir Sigunov:

Hi Ronny,
This is a good starting point for your design.
https://docs.ceph.com/en/latest/rados/operations/stretch-mode/

My personal experience says that 2 DC Ceph deployment could suffer
from a 'split brain' situation. If you have any chance to create a 3
DC configuration, I would suggest to consider it. It could be more
expensive, but it definitely will be more reliable and fault tolerant.
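
Roughly, the stretch-mode setup from that document boils down to the following
(monitor and site names are placeholders; see the link for the full procedure,
including the CRUSH rule):

ceph mon set_location mon1 datacenter=dc1
ceph mon set_location mon2 datacenter=dc2
ceph mon set_location tiebreaker datacenter=dc3
ceph mon set election_strategy connectivity
ceph mon enable_stretch_mode tiebreaker stretch_rule datacenter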

Sincerely,
Vladimir

Get Outlook for Android

From: ronny.lipp...@spark5.de 
Sent: Tuesday, February 13, 2024 6:50:50 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] concept of ceph and 2 datacenters

hi there,
i have a design/concept question, to see what's out there and which kind
of redundancy you use.

actually, we use 2 ceph clusters with rbd-mirror to have a cold-standby
clone.
but rbd-mirror is not application consistent, so we cannot be sure
that all vms (kvm/proxmox) are running.
we also waste a lot of hardware.

so now we think about one big cluster over the two datacenters (two
buildings).

my question is, do you care about ceph redundancy or is one ceph cluster with
backups enough for you?
i know that with ceph we are covered against hdd or server failure. but are
software failures a real scenario?

would be great to get some ideas from you,
also about the bandwidth between the 2 datacenters.
we are using 2x 6 proxmox servers with 2x6x9 osds (sas ssd).

thanks for the help, my mind is spinning.

kind regards,
ronny


--
Ronny Lippold
System Administrator

--
Spark 5 GmbH
Rheinstr. 97
64295 Darmstadt
Germany
--
Fon: +49-6151-8508-050
Fax: +49-6151-8508-111
Mail: ronny.lipp...@spark5.de
Web: https://www.spark5.de
--
Geschäftsführer: Henning Munte, Michael Mylius
Amtsgericht Darmstadt, HRB 7809
--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph MDS randomly hangs when pg nums reduced

2024-02-23 Thread lokitingyi
Hi,

I have a CephFS cluster
```
> ceph -s

  cluster:
id: e78987f2-ef1c-11ed-897d-cf8c255417f0
health: HEALTH_WARN
85 pgs not deep-scrubbed in time
85 pgs not scrubbed in time

  services:
mon: 5 daemons, quorum 
datastone05,datastone06,datastone07,datastone10,datastone09 (age 2w)
mgr: datastone05.iitngk(active, since 2w), standbys: datastone06.wjppdy
mds: 2/2 daemons up, 1 hot standby
osd: 22 osds: 22 up (since 3d), 22 in (since 4w); 8 remapped pgs

  data:
volumes: 1/1 healthy
pools:   4 pools, 115 pgs
objects: 49.08M objects, 16 TiB
usage:   35 TiB used, 2.0 PiB / 2.1 PiB avail
pgs: 3807933/98160678 objects misplaced (3.879%)
 107 active+clean
 8   active+remapped+backfilling

  io:
client:   224 MiB/s rd, 79 MiB/s wr, 844 op/s rd, 33 op/s wr
recovery: 8.8 MiB/s, 24 objects/s
```

The pool and pg status

```
> ceph osd pool autoscale-status

POOLSIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO 
 EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
cephfs.myfs.meta  28802M2.0 2119T  0.   
   4.0  16  on False
cephfs.myfs.data  16743G2.0 2119T  0.0154   
   1.0  32  on False
rbd  19 2.0 2119T  0.   
   1.0  32  on False
.mgr   3840k2.0 2119T  0.   
   1.0   1  on False
```

The pool detail

```
> ceph osd pool ls detail

pool 1 'cephfs.myfs.meta' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 3639 lfor 
0/3639/3637 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
recovery_priority 5 application cephfs
pool 2 'cephfs.myfs.data' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 66 pgp_num 58 pg_num_target 32 pgp_num_target 32 autoscale_mode 
on last_change 5670 lfor 0/5661/5659 flags hashpspool,selfmanaged_snaps 
stripe_width 0 application cephfs
pool 3 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 486 lfor 0/486/478 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 '.mgr' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 1 pgp_num 1 autoscale_mode on last_change 39 flags hashpspool 
stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
```

When the PG numbers are reduced, the MDS server sometimes hangs.
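
As a side note, a possible way to keep the autoscaler from shrinking pg_num
while this is being debugged (just an idea, not a fix):

ceph osd pool set cephfs.myfs.data pg_autoscale_mode off
# or keep the autoscaler but only let it warn:
ceph osd pool set cephfs.myfs.data pg_autoscale_mode warn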
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] list topic shows endpoint url and username e password

2024-02-23 Thread Giada Malatesta

Hello everyone,

we are facing a problem regarding the topic operations used to send 
notifications, particularly when using the amqp protocol.


We are using Ceph version 18.2.1. We have created a topic, passing all the 
needed information as attributes, including the push-endpoint (in our case 
a RabbitMQ endpoint that is used to collect notification messages). Then 
we configured all the buckets in our Ceph cluster so that notifications 
are sent when changes occur.


The problem regards the list_topic operation in particular: we noticed 
that any authenticated user is able to get a full list of the created 
topics and, with them, all of their information, including the endpoint 
and therefore the username, password, IP and port, when using 
boto3.set_stream_logger(). This is not good for our goal, since we do 
not want users to know implementation details.


Is there a possibility to solve this problem? Any help would be useful.

Thanks and best regards.

GM.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] What exactly does the osd pool repair funtion do?

2024-02-23 Thread Aleksander Pähn
What exactly does the osd pool repair function do?
The documentation is not clear.

Kind regards,
AP


This e-mail may contain information that is privileged or confidential. If you 
are not the intended recipient, please delete the e-mail and any attachments 
and notify us immediately.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG stuck at recovery

2024-02-23 Thread Leon Gao
For us we see this for both EC 3,2 and 3 way replication pools, but all on
HDD. Our SSD usage is very small though.

On Mon, Feb 19, 2024 at 10:18 PM Anthony D'Atri 
wrote:

>
>
> >> After wrangling with this myself, both with 17.2.7 and to an extent
> with 17.2.5, I'd like to follow up here and ask:
> >> Those who have experienced this, were the affected PGs
> >> * Part of an EC pool?
> >> * Part of an HDD pool?
> >> * Both?
> >
> > Both in my case, EC is 4+2 jerasure blaum_roth and the HDD is hybrid
> where DB is on SSD shared by 5 HDD.
> > And in your cases?
>
>
> EC 4,2, HDD-only.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Is a direct Octopus to Reef Upgrade Possible?

2024-02-23 Thread Alex Hussein-Kershaw (HE/HIM)
Hi ceph-users,

I currently use Ceph Octopus to provide CephFS & S3 Storage for our app 
servers, deployed in containers by ceph-ansible. I'm planning to take an 
upgrade to get off Ceph Octopus as it's EOL.

I'd love to go straight to Reef, but I vaguely remember reading a statement that 
an upgrade can only span two major versions. I've failed to find that statement 
again.

Is it possible to go directly from Octopus straight to Reef?

I think a sensible approach here is to first migrate our existing deployments 
to use cephadm, and then use cephadm to upgrade. Any advice on this very 
welcome.
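
If it helps the discussion, my rough understanding of that route is something
like the following per host, with the upgrade target still to be decided
(daemon names and the version are placeholders):

cephadm adopt --style legacy --name mon.host1
cephadm adopt --style legacy --name mgr.host1
cephadm adopt --style legacy --name osd.0

# then drive the upgrade through the orchestrator
ceph orch upgrade start --ceph-version <target-version>
ceph orch upgrade status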

Many thanks,
Alex

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-23 Thread florian . leduc
Hi,
A bit of history might help to understand why we have the cache tier. 

We run openstack on top ceph since many years now (started with mimic, then an 
upgrade to nautilus (years 2 ago) and today and upgrade to pacific). At the 
beginning of the setup, we used to have a mix of hdd+ssd devices in HCI mode 
for openstack nova. After the upgrade to nautilus, we made a hardware refresh 
with brand new NVME devices. And transitionned from mixed devices to nvme. But 
we were never able to evict all the data from the vms_cache pools (even with 
being aggressive with the eviction; the last resort would have been to stop all 
the virtual instances, and that was not an option for our customers), so we 
decided to move on and set cache-mode proxy and serve data with only nvme since 
then. And it's been like this for 1 years and a half. 

But today, after the upgrade, the situation is that we cannot query any stats 
(with ceph pg x.x query), rados query hangs, scrub hangs even though all PGs 
are "active+clean". and there is no client activity reported by the cluster. 
Recovery, and rebalance. Also some other commands hangs, ie: "ceph balancer 
status". 

--
bash-4.2$ ceph -s
  cluster:
id: 
health: HEALTH_WARN
mon is allowing insecure global_id reclaim
noscrub,nodeep-scrub,nosnaptrim flag(s) set
18432 slow ops, oldest one blocked for 7626 sec, daemons 
[osd.0,osd.1,osd.10,osd.11,osd.112,osd.113,osd.118,osd.119,osd.120,osd.122]... 
have slow ops.

  services:
mon: 3 daemons, quorum mon1,mon2,mon3(age 36m)
mgr: bm9612541(active, since 39m)
osd: 72 osds: 72 up (since 97m), 72 in (since 9h)
 flags noscrub,nodeep-scrub,nosnaptrim

  data:
pools:   8 pools, 2409 pgs
objects: 14.64M objects, 92 TiB
usage:   276 TiB used, 143 TiB / 419 TiB avail
pgs: 2409 active+clean
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrub stuck and 'pg has invalid (post-split) stat'

2024-02-23 Thread florian . leduc
Hello Eugen,

We used to have cache tiering (hdd+ssd) for OpenStack Nova/Glance in the past, 
before we moved to NVMe hardware. But we were not able to evict all objects, 
because that required shutting down all virtual instances and then doing the 
eviction. So we decided to set the cache mode to "proxy" and leave it as is 
until we got a time frame to shut down all instances bound to this Ceph cluster 
(but that never happened).

Regards,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Query Regarding Calculating Ingress/Egress Traffic for Buckets via API

2024-02-23 Thread Tobias Urdin
Hello,

You can use the RGW admin API (enabled_apis=admin,….)  and get the usage from 
there.
https://docs.ceph.com/en/latest/radosgw/adminops/
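
A minimal sketch of what that looks like in practice, assuming the usage log is
enabled (it is off by default) and a user named myuser:

# rgw section of the config
rgw enable usage log = true

# per-user traffic (bytes_sent / bytes_received per bucket and operation)
radosgw-admin usage show --uid=myuser --start-date=2024-02-01 --end-date=2024-02-23

# the same data over the admin REST API
GET /admin/usage?uid=myuser&start-date=2024-02-01&end-date=2024-02-23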

Best regards

> On 15 Feb 2024, at 06:48, asad.siddi...@rapidcompute.com wrote:
> 
> Hi,
> 
> I am currently working on Ceph object storage and would like to inquire about 
> how we can calculate the ingress and egress traffic for buckets/tenant via 
> API.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)

2024-02-23 Thread Christian Rohmann

Hey ceph-users,

I just noticed issues with ceph-crash when using the Debian / Ubuntu packages 
(package: ceph-base):

While the /var/lib/ceph/crash/posted folder is created by the package install, 
it's not properly chowned to ceph:ceph by the postinst script.
This might also affect RPM-based installs somehow, but I did not look 
into that.


I opened a bug report with all the details and two ideas to fix this: 
https://tracker.ceph.com/issues/64548



The wrong ownership causes ceph-crash to NOT work at all. I myself 
missed quite a few crash reports. All of them were just sitting around 
on the machines, but were reported right after I did


 chown ceph:ceph /var/lib/ceph/crash/posted
 systemctl restart ceph-crash.service

You might want to check if you might be affected as well.
Failing to post crashes to the local cluster results in them not being 
reported back via telemetry.
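
A quick way to check whether a host is affected (nothing distribution-specific):

 # should print ceph:ceph, not root:root
 stat -c '%U:%G' /var/lib/ceph/crash/posted

 # crashes that were collected locally but never posted will show up after the fix
 ceph crash ls-new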



Regards

Christian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

2024-02-23 Thread Edouard FAZENDA
Dear Eugen,

We have followed the workaround here:
https://tracker.ceph.com/issues/58082#note-11

and the cluster is healthy again; the K8S workloads are back.

# ceph status
  cluster:
id: fcb373ce-7aaa-11eb-984f-e7c6e0038e87
health: HEALTH_OK

  services:
mon: 3 daemons, quorum rke-sh1-2,rke-sh1-1,rke-sh1-3 (age 87m)
mgr: rke-sh1-2.lxmguj(active, since 2h), standbys: rke-sh1-3.ckunvo, 
rke-sh1-1.qskoyj
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 18 osds: 18 up (since 88m), 18 in (since 24h)
rgw: 6 daemons active (3 hosts, 1 zones)

  data:
volumes: 1/1 healthy
pools:   11 pools, 737 pgs
objects: 9.98M objects, 4.8 TiB
usage:   10 TiB used, 16 TiB / 26 TiB avail
pgs: 737 active+clean

  io:
client:   226 MiB/s rd, 208 MiB/s wr, 109 op/s rd, 272 op/s wr

  progress:
Global Recovery Event (17m)
  [] (remaining: 2m)

# ceph fs status

cephfs - 33 clients
==
RANK  STATE MDS   ACTIVITY DNSINOS   
DIRS   CAPS
 0active  cephfs.rke-sh1-1.ojmpnk  Reqs:   51 /s   103k  97.7k  
11.2k  20.7k
0-s   standby-replay  cephfs.rke-sh1-2.isqjza  Evts:   30 /s 0  0  
0  0
  POOL TYPE USED  AVAIL
cephfs_metadata  metadata  49.9G  6685G
  cephfs_data  data8423G  6685G
  STANDBY MDS
cephfs.rke-sh1-3.vdicdn
MDS version: ceph version 16.2.1 (afb9061ab4117f798c858c741efa6390e48ccf10) 
pacific (stable)

# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META 
AVAIL%USE   VAR   PGS  STATUS
 2ssd  1.45549   1.0  1.5 TiB  562 GiB  557 GiB   13 MiB  4.6 GiB  929 
GiB  37.68  0.98   86  up
 4ssd  1.45549   1.0  1.5 TiB  566 GiB  561 GiB  9.1 MiB  4.9 GiB  925 
GiB  37.95  0.98   94  up
 7ssd  1.45549   1.0  1.5 TiB  590 GiB  584 GiB   16 MiB  5.5 GiB  901 
GiB  39.57  1.03   93  up
11ssd  1.45549   1.0  1.5 TiB  563 GiB  558 GiB   15 MiB  4.3 GiB  928 
GiB  37.75  0.98   93  up
14ssd  1.45549   1.0  1.5 TiB  575 GiB  570 GiB   11 MiB  4.8 GiB  916 
GiB  38.56  1.00   97  up
17ssd  1.45549   1.0  1.5 TiB  651 GiB  646 GiB   30 MiB  4.6 GiB  840 
GiB  43.67  1.13   95  up
 0ssd  1.45549   1.0  1.5 TiB  614 GiB  608 GiB   15 MiB  5.3 GiB  877 
GiB  41.18  1.07   98  up
 3ssd  1.45549   1.0  1.5 TiB  673 GiB  668 GiB   20 MiB  4.7 GiB  817 
GiB  45.16  1.17  105  up
 6ssd  1.45549   1.0  1.5 TiB  527 GiB  523 GiB   11 MiB  4.9 GiB  963 
GiB  35.39  0.92   86  up
 9ssd  1.45549   1.0  1.5 TiB  549 GiB  545 GiB   16 MiB  4.2 GiB  942 
GiB  36.83  0.95   88  up
12ssd  1.45549   1.0  1.5 TiB  551 GiB  546 GiB   11 MiB  4.4 GiB  940 
GiB  36.95  0.96   96  up
15ssd  1.45549   1.0  1.5 TiB  594 GiB  589 GiB   16 MiB  4.4 GiB  897 
GiB  39.83  1.03   84  up
 1ssd  1.45549   1.0  1.5 TiB  520 GiB  516 GiB   10 MiB  3.6 GiB  970 
GiB  34.89  0.90   87  up
 5ssd  1.45549   1.0  1.5 TiB  427 GiB  423 GiB  7.9 MiB  4.0 GiB  1.0 
TiB  28.64  0.74   74  up
 8ssd  1.45549   1.0  1.5 TiB  625 GiB  620 GiB   27 MiB  4.7 GiB  866 
GiB  41.92  1.09   97  up
10ssd  1.45549   1.0  1.5 TiB  562 GiB  557 GiB   12 MiB  5.1 GiB  929 
GiB  37.69  0.98   92  up
13ssd  1.45549   1.0  1.5 TiB  673 GiB  668 GiB  7.2 MiB  5.0 GiB  817 
GiB  45.15  1.17  101  up
16ssd  1.45549   1.0  1.5 TiB  534 GiB  530 GiB  5.7 MiB  3.5 GiB  957 
GiB  35.81  0.93   85  up
   TOTAL   26 TiB   10 TiB   10 TiB  254 MiB   82 GiB   16 
TiB  38.59
MIN/MAX VAR: 0.74/1.17  STDDEV: 3.88

Thanks for the help !

Best Regards,

Edouard FAZENDA
Technical Support
 


Chemin du Curé-Desclouds 2, CH-1226 THONEX  +41 (0)22 869 04 40
 
www.csti.ch

-Original Message-
From: Eugen Block  
Sent: vendredi, 23 février 2024 15:05
To: Edouard FAZENDA 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS in ReadOnly and 2 MDS behind on trimming

> 2024-02-23T08:15:13.155+ 7fbc145d2700 -1 log_channel(cluster) log 
> [ERR] : failed to commit dir 0x1 object, errno -22
> 2024-02-23T08:15:13.155+ 7fbc145d2700 -1 mds.0.12487 unhandled 
> write error (22) Invalid argument, force readonly...

Was your cephfs metadata pool full? This tracker
(https://tracker.ceph.com/issues/52260) sounds very similar but I don't see a 
solution for it.
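If you haven't checked yet, something like this (standard commands; adjust the
pool name if yours differs) should show whether the metadata pool or its OSDs
ran full at some point:

# per-pool usage and MAX AVAIL
ceph df detail

# any quota configured on the metadata pool?
ceph osd pool get-quota cephfs_metadata

# any full / nearfull warnings still present?
ceph health detail | grep -i full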


Zitat von Edouard FAZENDA :

> Hi Eugen,
>
> Thanks for the reply, really appreciated.
>
> The first command just hangs with no output:
> # cephfs-journal-tool --rank=cephfs:0 --journal=mdlog journal inspect
>
> The second command:
>
> # cephfs-journal-tool --rank=cephfs:0 --journal=purge_queue journal inspect
> Overall journal integrity: OK
>
> root@rke-sh1-2:~# cephadm logs --fsid
> fcb373ce-7aaa-11eb-984f-e7c6e0038e87 --name 
> mds.cephfs.rke-sh1-2.isqjza
> -- Logs begin at Fri 2024-02-23 04:49:32 UTC, en

[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread nguyenvandiep
Thank you for your time :) Have a good day, sir
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread nguyenvandiep
Hi David,

Could you pls help me understand:

Does it affect the RGW service? And if something goes bad, how can I roll back?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: list topic shows endpoint url and username e password

2024-02-23 Thread Casey Bodley
thanks Giada, i see that you created
https://tracker.ceph.com/issues/64547 for this

unfortunately, this topic metadata doesn't really have a permission
model at all. topics are shared across the entire tenant, and all
users have access to read/overwrite those topics
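
for example (the RGW endpoint URL below is just a placeholder), any
authenticated user of the same tenant can run something like this and see the
raw topic metadata in the debug output, push-endpoint and any embedded
credentials included:

# list all topics of the tenant; --debug prints the raw RGW response, which
# carries each topic's push-endpoint (e.g. amqp://user:password@host:5672)
aws --debug --endpoint-url http://rgw.example.com:8000 sns list-topics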

a lot of work was done for https://tracker.ceph.com/issues/62727 to
add topic ownership and permission policy, and those changes will be
in the squid release

i've cc'ed Yuval and Krunal who worked on that - could these changes
be reasonably backported to quincy and reef?

On Fri, Feb 23, 2024 at 9:59 AM Giada Malatesta
 wrote:
>
> Hello everyone,
>
> we are facing a problem with the topic operations used to send
> notifications, particularly when using the amqp protocol.
>
> We are using Ceph version 18.2.1. We created a topic by passing all the
> needed information as attributes, including the push-endpoint (in our case
> a RabbitMQ endpoint used to collect the notification messages). Then we
> configured all the buckets in our Ceph cluster so that a notification is
> sent when some change occurs.
>
> The problem particularly concerns the list_topic operation: we noticed
> that any authenticated user is able to get the full list of the created
> topics and, with them, all of their information, including the endpoint
> (and therefore the username, password, IP and port) when using
> boto3.set_stream_logger(). This is not good for our goal, since we do
> not want users to know these implementation details.
>
> Is there a possibility to solve this problem? Any help would be useful.
>
> Thanks and best regards.
>
> GM.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread nguyenvandiep
And we don't have a parameters folder:

cd /sys/module/ceph/
[root@cephgw01 ceph]# ls
coresize  holders  initsize  initstate  notes  refcnt  rhelversion  sections  
srcversion  taint  uevent

My Ceph is 16.2.4
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG stuck at recovery

2024-02-23 Thread Curt
EC 2+2 & 4+2, HDD only.

On Tue, 20 Feb 2024, 00:25 Anthony D'Atri,  wrote:

> After wrangling with this myself, both with 17.2.7 and to an extent with
> 17.2.5, I'd like to follow up here and ask:
>
> Those who have experienced this, were the affected PGs
>
> * Part of an EC pool?
> * Part of an HDD pool?
> * Both?
>
>
> >
> > You don't say anything about the Ceph version you are running.
> > I had a similar issue with 17.2.7, and it seems to be an issue with mclock;
> > when I switched to wpq, everything worked again.
> >
> > You can read more about it here:
> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG
> >
> > - Kai Stian Olstad
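
(Side note for anyone hitting this later: the mclock -> wpq switch mentioned
above should just be the osd_op_queue option, roughly along these lines; the
OSDs need a restart to pick it up.)

# check which scheduler the OSDs are currently using
ceph config get osd osd_op_queue

# switch to wpq cluster-wide, then restart the OSD daemons
# (ceph orch restart or systemctl, depending on the deployment)
ceph config set osd osd_op_queue wpq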
> >
> >
> > On Tue, Feb 06, 2024 at 06:35:26AM -, LeonGao  wrote:
> >> Hi community
> >>
> >> We have a new Ceph cluster deployment with 100 nodes. When we are
> >> draining an OSD host from the cluster, we see a small number of PGs that
> >> cannot make any progress to the end. From the logs and metrics, it seems
> >> like the recovery progress is stuck (0 recovery ops for several days).
> >> We would like to get some ideas on this. Re-peering and OSD restarts do
> >> mitigate the issue, but we want to get to the root cause of it, as
> >> draining and recovery happen frequently.
> >>
> >> I have put some debugging information below. Any help is appreciated,
> >> thanks!
> >>
> >> ceph -s
> >>   pgs: 4210926/7380034104 objects misplaced (0.057%)
> >>41198 active+clean
> >>71active+remapped+backfilling
> >>12active+recovering
> >>
> >> One of the stuck PG:
> >> 6.38f1   active+remapped+backfilling   [313,643,727]   313   [313,643,717]   313
> >>
> >> PG query result:
> >>
> >> ceph pg 6.38f1 query
> >> {
> >>   "snap_trimq": "[]",
> >>   "snap_trimq_len": 0,
> >>   "state": "active+remapped+backfilling",
> >>   "epoch": 246856,
> >>   "up": [
> >>   313,
> >>   643,
> >>   727
> >>   ],
> >>   "acting": [
> >>   313,
> >>   643,
> >>   717
> >>   ],
> >>   "backfill_targets": [
> >>   "727"
> >>   ],
> >>   "acting_recovery_backfill": [
> >>   "313",
> >>   "643",
> >>   "717",
> >>   "727"
> >>   ],
> >>   "info": {
> >>   "pgid": "6.38f1",
> >>   "last_update": "212333'38916",
> >>   "last_complete": "212333'38916",
> >>   "log_tail": "80608'37589",
> >>   "last_user_version": 38833,
> >>   "last_backfill": "MAX",
> >>   "purged_snaps": [],
> >>   "history": {
> >>   "epoch_created": 3726,
> >>   "epoch_pool_created": 3279,
> >>   "last_epoch_started": 243987,
> >>   "last_interval_started": 243986,
> >>   "last_epoch_clean": 220174,
> >>   "last_interval_clean": 220173,
> >>   "last_epoch_split": 3726,
> >>   "last_epoch_marked_full": 0,
> >>   "same_up_since": 238347,
> >>   "same_interval_since": 243986,
> >>   "same_primary_since": 3728,
> >>   "last_scrub": "212333'38916",
> >>   "last_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >>   "last_deep_scrub": "212333'38916",
> >>   "last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+",
> >>   "last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >>   "prior_readable_until_ub": 0
> >>   },
> >>   "stats": {
> >>   "version": "212333'38916",
> >>   "reported_seq": 413425,
> >>   "reported_epoch": 246856,
> >>   "state": "active+remapped+backfilling",
> >>   "last_fresh": "2024-02-05T21:14:40.838785+",
> >>   "last_change": "2024-02-03T22:33:43.052272+",
> >>   "last_active": "2024-02-05T21:14:40.838785+",
> >>   "last_peered": "2024-02-05T21:14:40.838785+",
> >>   "last_clean": "2024-02-03T04:26:35.168232+",
> >>   "last_became_active": "2024-02-03T22:31:16.037823+",
> >>   "last_became_peered": "2024-02-03T22:31:16.037823+",
> >>   "last_unstale": "2024-02-05T21:14:40.838785+",
> >>   "last_undegraded": "2024-02-05T21:14:40.838785+",
> >>   "last_fullsized": "2024-02-05T21:14:40.838785+",
> >>   "mapping_epoch": 243986,
> >>   "log_start": "80608'37589",
> >>   "ondisk_log_start": "80608'37589",
> >>   "created": 3726,
> >>   "last_epoch_clean": 220174,
> >>   "parent": "0.0",
> >>   "parent_split_bits": 14,
> >>   "last_scrub": "212333'38916",
> >>   "last_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >>   "last_deep_scrub": "212333'38916",
> >>   "last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+",
> >>   "last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >>   "objects_scrubbed": 17743,
> >>   "log_size": 1327,
> >>   "log_dups_size": 3000,
> >>   

[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread David C.
look at ALL cephfs kernel clients (no effect on RGW)
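
to list them, something like this should do (the MDS name is a placeholder;
take it from ceph fs status):

# on each client host: cephfs kernel mounts show up with fstype "ceph"
grep -w ceph /proc/mounts

# from the cluster side: sessions known to the MDS (kernel and fuse clients)
ceph tell mds.<mds-name> session ls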

Le ven. 23 févr. 2024 à 16:38,  a écrit :

> And we  dont have parameter folder
>
> cd /sys/module/ceph/
> [root@cephgw01 ceph]# ls
> coresize  holders  initsize  initstate  notes  refcnt  rhelversion
> sections  srcversion  taint  uevent
>
> My Ceph is 16.2.4
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Scrubs Randomly Starting/Stopping

2024-02-23 Thread ashley
Have just upgraded a cluster from 17.2.7 to 18.2.1

Everything is working as expected, apart from the number of scrubs & deep scrubs
bouncing all over the place every second.

I have the value set to 1 per OSD, but the cluster reckons one minute it’s doing
60+ scrubs, and the next second this drops to 40, then back to 70.
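
For reference, I’m checking that value roughly like this (in case the upgrade
changed something under the hood):

# value held in the cluster configuration
ceph config get osd osd_max_scrubs

# effective value on a running OSD
ceph config show osd.0 osd_max_scrubs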

If I check the live ceph logs I can see that every second it’s reporting multiple
PGs starting either a scrub or a deep scrub; it does not look like these are
actually running, as it isn’t having a negative effect on the cluster’s performance.

Is this something to be expected off the back of the upgrade, and should it sort
itself out?

A sample of the logs:

2024-02-24T00:41:20.055401+ osd.54 (osd.54) 3160 : cluster 0 12.9a 
deep-scrub starts
2024-02-24T00:41:19.658144+ osd.41 (osd.41) 4103 : cluster 0 12.cd 
deep-scrub starts
2024-02-24T00:41:19.823910+ osd.33 (osd.33) 5625 : cluster 0 12.ae 
deep-scrub starts
2024-02-24T00:41:19.846736+ osd.65 (osd.65) 3947 : cluster 0 12.53 
deep-scrub starts
2024-02-24T00:41:20.007331+ osd.20 (osd.20) 7214 : cluster 0 12.142 scrub 
starts
2024-02-24T00:41:20.114748+ osd.10 (osd.10) 6538 : cluster 0 12.2c 
deep-scrub starts
2024-02-24T00:41:20.247205+ osd.36 (osd.36) 4789 : cluster 0 12.16f 
deep-scrub starts
2024-02-24T00:41:20.908051+ osd.68 (osd.68) 3869 : cluster 0 12.d7 
deep-scrub starts
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread nguyenvandiep
Could you pls guide me in more detail :( I'm very new to Ceph :(
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread Matthew Leonard (BLOOMBERG/ 120 PARK)
Can you send the output of sudo ceph -s and sudo ceph health detail?

Sent from Bloomberg Professional for iPhone

- Original Message -
From: nguyenvand...@baoviet.com.vn
To: ceph-users@ceph.io
At: 02/23/24 20:27:53 UTC-05:00


Could you pls guide me in more detail :( I'm very new to Ceph :(
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io