[ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker

2024-05-06 Thread Marc

> Hello! Any news?
>

Yes, it will be around 18° today, Israel was heckled at EU song contest ..


[ceph-users] Re: Unable to add new OSDs

2024-05-06 Thread Michael Baer


Thanks for the help!

I wanted to give an update on the resolution of the issues I was
having. I didn't realize that I had created several competing OSD
specifications via the dashboard. By cleaning those up, OSD creation is now
working as expected.
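
In case it helps anyone who hits the same thing, here is a rough sketch of the
cleanup we did (the spec name and file name below are just examples, yours will
differ):

ceph orch ls osd --export                   # list all OSD specs and what they match
ceph orch rm osd.dashboard-admin-1234567890 # remove each competing/unwanted spec
ceph orch apply -i osd-spec.yaml            # re-apply the one spec you actually want

And roughly what we want that single spec to look like, so each HDD gets its DB
on the SSD (service_id and host_pattern are made up):

service_type: osd
service_id: hdd-osds-with-ssd-db
placement:
  host_pattern: 'stor*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0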

-Mike

> On Tue, 23 Apr 2024 00:06:19 -, c...@mikesoffice.com said:

c> I'm trying to add a new storage host into a Ceph cluster (quincy
c> 17.2.6). The machine has boot drives, one free SSD and 10 HDDs. The
c> plan is to have each HDD be an OSD with a DB on an equal-size LV of
c> the SSD. This machine is newer but otherwise similar to other machines
c> already in the cluster that are set up and running the same way. But
c> I've been unable to add OSDs and unable to figure out why, or fix
c> it. I have some experience, but I'm not an expert and could be missing
c> something obvious. If anyone has any suggestions, I would appreciate
c> it.

c> I've tried to add OSDs a couple different ways.

c> Via the dashboard, this has worked fine for previous machines. And it
c> appears to succeed and gives no errors that I can find looking in
c> /var/log/ceph and dashboard logs. The OSDs are never created. In fact,
c> the drives still show up as available in Physical Disks and I can do
c> the same creation procedure repeatedly.

c> I've tried creating it in cephadm shell with the following, which has
c> also worked in the past:
c> ceph orch daemon add osd
c> stor04.fqdn:data_devices=/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi,/dev/sdj,/dev/sdk,db_devices=/dev/sda,osds_per_device=1
c> The command just hangs. Again I wasn't able to find any obvious
c> errors. Although this one did seem to cause some slow op errors from
c> the monitors that required restarting a monitor. And it could cause
c> errors with the dashboard locking up and having to restart the manager
c> as well.

c> And I've tried setting 'ceph orch apply osd --all-available-devices
c> --unmanaged=false' to let Ceph automatically add the drives. In the
c> past, this would cause Ceph to automatically add the drives as OSDs
c> but without having associated DBs on the SSD. The SSD would just be
c> another OSD. This time it appears to have no effect and, similar to the
c> above, I wasn't able to find any obvious error feedback.

c> -Mike


-- 
Michael Baer
c...@mikesoffice.com


[ceph-users] Re: radosgw sync non-existent bucket ceph reef 18.2.2

2024-05-06 Thread Konstantin Larin
Hello Christopher,

We had something similar on Pacific multi-site.
The problem was in leftover bucket metadata in our case, and was solved
by "radosgw-admin metadata list ..." and "radosgw-admin metadata rm
..." on master, for a non-existent bucket.
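
Roughly the sequence we used, in case it helps (BUCKET_NAME and INSTANCE_ID are
placeholders; double-check with "metadata get" before removing anything):

radosgw-admin metadata list bucket | grep BUCKET_NAME
radosgw-admin metadata list bucket.instance | grep BUCKET_NAME
radosgw-admin metadata get bucket:BUCKET_NAME
radosgw-admin metadata rm bucket:BUCKET_NAME
radosgw-admin metadata rm bucket.instance:BUCKET_NAME:INSTANCE_ID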

Best regards,
Konstantin

On Tue, 2024-04-30 at 21:42 +, Christopher Durham wrote:
> 
> Hi,
> I have a reef cluster 18.2.2 on Rocky 8.9. This cluster has been
> upgraded from pacific->quincy->reef over the past few years. It is a
> multi site with one other cluster that works fine with s3/radosgw on
> both sides, with proper bidirectional data replication.
> On one of the master cluster's radosgw logs, I noticed a sync request
> regarding a deleted bucket. I am not sure when this error started,
> but I know that the bucket in question was deleted a long time before
> the upgrade to reef. Perhaps this error existed prior to reef, I do
> not know. Here is the error in the radosgw log:
> :get_bucket_index_log_status ERROR:
> rgw_read_bucket_full_sync_status() on pipe{s={b=BUCKET_NAME:CLUSTERID
> ..., z=, az= ...},d={b=..,az=...}} returned ret=-2
> My understanding:
> s=source, d=destination, each of which is a tuple with the
> appropriate info necessary
> 
> This happens for BUCKET_NAME every few minutes. Said bucket does not
> exist on either side of the multisite, but did in the past.
> Any way I can force radosgw to stop trying to replicate?
> Thanks
> -Chris


[ceph-users] Off-Site monitor node over VPN

2024-05-06 Thread Stefan Pinter
Hi!

I hope someone can help us out here :)

We need to move from 3 datacenters to 2 datacenters (+ 1 small server room 
reachable via layer-3 VPN).

Right now we have a ceph-mon in each datacenter, which is fine. But we have to 
move and will only have 2 datacenters in the future (they are connected, so 
devices that are separated geographically can still communicate over the VLANs 
they are used to).

In order to still have 3 monitors - and so keep quorum - the idea is to move 
only 1 ceph-mon into a small separate datacenter. We can only connect it via 
VPN, though, and it would need a separate IP network; all traffic would pass 
through a firewall and over the VPN.

Having a separate IP network should work, as mentioned here:

2.1.1.2 Monitoring nodes on different subnets

https://documentation.suse.com/ses/7/html/ses-all/storage-bp-hwreq.html

What could possibly go wrong here? Could the added latency be a problem? Which 
devices would need to be allowed to connect over the VPN?

All OSDs <-- VPN --> ceph-mon
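
As far as we understand it, the setup would look roughly like this (subnets are
made up, ports are the Ceph defaults - please correct us if this is wrong):

# let the cluster accept a public network spanning both subnets
# (main DCs + the VPN-side server room)
ceph config set global public_network 10.0.0.0/24,192.168.50.0/24
# the firewall/VPN would need to allow at least:
#   all mons, mgrs, OSDs, MDSs and clients -> remote ceph-mon: TCP 3300 (msgr2) and 6789 (msgr1)
#   remote ceph-mon -> the other two ceph-mons: TCP 3300 and 6789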

thank you for any input!


Kind regards

Stefan


BearingPoint GmbH
Sitz: Wien
Firmenbuchgericht: Handelsgericht Wien
Firmenbuchnummer: FN 175524z



[ceph-users] Luminous OSDs failing with FAILED assert(clone_size.count(clone))

2024-05-06 Thread sergio . rabellino
Dear Ceph users,
 I'm pretty new to this list, but I've been using Ceph with satisfaction since 
2020. Over the years I resolved the problems I faced by consulting the list 
archive, but now we're stuck with a problem that seems to have no answer.
After a power failure, we have a bunch of OSDs that go down during 
rebalance/backfilling with this error:

/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: In function 'uint64_t 
SnapSet::get_clone_bytes(snapid_t) const' thread 7fdcb2523700 time 2024-05-02 
17:18:40.680350
/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: 5084: FAILED 
assert(clone_size.count(clone))

 ceph version 13.2.9 (58a2a9b31fd08d8bb3089fce0e312331502ff945) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14e) [0x7fdcd38f63ee]
 2: (()+0x287577) [0x7fdcd38f6577]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0x125) [0x555e697c2725]
 4: 
(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr, 
pg_stat_t*)+0x2c8) [0x555e696d8208]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, 
bool*)+0x1169) [0x555e6973f749]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, 
unsigned long*)+0x1018) [0x555e69743b98]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, 
ThreadPool::TPHandle&)+0x36a) [0x555e695b07da]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x19) [0x555e69813c99]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x52d) 
[0x555e695b220d]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) 
[0x7fdcd38fc516]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fdcd38fd6d0]
 12: (()+0x76db) [0x7fdcd23e46db]
 13: (clone()+0x3f) [0x7fdcd13ad61f]

 -6171> 2024-05-02 17:18:40.680 7fdcb2523700 -1 *** Caught signal (Aborted) **
 in thread 7fdcb2523700 thread_name:tp_osd_tp

And we're unable to understand what's happening. Yes, we're still on Luminous, 
but we plan to upgrade to Pacific in June, and before upgrading I believe it's 
important to have a clean health check.
The pools in error are EC pools.
Any hints?


[ceph-users] CLT meeting notes May 6th 2024

2024-05-06 Thread Adam King
   - DigitalOcean credits
      - things to ask
         - what would promotional material require
         - how much are credits worth
      - Neha to ask
   - 19.1.0 centos9 container status
      - close to being ready
      - will be building centos 8 and 9 containers simultaneously
      - should test on orch and upgrade suites before publishing RC
      - should first RC be tested on LRC?
         - skip first RC on LRC
      - performance differences to 18.2.2 with cephfs being investigated
   - 18.2.3
      - fix for https://tracker.ceph.com/issues/65733 almost ready
      - will upgrade LRC to version with fix
      - will happen before 19.1.0 most likely


[ceph-users] Re: Luminous OSDs failing with FAILED assert(clone_size.count(clone))

2024-05-06 Thread Rabellino Sergio
I'm sorry, I made a little mistake: our release is mimic, as stated in the 
logged error, and all the Ceph components are aligned to mimic.



On 06/05/2024 10:04, sergio.rabell...@unito.it wrote:

Dear Ceph users,
[...]


--
ing. Sergio Rabellino

Università degli Studi di Torino
Dipartimento di Informatica
Tecnico di Ricerca
Tel +39-0116706701 Fax +39-011751603
C.so Svizzera , 185 - 10149 - Torino





[ceph-users] MDS 17.2.7 crashes at rejoin

2024-05-06 Thread Robert Sander

Hi,

a 17.2.7 cluster with two filesystems suddenly has non-working MDSs:

# ceph -s
  cluster:
id: f54eea86-265a-11eb-a5d0-457857ba5742
health: HEALTH_ERR
22 failed cephadm daemon(s)
2 filesystems are degraded
1 mds daemon damaged
insufficient standby MDS daemons available
 
  services:

mon: 5 daemons, quorum ceph00,ceph03,ceph04,ceph01,ceph02 (age 4h)
mgr: ceph03.odfupq(active, since 4h), standbys: ppc721.vsincn, 
ceph00.lvbddp, ceph02.zhyxjg, ceph06.eifppc
mds: 4/5 daemons up
osd: 145 osds: 145 up (since 20h), 145 in (since 2d)
rgw: 12 daemons active (4 hosts, 1 zones)
 
  data:

volumes: 0/2 healthy, 2 recovering; 1 damaged
pools:   15 pools, 4897 pgs
objects: 195.64M objects, 195 TiB
usage:   617 TiB used, 527 TiB / 1.1 PiB avail
pgs: 4892 active+clean
 5    active+clean+scrubbing+deep
 
  io:

client:   2.5 MiB/s rd, 20 MiB/s wr, 665 op/s rd, 938 op/s wr

# ceph fs status
ABC - 4 clients
===
RANK   STATE     MDS                 ACTIVITY   DNS    INOS   DIRS   CAPS
 0     failed
 1     resolve   ABC.ceph04.lzlkdu                0      3      1      0
 2     resolve   ABC.ppc721.rzfmyi                0      3      1      0
 3     resolve   ABC.ceph04.jiepaw              249    252     13      0
  POOL TYPE USED  AVAIL
cephfs.ABC.meta  metadata  33.0G   104T
cephfs.ABC.datadata 390T   104T
DEF - 154 clients
===
RANK   STATE           MDS                 ACTIVITY   DNS     INOS    DIRS   CAPS
 0     rejoin(laggy)   DEF.ceph06.etthum             30.9k   30.8k    5084      0
  POOL TYPE USED  AVAIL
cephfs.DEF.meta  metadata   190G   104T
cephfs.DEF.datadata 118T   104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) 
quincy (stable)


The first filesystem will not get an MDS in rank 0;
we already tried to set max_mds to 1, but to no avail.
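
For reference, what we tried was along these lines (ABC being the first filesystem):

ceph fs set ABC max_mds 1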

The second filesystem's MDS shows "replay" for a while and then
it crashes in the rejoin phase with:

  -92> 2024-05-06T16:07:15.514+ 7f1927e9d700  1 mds.0.501522 handle_mds_map 
i am now mds.0.501522
   -91> 2024-05-06T16:07:15.514+ 7f1927e9d700  1 mds.0.501522 handle_mds_map 
state change up:reconnect --> up:rejoin
   -90> 2024-05-06T16:07:15.514+ 7f1927e9d700  1 mds.0.501522 rejoin_start
   -89> 2024-05-06T16:07:15.514+ 7f1927e9d700  1 mds.0.501522 
rejoin_joint_start
   -88> 2024-05-06T16:07:15.514+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x35bfece err -22/0
   -87> 2024-05-06T16:07:15.514+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x3671eb5 err -22/0
   -86> 2024-05-06T16:07:15.514+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x35bfed3 err -22/0
   -85> 2024-05-06T16:07:15.514+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc94c err -22/0
   -84> 2024-05-06T16:07:15.514+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x35b0274 err -22/0
   -83> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x3671eb5 err -22/0
   -82> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc94c err -22/0
   -81> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x3671ebd err -22/-22
   -80> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x3671ecd err -22/-22
   -79> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc9ea err -22/-22
   -78> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x35bfed3 err -22/0
   -77> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc9c3 err -22/-22
   -76> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc978 err -22/-22
   -75> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc99d err -22/-22
   -74> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc95b err -22/-22
   -73> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc980 err -22/-22
   -72> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x35b0274 err -22/0
   -71> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x20001dc7a7e err -22/-22
   -70> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012be364 err -22/-22
   -69> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x35b2e32 err -22/-22
   -68> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x3671eb5 err -22/0
   -67> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  failed to 
open ino 0x200012bc94c err -22/0
   -66> 2024-05-06T16:07:15.518+ 7f1921e91700  0 mds.0.cache  faile

[ceph-users] Reef: Dashboard: Object Gateway Graphs have no Data

2024-05-06 Thread Dave Hall
Hello.

We're running a containerized deployment of Reef with a focus on RGW.  We
noticed that while the Grafana graphs for other categories - OSDs, Pools,
etc - have data, the graphs for the Object Gateway category are empty.

I did some looking last week and found a reference to something about an
unintended limit on the number of graphs that could be rendered.  However,
I did not find reference to a fix.  Is it a setting that needs to be
changed, or is it a piece of code that is pending for the next fix
release?

A second - possibly related question:  There is an informational message on
the Object Gateway panel about how the default realm is not set.  We're not
using multi-site sync, so we haven't set the realm, but we're wondering if
we should set a default realm just to keep the code happy.  We're also
wondering if this is related in any way to the empty graphs described above.
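
If it matters for the answer: what we were considering, but have not yet run
because we are unsure about side effects on a single-site setup, is roughly the
following (the realm name is hypothetical):

radosgw-admin realm list
radosgw-admin realm create --rgw-realm=default-realm --default
# or, if a realm already exists but just isn't marked default:
radosgw-admin realm default --rgw-realm=default-realm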

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


[ceph-users] Re: MDS 17.2.7 crashes at rejoin

2024-05-06 Thread Xiubo Li

This is a known issue, please see https://tracker.ceph.com/issues/60986.

If you can reproduce it, please enable the MDS debug logs; that will help 
debug it faster:


debug_mds = 25

debug_ms = 1
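
On a cephadm-managed cluster these can be applied cluster-wide with something
like the following (and reverted once the logs are collected):

ceph config set mds debug_mds 25
ceph config set mds debug_ms 1
# reproduce the crash, collect the MDS logs (journald or /var/log/ceph/<fsid>/,
# depending on how logging is configured), then revert:
ceph config rm mds debug_mds
ceph config rm mds debug_ms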

Thanks

- Xiubo



On 5/7/24 00:26, Robert Sander wrote:

Hi,

a 17.2.7 cluster with two filesystems suddenly has non-working MDSs:
[...]

[ceph-users] Re: MDS crashes shortly after starting

2024-05-06 Thread Xiubo Li
This is the same issue as https://tracker.ceph.com/issues/60986, the one Robert 
Sander also reported.


On 5/6/24 05:11, E Taka wrote:

Hi all,

we have a serious problem with CephFS. A few days ago, the CephFS file
systems became inaccessible, with the message MDS_DAMAGE: 1 mds daemon
damaged

The cephfs-journal-tool tells us: "Overall journal integrity: OK"
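
For reference, the check we ran was roughly this (with our filesystem name and rank 0):

cephfs-journal-tool --rank=cephfs:0 journal inspect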

The usual attempts with redeploy were unfortunately not successful.

After many attempts to get anywhere with the orchestrator, we marked the MDS as
"failed" and forced the creation of a new MDS with "ceph fs reset".

But this MDS crashes:
ceph-17.2.7/src/mds/MDCache.cc: In function 'void
MDCache::rejoin_send_rejoins()'
ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)

(The full trace is attached).

What can we do now? We are grateful for any help!




[ceph-users] Re: MDS 17.2.7 crashes at rejoin

2024-05-06 Thread Robert Sander
Hi,

would an update to 18.2 help?

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin


[ceph-users] Re: MDS 17.2.7 crashes at rejoin

2024-05-06 Thread Xiubo Li

Possibly, because we have seen this only in ceph 17.

And if you can reproduce it, please provide the MDS debug logs; with those we 
can quickly find the root cause.


Thanks

- Xiubo


On 5/7/24 12:19, Robert Sander wrote:

Hi,

would an update to 18.2 help?

Regards



[ceph-users] Re: Mysterious Space-Eating Monster

2024-05-06 Thread duluxoz

Thanks Sake,

That recovered just under 4 Gig of space for us

Sorry about the delay getting back to you (been *really* busy) :-)

Cheers

Dulux-Oz