[ceph-users] Re: What to expect on rejoining a host to cluster?

2022-11-28 Thread Eneko Lacunza

Hi Matt,

Also, make sure the rejoining host has the correct time. I have seen 
clusters go down when rejoining hosts that were down for maintenance 
for several weeks and came back with datetime deltas of some months (no 
idea why that happened, I arrived with the firefighter team ;-) )
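
A quick sanity check on the host before starting its daemons again could be 
something like this (assuming chrony and systemd, which may not match your setup):

  timedatectl status      # "System clock synchronized" should say yes
  chronyc tracking        # offset against the NTP sources (chrony only)
  date -u                 # eyeball UTC against another cluster node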


Cheers

El 27/11/22 a las 13:27, Frank Schilder escribió:

Hi Matt,

if you didn't touch the OSDs on that host, they will join and only objects that 
have been modified will actually be updated. Ceph keeps some basic history 
information and can detect changes. 2 weeks is not a very long time. If you 
have a lot of cold data, re-integration will go fast.

Initially, you will see a huge number of misplaced objects. However, this count 
will go down much faster than the objects/s recovery rate would suggest.

Before you rejoin the host, I would fix its issues though. Now that you have it 
out of the cluster, do the maintenance first. There is no rush. In fact, you 
can buy a new host, install the OSDs in the new one and join that to the 
cluster with the host-name of the old host.

If you are considering replacing the host and all disks, then get a new host first and 
give it the old host's name in the CRUSH map. Just before you deploy the new host, 
simply purge all down OSDs in its bucket (with norebalance set) and deploy. The 
data movement is then restricted to rebalancing onto the new host.

If you just want to throw out the old host, destroy the OSDs but keep the IDs 
intact (ceph osd destroy). Then, no further re-balancing will happen and you 
can re-use the OSD ids later when adding a new host. That's a stable situation 
from an operations point of view.
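
In command form, the two variants above would look roughly like this (the OSD ids 
are placeholders):

  # Variant A: purge the down OSDs just before deploying the replacement host
  ceph osd set norebalance
  ceph osd purge 12 --yes-i-really-mean-it    # repeat for each down OSD in the host bucket
  # deploy the new host under the old host name, then:
  ceph osd unset norebalance

  # Variant B: keep the OSD ids reserved for later re-use
  ceph osd destroy 12 --yes-i-really-mean-it  # keeps the id and CRUSH entry intact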

Hope that helps.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Matt Larson
Sent: 26 November 2022 21:07:41
To: ceph-users
Subject: [ceph-users] What to expect on rejoining a host to cluster?

Hi all,

  I have had a host with 16 OSDs, each 14 TB in capacity, that started having
hardware issues causing it to crash.  I took this host down 2 weeks ago,
and the data rebalanced to the remaining 11 server hosts in the Ceph
cluster over this time period.

  My initial goal was then to remove the host completely from the cluster
with `ceph osd rm XX` and `ceph osd purge XX` (Adding/Removing OSDs — Ceph
Documentation).
However, I found that even after the large amount of data migration from the
recovery, purging the OSDs and removing them from the CRUSH map still
required another large data move.  It appears it would have been a better
strategy to assign a weight of 0 to each OSD so there is only a single larger
data move instead of two.
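
Something like this, I suppose (sketch; the OSD id is a placeholder):

  ceph osd crush reweight osd.42 0            # single data move: drain the OSD
  # wait for 'ceph osd df' to show it empty, stop the daemon, then:
  ceph osd out 42
  ceph osd purge 42 --yes-i-really-mean-it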

  I'd like to join the downed server back into the Ceph cluster.  It still
has 14 OSDs that are listed as out/down that would be brought back online.
My question is what can I expect if I bring this host online?  Will the
OSDs of a host that has been offline for an extended period of time and out
of the cluster have PGs that are now quite different or inconsistent?  Will
this be problematic?

  Thanks for any advice,
Matt

--
Matt Larson, PhD
Madison, WI  53705 U.S.A.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

2022-11-28 Thread Eugen Block

Hi,

seems like this tracker issue [1] already covers your question. I'll  
update the issue and add a link to our thread.


[1] https://tracker.ceph.com/issues/57767
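
Until the fix lands, a defensive pre-check before zapping avoids the accident 
(rough sketch; the device path is a placeholder):

  ceph-volume lvm list /dev/sdX    # which OSD, if any, is backed by this device?
  ceph osd tree down               # is that OSD really down (and ideally out)?
  ceph-volume lvm zap /dev/sdX     # only zap once both checks agree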


Zitat von Frank Schilder :


Hi Eugen,

can you confirm that the silent corruption also happens with a  
collocated OSD (everything on the same device) on Pacific? The zap  
command should simply exit with "osd not down+out" or at least not  
do anything.


If this accidentally destructive behaviour is still present, I think  
it is worth a ticket. Since I can't test on versions higher than  
octopus yet, could you then open the ticket?


Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 23 November 2022 09:27:22
To: ceph-users@ceph.io
Subject: [ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

Hi,

I can confirm the behavior for Pacific version 16.2.7. I checked with
a Nautilus test cluster and there it seems to work as expected. I
tried to zap a db device and then restarted one of the OSDs,
successfully. So there seems to be a regression somewhere. I didn't
search for tracker issues yet, but this seems to be worth one, right?

Zitat von Frank Schilder :


Hi all,

on our octopus-latest cluster I accidentally destroyed an up+in OSD
with the command line

  ceph-volume lvm zap /dev/DEV

It executed the dd command and then failed at the lvm commands with
"device busy". Problem number one is that the OSD continued working
fine; hence there is no indication of corruption - it is a silent
corruption. Problem number two - the real one - is: why does
ceph-volume not check whether the OSD that the device belongs to is still
up+in? "ceph osd destroy" does that, for example. I seem to
remember that "ceph-volume lvm zap --osd-id" also checks, but I'm
not sure.

Has this been changed in versions later than octopus?

I think it is extremely dangerous to provide a tool that allows the
silent corruption of an entire Ceph cluster. The corruption is only
discovered on restart, and then it is too late (unless there is
an unofficial recovery procedure somewhere).

I would prefer that ceph-volume lvm zap employ the same strict
sanity checks as other ceph commands to avoid accidents. In my case
it was a typo, one wrong letter.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

2022-11-28 Thread Frank Schilder
Thanks, also for finding the related tracker issue! It looks like a fix has 
already been approved. Hope it shows up in the next release.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 28 November 2022 10:58:31
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD

Hi,

seems like this tracker issue [1] already covers your question. I'll
update the issue and add a link to our thread.

[1] https://tracker.ceph.com/issues/57767


Zitat von Frank Schilder :

> Hi Eugen,
>
> can you confirm that the silent corruption happens also on a
> collocated OSDc (everything on the same device) on pacific? The zap
> command should simply exit with "osd not down+out" or at least not
> do anything.
>
> If this accidentally destructive behaviour is still present, I think
> it is worth a ticket. Since I can't test on versions higher than
> octopus yet, could you then open the ticket?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Eugen Block 
> Sent: 23 November 2022 09:27:22
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: ceph-volume lvm zap destroyes up+in OSD
>
> Hi,
>
> I can confirm the behavior for Pacific version 16.2.7. I checked with
> a Nautilus test cluster and there it seems to work as expected. I
> tried to zap a db device and then restarted one of the OSDs,
> successfully. So there seems to be a regression somewhere. I didn't
> search for tracker issues yet, but this seems to be worth one, right?
>
> Zitat von Frank Schilder :
>
>> Hi all,
>>
>> on our octopus-latest cluster I accidentally destroyed an up+in OSD
>> with the command line
>>
>>   ceph-volume lvm zap /dev/DEV
>>
>> It executed the dd command and then failed at the lvm commands with
>> "device busy". Problem number one is, that the OSD continued working
>> fine. Hence, there is no indication of a corruption, its a silent
>> corruption. Problem number two - the real one - is, why is
>> ceph-colume not checking if the OSD that device belongs to is still
>> up+in? "ceph osd destroy" does that, for example. I believe to
>> remember that "ceph-volume lvm zap --osd-id" also checks, but I'm
>> not sure.
>>
>> Has this been changed in versions later than octopus?
>>
>> I think it is extremely dangerous to provide a tool that allows the
>> silent corruption of an entire ceph cluster. The corruption is only
>> discovered on restart and then it would be too late (unless there is
>> an in-official recovery procedure somewhere).
>>
>> I would prefer that ceph-volume lvm zap employs the same strict
>> sanity checks as other ceph-commands to avoid accidents. In my case
>> it was a typo, one wrong letter.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph networking

2022-11-28 Thread Jan Marek
Hello,

I have a CEPH cluster with 3 MONs and 6 OSD nodes with 72 OSDs.

I would like to have multiple client and backend networks. I currently
have 2x 10 Gbps and 2x 25 Gbps NICs in the nodes, and my idea is to
have:

- 2 client networks, for example 192.168.1.0/24 on the 10 Gbps NICs and
192.168.2.0/24 on the 25 Gbps NICs: one for my clients, one for asynchronous
syncing to another cluster

- 2 backend networks, say 10.0.1.0/24 on the 10 Gbps NICs and
10.0.2.0/24 on the 25 Gbps NICs, to have multiple backend paths and/or
more throughput.

Is this scenario realistic? If my clients are on the 192.168.1.0/24
network, will the MONs give them addresses of OSD nodes from the
192.168.1.0/24 network, or will they hand out addresses randomly?

Does anyone have advice on how to set up this networking
optimally?

Thanks a lot.

Sincerely
Jan Marek
-- 
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS Snapshot Mirroring slow due to repeating attribute sync

2022-11-28 Thread Venky Shankar
On Tue, Aug 23, 2022 at 10:01 PM Kuhring, Mathias
 wrote:
>
> Dear Ceph developers and users,
>
> We are using ceph version 17.2.1
> (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable).
> We are using cephadm since version 15 octopus.
>
> We mirror several CephFS directories from our main cluster out to a
> second mirror cluster.
> In particular with bigger directories (over 900 TB and 186 M of files),
> we noticed that mirroring is very slow.
> On the mirror, most of the time we only observe a write speed of 0 to 10
> MB/s in the client IO.
> The target peer directory often doesn't show an increase in size during
> synchronization
> (when we check with: getfattr -n ceph.dir.rbytes).
>
> The status of the syncs is always fine, i.e. syncing and not failing:
>
> 0|0[root@osd-1 /var/run/ceph/55633ec3-6c0c-4a02-990c-0f87e0f7a01f]# ceph
> --admin-daemon
> ceph-client.cephfs-mirror.osd-1.ydsqsw.7.94552861013544.asok fs mirror
> peer status cephfs@1 c66afb80-593f-4c42-a120-dd3b6fca26bc
> {
>  "/irods/sodar": {
>  "state": "syncing",
>  "current_sycning_snap": {
>  "id": 7552,
>  "name": "scheduled-2022-08-22-13_00_00"
>  },
>  "last_synced_snap": {
>  "id": 7548,
>  "name": "scheduled-2022-08-22-12_00_00",
>  "sync_duration": 37828.164744490001,
>  "sync_time_stamp": "13240678.542916s"
>  },
>  "snaps_synced": 1,
>  "snaps_deleted": 11,
>  "snaps_renamed": 0
>  }
> }
>
> The cluster nodes (6 per cluster) are connected with Dual 40G NICs to
> the switches.
> Connection between switches are 2x 100G.
> Simple write operations from other clients to the mirror cephfs result
> in writes of e.g. 300 to 400 MB/s.
> So network doesn't seem to be the issue, here.
>
> We started to dig into debug logs of the cephfs-mirror daemon / docker
> container.
> We set the debug level to 20. Otherwise there are no messages at all (so
> no errors).
>
> We observed a lot of messages with "need_data_sync=0, need_attr_sync=1".
> Leading us to the assumption, that instead of actual data a lot of
> attributes are synced.
>
> We started looking at specific examples in the logs and tried to make
> sense from the source code of which steps are happening.
> Most of the messages are coming from cephfs::mirror::PeerReplayer
> https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc
>
> We figured, the do_synchronize function checks if data (need_data_sync)
> or attributes (need_attr_sync) should be synchronized using
> should_sync_entry.
> And if necessary performs the sync using remote_file_op.
>
> should_sync_entry reports different ctimes for our examples, e.g.:
> local cur statx: mode=33152, uid=996, gid=993, size=154701172,
> ctime=2022-01-28T12:54:21.176004+, ...
> local prev statx: mode=33152, uid=996, gid=993, size=154701172,
> ctime=2022-08-22T11:03:18.578380+, ...
>
> Based on these different ctimes, should_sync_entry decides then that
> attributes need to be synced:
> *need_attr_sync = (cstx.stx_ctime != pstx.stx_ctime)
> https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc#L911
>
> We assume cur statx/cstx refers to the file in the snapshot currently
> mirrored.
> But what exactly is prev statx/pstx? Is it the peer path or the last
> snapshot on the mirror peer?
>
> We can confirm that ctimes are different on the main cluster and the mirror.
> On the main cluster, the ctimes are consistent in every snapshot (since
> the files didn't change).
> On the mirror, the ctimes increase with every snapshot towards more
> current dates.
>
> Given that the CephFS Mirror daemon writes the data to the mirror as a
> CephFS client,
> it seems to make sense that data on the mirror has different / more
> recent ctimes (from writing).
> Also, when the mirror daemon is syncing the attributes to the mirror,
> wouldn't this trigger a new/current ctime as well?
> So our assumption is, syncing an old ctime will actually result in a new
> ctime.
> And thus trigger the sync of attributes over and over (at least with
> every snapshot synced).
>
> So is ctime the proper parameter to test if attributes need to be synced?
> Or shouldn't it rather be excluded?
> So is this check the right thing to do: *need_attr_sync =
> (cstx.stx_ctime != pstx.stx_ctime)
>
> Is it reasonable to assume that these attribute syncs are responsible
> for our slow mirroring?
> Or is there anything else we should look out for?
>
> And are there actually commands or logs showing us the speed of the
> mirroring?
> We only know about sync_duration and sync_time_stamp (as in the status
> above).
> But then, how can we actually determine the size of a snapshot or the
> difference between snapshots?
> So one can make speed calculations for the latest sync.
>
> What is your general experience with mirroring performance?
>

[ceph-users] Re: Ceph networking

2022-11-28 Thread Stephen Smith6
The “Network Configuration Reference” is always a good place to start:
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

Multiple client networks are possible (see the “public_network” configuration option).

I believe you’d configure 2 “public_network”s:


  1.  For actual clients reading / writing data
  2.  For replication

You might also consider dedicating a network for object replication (“cluster_network”).
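
A rough sketch with the config database, using the CIDRs from your mail (daemons 
pick the change up on restart):

  ceph config set global public_network "192.168.1.0/24,192.168.2.0/24"
  ceph config set global cluster_network "10.0.1.0/24,10.0.2.0/24"
  # MON addresses live in the monmap, so changing the MON network needs separate handling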

RGW multisite: https://docs.ceph.com/en/quincy/radosgw/multisite/
RBD mirroring: https://docs.ceph.com/en/latest/rbd/rbd-mirroring/

Hope that helps.
Eric

From: Jan Marek 
Date: Monday, November 28, 2022 at 8:36 AM
To: ceph-users@ceph.io 
Subject: [EXTERNAL] [ceph-users] Ceph networking
Hello,

I have a CEPH cluster with 3 MONs and 6 OSD nodes with 72 OSDs.

I would like to have multiple client and backed networks. I have
now 2x 10Gbps and 2x25Gbps NIC in the nodes and my idea is to
have:

- 2 client network, for example 192.168.1.0/24 on 10Gbps NICs and
192.168.2.0/24 on 25Gbps NICs. One for my clients, one for asynchronous
syncing to another cluster

- 2 backend networks, say 10.0.1.0/24 on 10Gbps NICs and
10.0.2.0/24 on 25Gbps NICs to have multiple backend paths and/or
more throughput.

Is this scenario real? If my clients will be on 192.168.1.0/24
network, will mon give them a addresses of OSD nodes from
192.168.1.0/24 network, or it will give them addresses randomly?

Please, have someone advice, how to set this networking
optimally?

Thanks a lot.

Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS Snapshot Mirroring slow due to repeating attribute sync

2022-11-28 Thread Venky Shankar
Hi Mathias,

(apologies for the super late reply - I was getting back from a long
vacation and missed seeing this).

I updated the tracker ticket. Let's move the discussion there...
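
One quick check in the meantime is to compare the same file in the same snapshot on 
both clusters from a client mount (the mount points and snapshot name below are 
illustrative):

  stat /mnt/cephfs-main/irods/sodar/.snap/scheduled-2022-08-22-12_00_00/some/file
  stat /mnt/cephfs-mirror/irods/sodar/.snap/scheduled-2022-08-22-12_00_00/some/file
  # getfattr -n ceph.dir.rbytes on the snapshot directory gives a rough size,
  # assuming rstats are exposed on .snap directories in your setup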

On Mon, Nov 28, 2022 at 7:46 PM Venky Shankar  wrote:
>
> On Tue, Aug 23, 2022 at 10:01 PM Kuhring, Mathias
>  wrote:
> >
> > Dear Ceph developers and users,
> >
> > We are using ceph version 17.2.1
> > (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable).
> > We are using cephadm since version 15 octopus.
> >
> > We mirror several CephFS directories from our main cluster our to a
> > second mirror cluster.
> > In particular with bigger directories (over 900 TB and 186 M of files),
> > we noticed that mirroring is very slow.
> > On the mirror, most of the time we only observe a write speed of 0 to 10
> > MB/s in the client IO.
> > The target peer directory often doesn't show increase in size during
> > syncronization
> > (when we check with: getfattr -n ceph.dir.rbytes).
> >
> > The status of the syncs is always fine, i.e. syncing and not failing:
> >
> > 0|0[root@osd-1 /var/run/ceph/55633ec3-6c0c-4a02-990c-0f87e0f7a01f]# ceph
> > --admin-daemon
> > ceph-client.cephfs-mirror.osd-1.ydsqsw.7.94552861013544.asok fs mirror
> > peer status cephfs@1 c66afb80-593f-4c42-a120-dd3b6fca26bc
> > {
> >  "/irods/sodar": {
> >  "state": "syncing",
> >  "current_sycning_snap": {
> >  "id": 7552,
> >  "name": "scheduled-2022-08-22-13_00_00"
> >  },
> >  "last_synced_snap": {
> >  "id": 7548,
> >  "name": "scheduled-2022-08-22-12_00_00",
> >  "sync_duration": 37828.164744490001,
> >  "sync_time_stamp": "13240678.542916s"
> >  },
> >  "snaps_synced": 1,
> >  "snaps_deleted": 11,
> >  "snaps_renamed": 0
> >  }
> > }
> >
> > The cluster nodes (6 per cluster) are connected with Dual 40G NICs to
> > the switches.
> > Connection between switches are 2x 100G.
> > Simple write operations from other clients to the mirror cephfs result
> > in writes of e.g. 300 to 400 MB/s.
> > So network doesn't seem to be the issue, here.
> >
> > We started to dig into debug logs of the cephfs-mirror daemon / docker
> > container.
> > We set the debug level to 20. Otherwise there are no messages at all (so
> > no errors).
> >
> > We observed a lot of messages with "need_data_sync=0, need_attr_sync=1".
> > Leading us to the assumption, that instead of actual data a lot of
> > attributes are synced.
> >
> > We started looking at specific examples in the logsband tried to make
> > sence from the source code which steps are happening.
> > Most of the messages are coming from cephfs::mirror::PeerReplayer
> > https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc
> >
> > We figured, the do_synchronize function checks if data (need_data_sync)
> > or attributes (need_attr_sync) should be synchronized using
> > should_sync_entry.
> > And if necessary performs the sync using remote_file_op.
> >
> > should_sync_entry reports different ctimes for our examples, e.g.:
> > local cur statx: mode=33152, uid=996, gid=993, size=154701172,
> > ctime=2022-01-28T12:54:21.176004+, ...
> > local prev statx: mode=33152, uid=996, gid=993, size=154701172,
> > ctime=2022-08-22T11:03:18.578380+, ...
> >
> > Based on these different ctimes, should_sync_entry decides then that
> > attributes need to be synced:
> > *need_attr_sync = (cstx.stx_ctime != pstx.stx_ctime)
> > https://github.com/ceph/ceph/blob/6fee777d603aebce492c57b41f3b5760d50ddb07/src/tools/cephfs_mirror/PeerReplayer.cc#L911
> >
> > We assume cur statx/cstx refers to the file in the snapshot currently
> > mirrored.
> > But what exactly is prev statx/pstx? Is it the peer path or the last
> > snapshot on the mirror peer?
> >
> > We can confirm that ctimes are different on the main cluster and the mirror.
> > On the main cluster, the ctimes are consistent in every snapshot (since
> > the files didn't change).
> > On the the mirror, the ctimes increase with every snapshot towards more
> > current dates.
> >
> > Given that the CephFS Mirror daemon writes the data to the mirror as a
> > CephFS client,
> > it seems to make sense that data on the mirror has different / more
> > recent ctimes (from writing).
> > Also, when the mirror daemon is syncing the attributes to the mirror,
> > wouldn't this trigger an new/current ctime as well?
> > So our assumption is, syncing an old ctime will actually result in a new
> > ctime.
> > And thus trigger the sync of attributes over and over (at least with
> > every snapshot synced).
> >
> > So is ctime the proper parameter to test if attributes need to be synced?
> > Or shouldn't it rather be excluded?
> > So is this check the right thing to do: *need_attr_sync =
> > (cstx.stx_ctime != pstx.stx_ctime)
> >
> > Is it reasonable to assume that these attribute syncs are responsible
> > for our slo

[ceph-users] Re: Ceph networking

2022-11-28 Thread Anthony D'Atri

I’ve never done it myself, but the network config options for public/private 
should take a comma-separated list of CIDR blocks.

The client/public should be fine.

For the backend/private/replication network, that is likely overkill.  Are your 
OSDs SSDs or HDDs?  If you do go this route, be sure that all OSD nodes can 
reach each other across both networks.

But really what I suggest is bonding:

* Bond the two 10GE on each node for the public network
* Bond the two 25GE on each node for the private network

Or even the other way around. 

Trying to get traffic balanced with discrete 10 and 25 GE networks could be 
challenging.

> 
> Hello,
> 
> I have a CEPH cluster with 3 MONs and 6 OSD nodes with 72 OSDs.
> 
> I would like to have multiple client and backed networks. I have
> now 2x 10Gbps and 2x25Gbps NIC in the nodes and my idea is to
> have:
> 
> - 2 client network, for example 192.168.1.0/24 on 10Gbps NICs and
> 192.168.2.0/24 on 25Gbps NICs. One for my clients, one for asynchronous
> syncing to another cluster
> 
> - 2 backend networks, say 10.0.1.0/24 on 10Gbps NICs and
> 10.0.2.0/24 on 25Gbps NICs to have multiple backend paths and/or
> more throughput.
> 
> Is this scenario real? If my clients will be on 192.168.1.0/24
> network, will mon give them a addresses of OSD nodes from
> 192.168.1.0/24 network, or it will give them addresses randomly?
> 
> Please, have someone advice, how to set this networking
> optimally?
> 
> Thanks a lot.
> 
> Sincerely
> Jan Marek
> -- 
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-28 Thread Adrien Georget

Hi Xiubo,

I did a journal reset today followed by a session reset, and then the MDS 
was able to start without switching to read-only mode.

An MDS scrub was also useful to repair some bad inode backtraces.
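
For the record, the steps were roughly along the lines of the disaster-recovery 
tools in the docs (fs name and rank below are placeholders; the MDS was stopped 
first and a journal backup taken before the reset):

  cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
  cephfs-journal-tool --rank=cephfs:0 journal reset
  cephfs-table-tool all reset session
  # once the MDS is up again, scrub and repair the bad backtraces:
  ceph tell mds.cephfs:0 scrub start / recursive,repair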

Thanks again for your help with this issue!

Cheers,
Adrien

Le 26/11/2022 à 05:08, Xiubo Li a écrit :


On 25/11/2022 16:25, Adrien Georget wrote:

Hi Xiubo,

Thanks for your analysis.
Is there anything I can do to put CephFS back into a healthy state? Or 
should I wait for the patch that fixes that bug?


Please try to trim the journals and unmount all the clients first, and 
then see whether you can pull up the MDSs.


- Xiubo


Cheers,
Adrien

Le 25/11/2022 à 06:13, Xiubo Li a écrit :

Hi Adren,

Thank you for your logs.

From your logs I found one bug; I have raised a new tracker issue [1] 
to follow it and a Ceph PR [2] to fix it.


For more detail, please see my analysis in the tracker [1].

[1] https://tracker.ceph.com/issues/58082
[2] https://github.com/ceph/ceph/pull/49048

Thanks

- Xiubo


On 24/11/2022 16:33, Adrien Georget wrote:

Hi Xiubo,

We did the upgrade in rolling mode as always, with only a few 
Kubernetes pods as clients accessing their PVCs on CephFS.


I can reproduce the problem every time I restart the MDS daemon.
You can find the MDS log with debug_mds 25 and debug_ms 1 here : 
https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd 

(The last timestamp : 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 
mds.0.server force_clients_readonly)


I couldn't find any errors in the OSD logs; is there anything specific 
I should be looking for?


Best,
Adrien 








___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS stuck ops

2022-11-28 Thread Reed Dier
Hopefully someone will be able to point me in the right direction here:

Cluster is Octopus/15.2.17 on Ubuntu 20.04.
All are kernel cephfs clients, either 5.4.0-131-generic or 5.15.0-52-generic.
Cluster is nearfull, and more storage is coming, but still 2-4 weeks out from 
delivery.

> HEALTH_WARN 1 clients failing to respond to capability release; 1 clients 
> failing to advance oldest client/flush tid; 1 MDSs report slow requests; 2 
> MDSs behind on trimming; 28 nearfull osd(s); 8 pool(s) nearfull; (muted: 
> MDS_CLIENT_RECALL POOL_TOO_FEW_PGS POOL_TOO_MANY_PGS)
> [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability 
> release
> mds.mds1(mds.0): Client $client1 failing to respond to capability release 
> client_id: 2825526519
> [WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush 
> tid
> mds.mds1(mds.0): Client $client2 failing to advance its oldest 
> client/flush tid.  client_id: 2825533964
> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> mds.mds1(mds.0): 4 slow requests are blocked > 30 secs
> [WRN] MDS_TRIM: 2 MDSs behind on trimming
> mds.mds1(mds.0): Behind on trimming (13258/128) max_segments: 128, 
> num_segments: 13258
> mds.mds2(mds.0): Behind on trimming (13260/128) max_segments: 128, 
> num_segments: 13260
> [WRN] OSD_NEARFULL: 28 nearfull osd(s)

> cephfs - 121 clients
> ==
> RANK  STATE   MDS  ACTIVITY DNSINOS
>  0active   mds1   Reqs: 4303 /s  5905k  5880k
> 0-s   standby-replay   mds2   Evts:  244 /s  1483k   586k
> POOL   TYPE USED  AVAIL
> fs-metadata  metadata   243G  11.0T
>fs-hd3  data3191G  12.0T
>fs-ec73 data 169T  25.3T
>fs-ec82 data 211T  28.9T
> MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) 
> octopus (stable)

Pastebin of mds ops-in-flight: https://pastebin.com/5DqBDynj 


I seem to have about 43 MDS ops that are just stuck and not progressing, and 
I’m unsure how to unstick them and get everything back to a healthy state.
Comparing the client IDs for the stuck ops against ceph tell mds.$mds client 
ls, I don’t see any pattern pointing to a specific problematic client or kernel 
version.
The fs-metadata pool is on SSDs, while the data pools are on HDDs in various 
replication/EC configs.

I decreased the mds_cache_trim_decay_rate down to 0.9, but the num_segments 
just continues to climb.
I suspect that trimming may be queued behind some operation that is stuck.
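
For reference, the stuck ops and the trim state can be watched via the MDS admin 
socket, e.g.:

  ceph daemon mds.mds1 dump_ops_in_flight    # the ops in the pastebin above
  ceph daemon mds.mds1 dump_blocked_ops      # just the ops stuck past the complaint time
  ceph daemon mds.mds1 perf dump mds_log     # journal segment/event counters, to see if trimming moves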

I’ve considered bumping up the nearfull ratio to try and see if getting out 
of the synchronous-writes penalty makes any difference, but I assume something may 
be more deeply unhappy than just that.

Appreciate any pointers anyone can give.

Thanks,
Reed
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck ops

2022-11-28 Thread Venky Shankar
On Mon, Nov 28, 2022 at 10:19 PM Reed Dier  wrote:
>
> Hopefully someone will be able to point me in the right direction here:
>
> Cluster is Octopus/15.2.17 on Ubuntu 20.04.
> All are kernel cephfs clients, either 5.4.0-131-generic or 5.15.0-52-generic.
> Cluster is nearful, and more storage is coming, but still 2-4 weeks out from 
> delivery.
>
> > HEALTH_WARN 1 clients failing to respond to capability release; 1 clients 
> > failing to advance oldest client/flush tid; 1 MDSs report slow requests; 2 
> > MDSs behind on trimming; 28 nearfull osd(s); 8 pool(s) nearfull; (muted: 
> > MDS_CLIENT_RECALL POOL_TOO_FEW_PGS POOL_TOO_MANY_PGS)
> > [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability 
> > release
> > mds.mds1(mds.0): Client $client1 failing to respond to capability 
> > release client_id: 2825526519
> > [WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest 
> > client/flush tid
> > mds.mds1(mds.0): Client $client2 failing to advance its oldest 
> > client/flush tid.  client_id: 2825533964
> > [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> > mds.mds1(mds.0): 4 slow requests are blocked > 30 secs
> > [WRN] MDS_TRIM: 2 MDSs behind on trimming
> > mds.mds1(mds.0): Behind on trimming (13258/128) max_segments: 128, 
> > num_segments: 13258
> > mds.mds2(mds.0): Behind on trimming (13260/128) max_segments: 128, 
> > num_segments: 13260
> > [WRN] OSD_NEARFULL: 28 nearfull osd(s)
>
> > cephfs - 121 clients
> > ==
> > RANK  STATE   MDS  ACTIVITY DNSINOS
> >  0active   mds1   Reqs: 4303 /s  5905k  5880k
> > 0-s   standby-replay   mds2   Evts:  244 /s  1483k   586k
> > POOL   TYPE USED  AVAIL
> > fs-metadata  metadata   243G  11.0T
> >fs-hd3  data3191G  12.0T
> >fs-ec73 data 169T  25.3T
> >fs-ec82 data 211T  28.9T
> > MDS version: ceph version 15.2.17 
> > (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>
> Pastebin of mds ops-in-flight: https://pastebin.com/5DqBDynj 
> 

A good chunk of those are waiting for the directory to finish
fragmentation (split). I think those ops are not progressing since
fragmentation involves creating more objects in the metadata pool.

>
> I seem to have about 43 mds ops that are just stuck and not progressing, and 
> I’m unsure how to unstick the ops and get everything back to a healthy state.
> Comparing the client ID’s for the stuck ops against ceph tell mds.$mds client 
> ls, I don’t see any patterns for a specific problematic client(s) or kernel 
> version(s).
> The fs-metadata pool is on SSDs, while the data pools are on HDD’s in various 
> replication/EC configs.
>
> I decreased the mds_cache_trim_decay_rate down to 0.9, but the num_segments 
> just continues to climb.
> I suspect that trimming may be queued behind some operation that is stuck.

Update ops will involve appending to the mds journal consuming disk
space which you are already running out of.

>
> I’ve considered bumping up the nearful ratio up to try and see if getting out 
> of synchronous writes penalty makes any difference, but I assume something 
> may be more deeply unhappy than just that.
>
> Appreciate any pointers anyone can give.

If you have snapshots that are no longer required, maybe consider
deleting those?

>
> Thanks,
> Reed
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Cheers,
Venky

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck ops

2022-11-28 Thread Reed Dier
Hi Venky,

Thanks for responding.

> A good chunk of those are waiting for the directory to finish
> fragmentation (split). I think those ops are not progressing since
> fragmentation involves creating more objects in the metadata pool.

> Update ops will involve appending to the mds journal consuming disk
> space which you are already running out of.

So the metadata pool is on SSDs, which are not nearfull.
So I don’t believe that space should be an issue.

> POOL   ID  PGS   STORED   OBJECTS  USED %USED  MAX AVAIL
> fs-metadata1632   84 GiB   10.63M  251 GiB   0.74 11 TiB


But in the past I feel like all OSDs got implicated in the nearfull penalty.
Assuming that to be true, could the dirfrag split be slowed by the nearfull sync 
writes?
If so, maybe moving the nearfull needle temporarily could get the dirfrag split 
across the finish line, and then I can retreat to nearfull safety?
Is there a way to monitor dirfrag progress?
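
For the nearfull experiment, the temporary bump would be something like the 
following, reverted afterwards (0.90 is just an example value):

  ceph osd set-nearfull-ratio 0.90    # temporarily lift the nearfull/synchronous-write penalty
  ceph osd set-nearfull-ratio 0.85    # back to the default once the MDS has caught up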

> If you have snapshots that are no longer required, maybe consider
> deleting those?

There are actually no snapshots on cephfs, so that shouldn’t be an issue either.

> # ceph fs get cephfs
> Filesystem 'cephfs' (1)
> fs_name cephfs
> epoch   1081642
> flags   30
> created 2016-12-01T12:02:37.528559-0500
> modified2022-11-28T13:03:52.630590-0500
> tableserver 0
> root0
> session_timeout 60
> session_autoclose   300
> max_file_size   1099511627776
> min_compat_client   0 (unknown)
> last_failure0
> last_failure_osd_epoch  0
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file 
> layout v2,10=snaprealm v2}
> max_mds 1
> in  0
> up  {0=2824746206}
> failed
> damaged
> stopped
> data_pools  [17,37,40]
> metadata_pool   16
> inline_data disabled
> balancer
> standby_count_wanted1

Including the fs info in case there is a compat issue that stands out?
Only a single rank, with active/standby-replay MDS.

I also don’t have any MDS specific configs set, outside of 
mds_cache_memory_limit and mds_standby_replay,
So all of the mds_bal_* values should be defaults.

Again, appreciate the pointers.

Thanks,
Reed


> On Nov 28, 2022, at 11:41 AM, Venky Shankar  wrote:
> 
> On Mon, Nov 28, 2022 at 10:19 PM Reed Dier  > wrote:
>> 
>> Hopefully someone will be able to point me in the right direction here:
>> 
>> Cluster is Octopus/15.2.17 on Ubuntu 20.04.
>> All are kernel cephfs clients, either 5.4.0-131-generic or 5.15.0-52-generic.
>> Cluster is nearful, and more storage is coming, but still 2-4 weeks out from 
>> delivery.
>> 
>>> HEALTH_WARN 1 clients failing to respond to capability release; 1 clients 
>>> failing to advance oldest client/flush tid; 1 MDSs report slow requests; 2 
>>> MDSs behind on trimming; 28 nearfull osd(s); 8 pool(s) nearfull; (muted: 
>>> MDS_CLIENT_RECALL POOL_TOO_FEW_PGS POOL_TOO_MANY_PGS)
>>> [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability 
>>> release
>>>mds.mds1(mds.0): Client $client1 failing to respond to capability 
>>> release client_id: 2825526519
>>> [WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest 
>>> client/flush tid
>>>mds.mds1(mds.0): Client $client2 failing to advance its oldest 
>>> client/flush tid.  client_id: 2825533964
>>> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
>>>mds.mds1(mds.0): 4 slow requests are blocked > 30 secs
>>> [WRN] MDS_TRIM: 2 MDSs behind on trimming
>>>mds.mds1(mds.0): Behind on trimming (13258/128) max_segments: 128, 
>>> num_segments: 13258
>>>mds.mds2(mds.0): Behind on trimming (13260/128) max_segments: 128, 
>>> num_segments: 13260
>>> [WRN] OSD_NEARFULL: 28 nearfull osd(s)
>> 
>>> cephfs - 121 clients
>>> ==
>>> RANK  STATE   MDS  ACTIVITY DNSINOS
>>> 0active   mds1   Reqs: 4303 /s  5905k  5880k
>>> 0-s   standby-replay   mds2   Evts:  244 /s  1483k   586k
>>>POOL   TYPE USED  AVAIL
>>> fs-metadata  metadata   243G  11.0T
>>>   fs-hd3  data3191G  12.0T
>>>   fs-ec73 data 169T  25.3T
>>>   fs-ec82 data 211T  28.9T
>>> MDS version: ceph version 15.2.17 
>>> (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>> 
>> Pastebin of mds ops-in-flight: https://pastebin.com/5DqBDynj 
>>  > >
> 
> A good chunk of those are waiting for the directory to finish
> fragmentation (split). I think those ops are not progressing since
> fragmentation involves creating more objects in the metadata pool.
> 
>> 
>> I seem to have about 43 mds ops that are just stuck and not progressing, and 
>> I’m unsure how to unstick the ops and get everything back to a healthy state.
>> Comparing the client ID’s for the stuck ops a

[ceph-users] Re: MDS stuck ops

2022-11-28 Thread Frank Schilder
Hi Reed,

I sometimes had stuck MDS ops as well, making the journal trim stop and the 
metadata pool slowly run full. It's usually a race condition in the MDS ops 
queue, and re-scheduling the ops in the MDS queue resolves it. To achieve that, 
I usually try, in escalating order:

- Find the client causing the oldest stuck OP. Try a dropcaches and/or mount -o 
remount. This is the least disruptive but does not work often. If the blocked 
ops count goes down, proceed with the next client, if necessary.

- Kill the process that submitted the stuck OP (a process in D-state on the 
client, can be difficult to get it to die, I usually succeed by killing its 
parent). If this works, it usually helps, but does terminate a user process.

- Try to evict the client on the MDS side, but allow it to rejoin. This may 
require to clear the OSD blacklist fast after eviction. This tends to help but 
might lead to the client not being able to join, which in turn means a reboot.

- Fail the MDS with the oldest stuck OP/the dirfrag OP. This has resolved it 
for me in 100% of cases, but causes a short period of unavailable FS. The newly 
started MDS will have to replay the entire MDS journal, which in your case is a 
lot. I also have the meta data pool on SSD, but I had the pool full once and it 
took like 20 minutes to replay the journal (was way over 1 or 2TB by that 
time). In my case it didn't matter any more as the FS was unavailable any ways.
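
In command form, the eviction and fail-over steps look roughly like this (client 
id, address and rank are placeholders; on octopus the blocklist is still called 
blacklist):

  ceph tell mds.0 client ls                      # find the client id behind the stuck op
  ceph tell mds.0 client evict id=1234567        # evict it on the MDS side
  ceph osd blacklist ls                          # note the client's addr:port/nonce
  ceph osd blacklist rm 192.168.0.10:0/123456789 # clear it quickly so the client can rejoin
  ceph mds fail 0                                # last resort: fail the rank over to a standby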

I used to have a lot of problems with dirfrags all the time as well. They seem 
to cause race conditions. I got out of this by pinning directories to MDS 
ranks. You find my experience in the recent thread "MDS internal op exportdir 
despite ephemeral pinning". Since I pinned everything all problems are gone and 
performance is boosted. We are also on octopus.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Reed Dier 
Sent: 28 November 2022 19:14:55
To: Venky Shankar
Cc: ceph-users
Subject: [ceph-users] Re: MDS stuck ops

Hi Venky,

Thanks for responding.

> A good chunk of those are waiting for the directory to finish
> fragmentation (split). I think those ops are not progressing since
> fragmentation involves creating more objects in the metadata pool.

> Update ops will involve appending to the mds journal consuming disk
> space which you are already running out of.

So the metadata pool is on SSD’s, which are not nearful.
So I don’t believe that space should be an issue.

> POOL   ID  PGS   STORED   OBJECTS  USED %USED  MAX AVAIL
> fs-metadata1632   84 GiB   10.63M  251 GiB   0.74 11 TiB


But in the past I feel like all OSDs got implicated in the nearful penalty.
Assuming that to be true, could the dirfrag split be slowed by the nearful sync 
writes?
If so, maybe moving the nearful needle temporarily could get the dirfrag split 
across the finish line, and then I can retreat to nearful safety?
Is there a way to monitor dirfrag progress?

> If you have snapshots that are no longer required, maybe consider
> deleting those?

There are actually no snapshots on cephfs, so that shouldn’t be an issue either.

> # ceph fs get cephfs
> Filesystem 'cephfs' (1)
> fs_name cephfs
> epoch   1081642
> flags   30
> created 2016-12-01T12:02:37.528559-0500
> modified2022-11-28T13:03:52.630590-0500
> tableserver 0
> root0
> session_timeout 60
> session_autoclose   300
> max_file_size   1099511627776
> min_compat_client   0 (unknown)
> last_failure0
> last_failure_osd_epoch  0
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file 
> layout v2,10=snaprealm v2}
> max_mds 1
> in  0
> up  {0=2824746206}
> failed
> damaged
> stopped
> data_pools  [17,37,40]
> metadata_pool   16
> inline_data disabled
> balancer
> standby_count_wanted1

Including the fs info in case there is a compat issue that stands out?
Only a single rank, with active/standby-replay MDS.

I also don’t have any MDS specific configs set, outside of 
mds_cache_memory_limit and mds_standby_replay,
So all of the mds_bal_* values should be defaults.

Again, appreciate the pointers.

Thanks,
Reed


> On Nov 28, 2022, at 11:41 AM, Venky Shankar  wrote:
>
> On Mon, Nov 28, 2022 at 10:19 PM Reed Dier  > wrote:
>>
>> Hopefully someone will be able to point me in the right direction here:
>>
>> Cluster is Octopus/15.2.17 on Ubuntu 20.04.
>> All are kernel cephfs clients, either 5.4.0-131-generic or 5.15.0-52-generic.
>> Cluster is nearful, and more storage is coming, but still 2-4 weeks out from 
>> delivery.
>>
>>> HEALTH_WARN 1 clients failing to respond to capability release; 1 clients 
>>> failing to advance oldest client/flush tid; 1 MDSs report slow requests; 2

[ceph-users] Re: MDS stuck ops

2022-11-28 Thread Reed Dier
So, ironically, I did try and take some of these approaches here.

I first moved the nearfull goalpost to see if that made a difference; it did 
for client writes, but it did not unstick the metadata.

I did some hunting for some hung/waiting processes on some of the client nodes, 
and was able to whack a few of those.
Then I took the client IDs from the stuck ops in flight and looped through 
them with client evict, followed by 3 blocklist clears with a 1s sleep 
between each clear.
It got through about 6 or 7 of the clients, which appeared to handle 
reconnecting with the quick blocklist clear, before the MDS died and failed 
over to the standby-replay.
The good and bad part here is that at this point, everything unstuck.
All of the slow/stuck ops in flight disappeared, and a few stuck processes 
appeared to spring back to life now that io was flowing.
Both MDS started trimming, and all was well.
The bad part is that the “solution”, it appears, was just to bounce the MDS, 
which didn’t instinctively feel like the right hammer to swing, but alas.
And of course I reverted the nearfull ratio.

That said, I did upload the crash report: 
"2022-11-28T21:02:12.655542Z_c1fcfca7-bd08-4da8-abcd-f350cc59fb80”

Appreciate everyone’s input.

Thanks,
Reed

> On Nov 28, 2022, at 1:02 PM, Frank Schilder  wrote:
> 
> Hi Reed,
> 
> I sometimes stuck had MDS ops as well, making the journal trim stop and the 
> meta data pool running full slowly. Its usually a race condition in the MDS 
> ops queue and re-scheduling the OPS in the MDS queue resolves it. To achieve 
> that, I usually try in escalating order:
> 
> - Find the client causing the oldest stuck OP. Try a dropcaches and/or mount 
> -o remount. This is the least disruptive but does not work often. If the 
> blocked ops count goes down, proceed with the next client, if necessary.
> 
> - Kill the process that submitted the stuck OP (a process in D-state on the 
> client, can be difficult to get it to die, I usually succeed by killing its 
> parent). If this works, it usually helps, but does terminate a user process.
> 
> - Try to evict the client on the MDS side, but allow it to rejoin. This may 
> require to clear the OSD blacklist fast after eviction. This tends to help 
> but might lead to the client not being able to join, which in turn means a 
> reboot.
> 
> - Fail the MDS with the oldest stuck OP/the dirfrag OP. This has resolved it 
> for me in 100% of cases, but causes a short period of unavailable FS. The 
> newly started MDS will have to replay the entire MDS journal, which in your 
> case is a lot. I also have the meta data pool on SSD, but I had the pool full 
> once and it took like 20 minutes to replay the journal (was way over 1 or 2TB 
> by that time). In my case it didn't matter any more as the FS was unavailable 
> any ways.
> 
> I used to have a lot of problems with dirfrags all the time as well. They 
> seem to cause race conditions. I got out of this by pinning directories to 
> MDS ranks. You find my experience in the recent thread "MDS internal op 
> exportdir despite ephemeral pinning". Since I pinned everything all problems 
> are gone and performance is boosted. We are also on octopus.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Reed Dier <reed.d...@focusvq.com>
> Sent: 28 November 2022 19:14:55
> To: Venky Shankar
> Cc: ceph-users
> Subject: [ceph-users] Re: MDS stuck ops
> 
> Hi Venky,
> 
> Thanks for responding.
> 
>> A good chunk of those are waiting for the directory to finish
>> fragmentation (split). I think those ops are not progressing since
>> fragmentation involves creating more objects in the metadata pool.
> 
>> Update ops will involve appending to the mds journal consuming disk
>> space which you are already running out of.
> 
> So the metadata pool is on SSD’s, which are not nearful.
> So I don’t believe that space should be an issue.
> 
>> POOL   ID  PGS   STORED   OBJECTS  USED %USED  MAX AVAIL
>> fs-metadata1632   84 GiB   10.63M  251 GiB   0.74 11 TiB
> 
> 
> But in the past I feel like all OSDs got implicated in the nearful penalty.
> Assuming that to be true, could the dirfrag split be slowed by the nearful 
> sync writes?
> If so, maybe moving the nearful needle temporarily could get the dirfrag 
> split across the finish line, and then I can retreat to nearful safety?
> Is there a way to monitor dirfrag progress?
> 
>> If you have snapshots that are no longer required, maybe consider
>> deleting those?
> 
> There are actually no snapshots on cephfs, so that shouldn’t be an issue 
> either.
> 
>> # ceph fs get cephfs
>> Filesystem 'cephfs' (1)
>> fs_name cephfs
>> epoch   1081642
>> flags   30
>> created 2016-12-01T12:02:37.528559-0500
>> modified2022-11-28T13:03:52.630590-0500
>> tableserver 0
>> root0
>> session_timeout 60
>> session_autoclose  

[ceph-users] Re: MDS stuck ops

2022-11-28 Thread Frank Schilder
Hi Reed,

forget what I wrote about pinning; you use only 1 active MDS, so it won't change 
anything. I think the problem you are facing is with the standby-replay daemon 
mode. I used that in the past too, but found that it didn't actually help 
with fail-over speed to begin with. On top of that, the replay does not seem to be 
rock-solid and ops got stuck.

In the end I reverted to the simple active+standby daemons and never had 
problems again. My impression is that fail-over is actually faster to a normal 
standby than to a standby-replay daemon. I'm not sure in which scenario 
standby-replay improves things, I just never saw a benefit on our cluster.
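
Switching back is a one-liner if you want to try it (your file system is called 
cephfs according to your earlier output):

  ceph fs set cephfs allow_standby_replay false   # the standby-replay daemon drops back to a plain standby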

During out-of-office hours I usually go straight for an MDS fail in case of 
problems. During work hours I make an attempt to be nice before failing an MDS. 
On our cluster, though, we have 8 active MDS daemons and everything pinned to 
ranks, so if I fail an MDS, only 1/8th of the users notice (except maybe rank 
0). The fail-over is usually fast enough that I don't get complaints. We have 
ca. 1700 kernel clients, and it takes a few minutes for the new MDS to become active.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Reed Dier 
Sent: 28 November 2022 22:43:12
To: ceph-users
Cc: Venky Shankar; Frank Schilder
Subject: Re: [ceph-users] MDS stuck ops

So, ironically, I did try and take some of these approaches here.

I first moved the nearfull goalpost to see if that made a difference, it did 
for client writes, but not for the metadata to unstick.

I did some hunting for some hung/waiting processes on some of the client nodes, 
and was able to whack a few of those.
Then, finding the stuck ops in flight, taking the client ID’s, and looping 
through with client evict, followed by 3 blocklist clears with a 1s sleep 
between each blocklist clear.
It got through about 6 or 7 of the clients, which appeared to handle 
reconnecting with the quick blocklist clear, before the MDS died and failed to 
the standby-replay.
The good and bad part here is that at this point, everything unstuck.
All of the slow/stuck ops in flight disappeared, and a few stuck processes 
appeared to spring back to life now that io was flowing.
Both MDS started trimming, and all was well.
The bad part is that the “solution" was to just bounce the MDS it appears, 
which didn’t instinctively feel like the right hammer to swing, but alas.
And of course revert the nearfull ratio.

That said, I did upload the crash report: 
"2022-11-28T21:02:12.655542Z_c1fcfca7-bd08-4da8-abcd-f350cc59fb80”

Appreciate everyone’s input.

Thanks,
Reed

On Nov 28, 2022, at 1:02 PM, Frank Schilder <fr...@dtu.dk> wrote:

Hi Reed,

I sometimes stuck had MDS ops as well, making the journal trim stop and the 
meta data pool running full slowly. Its usually a race condition in the MDS ops 
queue and re-scheduling the OPS in the MDS queue resolves it. To achieve that, 
I usually try in escalating order:

- Find the client causing the oldest stuck OP. Try a dropcaches and/or mount -o 
remount. This is the least disruptive but does not work often. If the blocked 
ops count goes down, proceed with the next client, if necessary.

- Kill the process that submitted the stuck OP (a process in D-state on the 
client, can be difficult to get it to die, I usually succeed by killing its 
parent). If this works, it usually helps, but does terminate a user process.

- Try to evict the client on the MDS side, but allow it to rejoin. This may 
require to clear the OSD blacklist fast after eviction. This tends to help but 
might lead to the client not being able to join, which in turn means a reboot.

- Fail the MDS with the oldest stuck OP/the dirfrag OP. This has resolved it 
for me in 100% of cases, but causes a short period of unavailable FS. The newly 
started MDS will have to replay the entire MDS journal, which in your case is a 
lot. I also have the meta data pool on SSD, but I had the pool full once and it 
took like 20 minutes to replay the journal (was way over 1 or 2TB by that 
time). In my case it didn't matter any more as the FS was unavailable any ways.

I used to have a lot of problems with dirfrags all the time as well. They seem 
to cause race conditions. I got out of this by pinning directories to MDS 
ranks. You find my experience in the recent thread "MDS internal op exportdir 
despite ephemeral pinning". Since I pinned everything all problems are gone and 
performance is boosted. We are also on octopus.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Reed Dier <reed.d...@focusvq.com>
Sent: 28 November 2022 19:14:55
To: Venky Shankar
Cc: ceph-users
Subject: [ceph-users] Re: MDS stuck ops

Hi Venky,

Thanks for responding.

A good chunk of those are waiting for the directory to finish
fragmentation (split). I think those ops are not progressing since

[ceph-users] Re: filesystem became read only after Quincy upgrade

2022-11-28 Thread Xiubo Li


On 28/11/2022 23:21, Adrien Georget wrote:

Hi Xiubo,

I did a journal reset today followed by session reset and then the MDS 
was able to start without switching to readonly mode.

A MDS scrub was also usefull to repair some bad inode backtrace.

Thanks again for your help with this issue!


Cool!

- Xiubo



Cheers,
Adrien

Le 26/11/2022 à 05:08, Xiubo Li a écrit :


On 25/11/2022 16:25, Adrien Georget wrote:

Hi Xiubo,

Thanks for your analysis.
Is there anything I can do to put CephFS back in healthy state? Or 
should I wait for to patch to fix that bug?


Please try to trim the journals and umount all the clients first, and 
then to see could you pull up the MDSs.


- Xiubo


Cheers,
Adrien

Le 25/11/2022 à 06:13, Xiubo Li a écrit :

Hi Adren,

Thank you for your logs.

From your logs I found one bug and I have raised one new tracker 
[1] to follow it, and raised a ceph PR [2] to fix this.


More detail please my analysis in the tracker [2].

[1] https://tracker.ceph.com/issues/58082
[2] https://github.com/ceph/ceph/pull/49048

Thanks

- Xiubo


On 24/11/2022 16:33, Adrien Georget wrote:

Hi Xiubo,

We did the upgrade in rolling mode as always, with only few 
kubernetes pods as clients accessing their PVC on CephFS.


I can reproduce the problem everytime I restart the MDS daemon.
You can find the MDS log with debug_mds 25 and debug_ms 1 here : 
https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd 

(The last timestamp : 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 
mds.0.server force_clients_readonly)


I couldn't find any errors in the OSD logs, anything specific 
should I looking for?


Best,
Adrien 










___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck ops

2022-11-28 Thread Venky Shankar
Hi Reed,

On Tue, Nov 29, 2022 at 3:13 AM Reed Dier  wrote:
>
> So, ironically, I did try and take some of these approaches here.
>
> I first moved the nearfull goalpost to see if that made a difference, it did 
> for client writes, but not for the metadata to unstick.
>
> I did some hunting for some hung/waiting processes on some of the client 
> nodes, and was able to whack a few of those.
> Then, finding the stuck ops in flight, taking the client ID’s, and looping 
> through with client evict, followed by 3 blocklist clears with a 1s sleep 
> between each blocklist clear.
> It got through about 6 or 7 of the clients, which appeared to handle 
> reconnecting with the quick blocklist clear, before the MDS died and failed 
> to the standby-replay.
> The good and bad part here is that at this point, everything unstuck.
> All of the slow/stuck ops in flight disappeared, and a few stuck processes 
> appeared to spring back to life now that io was flowing.

It seems you probably hit a bug in the MDS that didn't allow it to
progress with client I/O and/or trimming (after adjusting nearfull
ratio).

> Both MDS started trimming, and all was well.
> The bad part is that the “solution" was to just bounce the MDS it appears, 
> which didn’t instinctively feel like the right hammer to swing, but alas.
> And of course revert the nearfull ratio.
>
> That said, I did upload the crash report: 
> "2022-11-28T21:02:12.655542Z_c1fcfca7-bd08-4da8-abcd-f350cc59fb80”

This should help - thanks! I'll have a look (when drop.ceph.com is
reachable for me).

>
> Appreciate everyone’s input.
>
> Thanks,
> Reed
>
> On Nov 28, 2022, at 1:02 PM, Frank Schilder  wrote:
>
> Hi Reed,
>
> I sometimes stuck had MDS ops as well, making the journal trim stop and the 
> meta data pool running full slowly. Its usually a race condition in the MDS 
> ops queue and re-scheduling the OPS in the MDS queue resolves it. To achieve 
> that, I usually try in escalating order:
>
> - Find the client causing the oldest stuck OP. Try a dropcaches and/or mount 
> -o remount. This is the least disruptive but does not work often. If the 
> blocked ops count goes down, proceed with the next client, if necessary.
>
> - Kill the process that submitted the stuck OP (a process in D-state on the 
> client, can be difficult to get it to die, I usually succeed by killing its 
> parent). If this works, it usually helps, but does terminate a user process.
>
> - Try to evict the client on the MDS side, but allow it to rejoin. This may 
> require to clear the OSD blacklist fast after eviction. This tends to help 
> but might lead to the client not being able to join, which in turn means a 
> reboot.
>
> - Fail the MDS with the oldest stuck OP/the dirfrag OP. This has resolved it 
> for me in 100% of cases, but causes a short period of unavailable FS. The 
> newly started MDS will have to replay the entire MDS journal, which in your 
> case is a lot. I also have the meta data pool on SSD, but I had the pool full 
> once and it took like 20 minutes to replay the journal (was way over 1 or 2TB 
> by that time). In my case it didn't matter any more as the FS was unavailable 
> any ways.
>
> I used to have a lot of problems with dirfrags all the time as well. They 
> seem to cause race conditions. I got out of this by pinning directories to 
> MDS ranks. You find my experience in the recent thread "MDS internal op 
> exportdir despite ephemeral pinning". Since I pinned everything all problems 
> are gone and performance is boosted. We are also on octopus.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Reed Dier 
> Sent: 28 November 2022 19:14:55
> To: Venky Shankar
> Cc: ceph-users
> Subject: [ceph-users] Re: MDS stuck ops
>
> Hi Venky,
>
> Thanks for responding.
>
> A good chunk of those are waiting for the directory to finish
> fragmentation (split). I think those ops are not progressing since
> fragmentation involves creating more objects in the metadata pool.
>
>
> Update ops will involve appending to the mds journal consuming disk
> space which you are already running out of.
>
>
> So the metadata pool is on SSD’s, which are not nearful.
> So I don’t believe that space should be an issue.
>
> POOL   ID  PGS   STORED   OBJECTS  USED %USED  MAX AVAIL
> fs-metadata1632   84 GiB   10.63M  251 GiB   0.74 11 TiB
>
>
>
> But in the past I feel like all OSDs got implicated in the nearfull penalty.
> Assuming that to be true, could the dirfrag split be slowed by the nearfull 
> sync writes?
> If so, maybe moving the nearfull needle temporarily could get the dirfrag 
> split across the finish line, and then I can retreat to nearfull safety?
> Is there a way to monitor dirfrag progress?
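
(If anyone does try moving that needle, it is one command each way; a sketch,
assuming the default 0.85 ratio is what you return to:)

# temporarily relax the nearfull threshold
ceph osd set-nearfull-ratio 0.90
# retreat to the default once the dirfrag split has finished
ceph osd set-nearfull-ratio 0.85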
>
> If you have snapshots that are no longer required, maybe consider
> deleting those?
>
>
> There are actually no snapshots on cephfs, so that shouldn’t be an is

[ceph-users] Re: MDS stuck ops

2022-11-28 Thread Venky Shankar
Hi Frank,

On Tue, Nov 29, 2022 at 12:32 AM Frank Schilder  wrote:
>
> Hi Reed,
>
> I sometimes had stuck MDS ops as well, making the journal trim stop and the 
> metadata pool slowly run full. It's usually a race condition in the MDS 
> ops queue, and re-scheduling the ops in the MDS queue resolves it. To achieve 
> that, I usually try the following, in escalating order:
>
> - Find the client causing the oldest stuck OP. Try a dropcaches and/or mount 
> -o remount. This is the least disruptive but does not work often. If the 
> blocked ops count goes down, proceed with the next client, if necessary.
>
> - Kill the process that submitted the stuck OP (a process in D-state on the 
> client, can be difficult to get it to die, I usually succeed by killing its 
> parent). If this works, it usually helps, but does terminate a user process.
>
> - Try to evict the client on the MDS side, but allow it to rejoin. This may 
> require to clear the OSD blacklist fast after eviction. This tends to help 
> but might lead to the client not being able to join, which in turn means a 
> reboot.
>
> - Fail the MDS with the oldest stuck OP/the dirfrag OP. This has resolved it 
> for me in 100% of cases, but causes a short period of unavailable FS. The 
> newly started MDS will have to replay the entire MDS journal, which in your 
> case is a lot. I also have the meta data pool on SSD, but I had the pool full 
> once and it took like 20 minutes to replay the journal (was way over 1 or 2TB 
> by that time). In my case it didn't matter any more as the FS was unavailable 
> any ways.
>
> I used to have a lot of problems with dirfrags all the time as well. They 
> seem to cause race conditions. I got out of this by pinning directories to 
> MDS ranks. You find my experience in the recent thread "MDS internal op 
> exportdir despite ephemeral pinning". Since I pinned everything all problems 
> are gone and performance is boosted. We are also on octopus.

You most likely ran into performance issues with distributed ephemeral
pins with octopus. It'd be nice to try out one of the latest releases
for this.
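
For readers who want to try the pinning Frank describes, both variants are
plain extended attributes on the directories; a sketch, with mount point,
paths and rank as placeholders (static pins to a non-zero rank only make sense
with max_mds > 1, and distributed ephemeral pins need
mds_export_ephemeral_distributed enabled):

# statically pin a directory tree to MDS rank 1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects
# or spread the immediate children of a directory across ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home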

>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Reed Dier 
> Sent: 28 November 2022 19:14:55
> To: Venky Shankar
> Cc: ceph-users
> Subject: [ceph-users] Re: MDS stuck ops
>
> Hi Venky,
>
> Thanks for responding.
>
> > A good chunk of those are waiting for the directory to finish
> > fragmentation (split). I think those ops are not progressing since
> > fragmentation involves creating more objects in the metadata pool.
>
> > Update ops will involve appending to the mds journal consuming disk
> > space which you are already running out of.
>
> So the metadata pool is on SSD’s, which are not nearful.
> So I don’t believe that space should be an issue.
>
> > POOL   ID  PGS   STORED   OBJECTS  USED %USED  MAX AVAIL
> > fs-metadata1632   84 GiB   10.63M  251 GiB   0.74 11 TiB
>
>
> But in the past I feel like all OSDs got implicated in the nearfull penalty.
> Assuming that to be true, could the dirfrag split be slowed by the nearfull 
> sync writes?
> If so, maybe moving the nearfull needle temporarily could get the dirfrag 
> split across the finish line, and then I can retreat to nearfull safety?
> Is there a way to monitor dirfrag progress?
>
> > If you have snapshots that are no longer required, maybe consider
> > deleting those?
>
> There are actually no snapshots on cephfs, so that shouldn’t be an issue 
> either.
>
> > # ceph fs get cephfs
> > Filesystem 'cephfs' (1)
> > fs_name cephfs
> > epoch   1081642
> > flags   30
> > created 2016-12-01T12:02:37.528559-0500
> > modified2022-11-28T13:03:52.630590-0500
> > tableserver 0
> > root0
> > session_timeout 60
> > session_autoclose   300
> > max_file_size   1099511627776
> > min_compat_client   0 (unknown)
> > last_failure0
> > last_failure_osd_epoch  0
> > compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> > ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> > uses versioned encoding,6=dirfrag is stored in omap,8=no anchor 
> > table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in  0
> > up  {0=2824746206}
> > failed
> > damaged
> > stopped
> > data_pools  [17,37,40]
> > metadata_pool   16
> > inline_data disabled
> > balancer
> > standby_count_wanted1
>
> Including the fs info in case there is a compat issue that stands out?
> Only a single rank, with active/standby-replay MDS.
>
> I also don't have any MDS-specific configs set, outside of 
> mds_cache_memory_limit and mds_standby_replay,
> so all of the mds_bal_* values should be defaults.
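
(If it helps, the values a running MDS actually uses can be double-checked
with something like the following; the MDS name is a placeholder:)

ceph config show mds.<name> | grep mds_bal
# or, on the MDS host itself
ceph daemon mds.<name> config show | grep mds_bal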
>
> Again, appreciate the pointers.
>
> Thanks,
> Reed
>
>
> > On Nov 28, 2022, at 11:41 AM, Venky Shankar  wrote:
> >
> > On Mon, Nov 28, 2022 at 10:19 PM Reed Dier wrote:
> >>
> >> H

[ceph-users] PGs stuck down

2022-11-28 Thread Wolfpaw - Dale Corse
Hi All,

We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
do very well :( We ended up with 98% of PGs down.

This setup has 2 data centers defined, with 4 copies across both, and a
minimum size (min_size) of 1. We have 1 mon/mgr in each DC, with one in a 3rd
data center connected to each of the other 2 by VPN.

When I did a pg query on the PGs that were stuck, it said they were blocked
from coming up because they couldn't contact 2 of the OSDs (located in the
other data center that it was unable to reach), but the other 2 were fine.

I'm at a loss, because this was exactly the thing we thought we had set it up
to prevent; with size = 4 and min_size = 1 I understood that it
would continue without a problem? :(
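
(The relevant checks, for anyone following along; a sketch, with the pool
name and pgid as placeholders:)

ceph osd pool get <pool> size
ceph osd pool get <pool> min_size
# "blocked_by" and the peering history for a down PG
ceph pg <pgid> query
ceph pg dump_stuck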

 

Crush map is below .. if anyone has any ideas, I would sincerely appreciate it :)

Thanks!
Dale

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class ssd
device 29 osd.29 class ssd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd
device 34 osd.34 class ssd
device 35 osd.35 class ssd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd
device 40 osd.40 class ssd
device 41 osd.41 class ssd
device 42 osd.42 class ssd
device 43 osd.43 class ssd
device 44 osd.44 class ssd
device 45 osd.45 class ssd
device 46 osd.46 class ssd
device 47 osd.47 class ssd
device 49 osd.49 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host Pnode01 {
    id -8   # do not change unnecessarily
    id -9 class ssd # do not change unnecessarily
    # weight 0.000
    alg straw2
    hash 0  # rjenkins1
}
host node01 {
    id -2   # do not change unnecessarily
    id -15 class ssd    # do not change unnecessarily
    # weight 14.537
    alg straw2
    hash 0  # rjenkins1
    item osd.4 weight 1.817
    item osd.1 weight 1.817
    item osd.3 weight 1.817
    item osd.2 weight 1.817
    item osd.6 weight 1.817
    item osd.9 weight 1.817
    item osd.5 weight 1.817
    item osd.0 weight 1.818
}
host node02 {
    id -3   # do not change unnecessarily
    id -16 class ssd    # do not change unnecessarily
    # weight 14.536
    alg straw2
    hash 0  # rjenkins1
    item osd.10 weight 1.817
    item osd.11 weight 1.817
    item osd.12 weight 1.817
    item osd.13 weight 1.817
    item osd.14 weight 1.817
    item osd.15 weight 1.817
    item osd.16 weight 1.817
    item osd.19 weight 1.817
}
host node03 {
    id -4   # do not change unnecessarily
    id -17 class ssd    # do not change unnecessarily
    # weight 14.536
    alg straw2
    hash 0  # rjenkins1
    item osd.20 weight 1.817
    item osd.21 weight 1.817
    item osd.22 weight 1.817
    item osd.23 weight 1.817
    item osd.25 weight 1.817
    item osd.26 weight 1.817
    item osd.29 weight 1.817
    item osd.24 weight 1.817
}
datacenter EDM1 {
    id -11  # do not change unnecessarily
    id -14 class ssd    # do not change unnecessarily
    # weight 43.609
    alg straw
    hash 0  # rjenkins1
    item node01 weight 14.537
    item node02 weight 14.536
    item node03 weight 14.536
}
host node04 {
    id -5   # do not change unnecessarily
    id -18 class ssd    # do not change unnecessarily
    # weight 14.536
    alg straw2
    hash 0  # rjenkins1
    item osd.30 weight 1.817
    item osd.31 weight 1.817
    item osd.32 weight 1.817
    item osd.33 weight 1.817
    item osd.34 we
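
The rule section is cut off above; for context, a replicated rule that places
two copies in each of two datacenters typically has this shape (a sketch only,
not taken from this cluster; the rule name and the "default" root are
assumptions):

rule replicated_2dc {
    id 1
    type replicated
    # pick both datacenters, then two hosts inside each of them
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}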

[ceph-users] Re: PGs stuck down

2022-11-28 Thread Yanko Davila
Hi Dale

Can you please post the ceph status? I'm no expert, but I would make sure that 
the datacenter you intend to keep operating (while the connection gets 
reestablished) has two active monitors. Thanks.

Yanko.


> On Nov 29, 2022, at 7:20 AM, Wolfpaw - Dale Corse  wrote:
> 
> Hi All,
> 
> 
> 
> We had a fiber cut tonight between 2 data centers, and a ceph cluster didn't
> do very well :( We ended up with 98% of PGs as down.
> 
> 
> 
> This setup has 2 data centers defined, with 4 copies across both, and a
> minimum of size of 1.  We have 1 mon/mgr in each DC, with one in a 3rd data
> center connected to each of the other 2 by VPN.
> 
> 
> 
> When I did a pg query on the PG's that were stuck it said they were blocked
> from coming up because they couldn't contact 2 of the OSDs (located in the
> other data center that it was unable to reach).. but the other 2 were fine.
> 
> 
> 
> I'm at a loss because this was exactly the thing we thought we had set it up
> to prevent.. and with size = 4 and min_size set = 1 I understood that it
> would continue without a problem? :(
> 
> 
> 
> Crush map is below .. if anyone has any ideas? I would sincerely appreciate
> it :)
> 
> 
> 
> Thanks!
> 
> Dale

[ceph-users] Ceph Orchestrator (cephadm) stopped doing something

2022-11-28 Thread Volker Racho
Hi,

ceph orch commands are no longer executed in my cephadm-managed cluster
(17.2.3), and I don't see why. The cluster is healthy and working overall,
except for the orchestrator part.

For instance, when I run `ceph orch redeploy ingress.rgw.default`, I see
the command in the audit log, and cephadm also logs the command and
"_kick_serve_loop", but that's it. There are no further messages or errors
(also not in the logs at debug level: ceph config set mgr
mgr/cephadm/log_to_cluster_level debug; ceph -W cephadm --watch-debug), but it
never redeploys the service.

Nov 21 07:54:45 ceph-0.yy..net bash[1262]: debug
2022-11-21T07:54:45.397+ 7f7b6b527700  0 log_channel(audit) log [DBG] :
from='client.38766115 -' entity='client.admin' cmd=[{"prefix": "orch",
"action": "redeploy", "service_nam
Nov 21 07:54:45 ceph-0.yy..net bash[1262]: debug
2022-11-21T07:54:45.401+ 7f7b6bd28700  0 [cephadm INFO root] Redeploy
service ingress.rgw.default
Nov 21 07:54:45 ceph-0.yy..net bash[1262]: debug
2022-11-21T07:54:45.401+ 7f7b6bd28700  0 log_channel(cephadm) log [INF]
: Redeploy service ingress.rgw.default
Nov 21 07:54:45 ceph-0.yy..net bash[1262]: debug
2022-11-21T07:54:45.401+ 7f7b6bd28700  0 log_channel(cephadm) log [DBG]
: _kick_serve_loop
Nov 21 07:54:45 ceph-0.yy..net bash[1262]: debug
2022-11-21T07:54:45.401+ 7f7b6bd28700  0 log_channel(cephadm) log [DBG]
: _kick_serve_loop

Same behaviour for many other ceph orch ... commands, including ceph orch
upgrade.

# ceph orch status
Backend: cephadm
Available: Yes
Paused: No

According to the status, the orchestrator is available and not paused. I have
tried setting the backend to "" and back to "cephadm", pausing and resuming
the orchestrator, clearing progress entries and so on, but nothing could make
the cluster execute the commands. SSH connections between hosts are working.
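
One step not listed above that often unsticks a stalled cephadm serve loop is
restarting the active mgr itself, since cephadm runs inside it; a sketch (the
follow-up commands are only there to confirm the module wakes up again):

# fail over to a standby mgr; this restarts the cephadm module and its serve loop
ceph mgr fail
# confirm the orchestrator is back and watch for new activity
ceph orch status
ceph -W cephadm --watch-debug
ceph log last cephadm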

Any ideas on how to fix, or even debug, this? I am a bit lost.

Regards, SW.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io