[ceph-users] Re: bidirectional rbd-mirroring

2023-01-18 Thread Aielli, Elia
That's an idea. Moreover, I've discovered my two clusters can't be named the same (i.e. ceph), so I have to change the cluster name with an environment variable in /etc/default/ceph (I've deployed ceph via Proxmox 6.x and that's the name it gives by default to my cluster). This is kind of an issue cause i
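For illustration only, a minimal sketch of the override mentioned above, assuming a Debian-style /etc/default/ceph that the Ceph systemd units read; the cluster name "backup" is a placeholder:
```
# /etc/default/ceph  (sketch; "backup" is a placeholder cluster name)
# The ceph systemd units source this file, so CLUSTER set here overrides
# the default cluster name "ceph".
CLUSTER=backup
```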

[ceph-users] 17.2.5 ceph fs status: AssertionError

2023-01-18 Thread Robert Sander
Hi, I have a healthy (test) cluster running 17.2.5: root@cephtest20:~# ceph status cluster: id: ba37db20-2b13-11eb-b8a9-871ba11409f6 health: HEALTH_OK services: mon: 3 daemons, quorum cephtest31,cephtest41,cephtest21 (age 2d) mgr: cephtest22.lqzdnk(acti

[ceph-users] Re: Stable erasure coding CRUSH rule for multiple hosts?

2023-01-18 Thread Eugen Block
Hi, I only have one remark on your assumption regarding maintenance with your current setup. With your profile k4 m2 you'd have a min_size of 5 (k + 1 which is recommended), taking one host down would still result in IO pause because min_size is not met. To allow IO you'd need to reduce m
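A minimal sketch of the min_size adjustment hinted at above, assuming a pool named "ecpool" (placeholder) using the k=4, m=2 profile:
```
# Temporarily allow I/O with only k shards available during maintenance,
# then restore the recommended k+1 afterwards. Pool name is a placeholder.
ceph osd pool set ecpool min_size 4
# ... perform the host maintenance ...
ceph osd pool set ecpool min_size 5
```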

[ceph-users] Re: Ceph Community Infrastructure Outage

2023-01-18 Thread Marc
> As services grew, we relied more and more on its legacy storage solution, which was never migrated to Ceph. Over the last few months, this legacy storage solution had several instances of silent data corruption, rendering the VMs unbootable, taking down various services, and requiring res

[ceph-users] [RFC] Detail view of OSD network I/O

2023-01-18 Thread Nico Schottelius
Good morning ceph community, for quite some time I have been wondering whether it would not make sense to add an iftop-like interface to ceph that shows network traffic / IOPS on a per-IP basis. I am aware of "rbd perf image iotop", however I am much more interested in a combined metric featuring 1) Wh
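For context, the existing per-image view referenced above can be invoked as below (pool name is a placeholder); it only covers RBD images, not the combined per-IP view being proposed:
```
# Existing per-image I/O top view (needs the rbd_support mgr module enabled);
# "rbd" is a placeholder pool name.
rbd perf image iotop --pool rbd
```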

[ceph-users] Re: Ceph-ansible: add a new HDD to an already provisioned WAL device

2023-01-18 Thread Guillaume Abrioux
Hi Len, Indeed, this is not possible with ceph-ansible. One option would be to do it manually with `ceph-volume lvm migrate`: (Note that it can be tedious given that it requires a lot of manual operations, especially for clusters with a large number of OSDs.) Initial setup: ``` # cat group_vars/
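A rough sketch of the manual `ceph-volume lvm migrate` route mentioned above; the OSD id, fsid and target logical volume are placeholders, and the OSD should be stopped before migrating:
```
# Sketch only; adjust the OSD id/fsid and the new DB/WAL logical volume
# to match your deployment.
systemctl stop ceph-osd@2
ceph-volume lvm migrate --osd-id 2 --osd-fsid <osd-fsid> \
    --from db wal --target new-db-vg/new-db-lv
systemctl start ceph-osd@2
```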

[ceph-users] Re: MDS stuck in "up:replay"

2023-01-18 Thread Kotresh Hiremath Ravishankar
Hi Thomas, This looks like it requires more investigation than I expected. What's the current status? Did the crashed mds come back and become active? Increase the debug log level to 20 and share the mds logs. I will create a tracker and share it here. You can upload the mds logs there. Thanks
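A small sketch of raising the MDS debug level as requested; the values below are commonly used ones, not taken from the thread:
```
# Raise MDS logging for the investigation, then revert to the defaults.
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1
# ... reproduce the issue and collect /var/log/ceph/ceph-mds.*.log ...
ceph config set mds debug_mds 1/5
ceph config set mds debug_ms 0
```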

[ceph-users] Re: MDS stuck in "up:replay"

2023-01-18 Thread Kotresh Hiremath Ravishankar
Hi Thomas, I have created the tracker https://tracker.ceph.com/issues/58489 to track this. Please upload the debug mds logs here. Thanks, Kotresh H R On Wed, Jan 18, 2023 at 4:56 PM Kotresh Hiremath Ravishankar <khire...@redhat.com> wrote: > Hi Thomas, This looks like it requires more inve

[ceph-users] Re: MDS stuck in "up:replay"

2023-01-18 Thread Thomas Widhalm
Thank you. I'm setting the debug level and awaiting authorization for the tracker. I'll upload the logs as soon as I can collect them. Thank you so much for your help. On 18.01.23 12:26, Kotresh Hiremath Ravishankar wrote: Hi Thomas, This looks like it requires more investigation than I expected. Wha

[ceph-users] Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Frank Schilder
Hi all, we are observing a problem on a libvirt virtualisation cluster that might come from ceph rbd clients. Something went wrong during execution of a live-migration operation and as a result we have two instances of the same VM running on 2 different hosts, the source- and the destination ho
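A quick sketch for inspecting which client currently holds the exclusive lock on an image and which clients are watching it; the pool and image names are placeholders:
```
# Show the current lock holder and the attached watchers for an RBD image.
rbd lock ls vm-pool/vm-disk-1
rbd status vm-pool/vm-disk-1
```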

[ceph-users] Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
Hi, We have a full SSD production cluster running on Pacific 16.2.10 and deployed with cephadm that is experiencing OSD flapping issues. Essentially, random OSDs will get kicked out of the cluster and then automatically brought back in a few times a day. As an example, let's take the case of

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Ilya Dryomov
On Wed, Jan 18, 2023 at 1:19 PM Frank Schilder wrote: > Hi all, we are observing a problem on a libvirt virtualisation cluster that might come from ceph rbd clients. Something went wrong during execution of a live-migration operation and as a result we have two instances of the same VM

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Danny Webb
Do you have any network congestion or packet loss on the replication network? Are you sharing NICs between public / replication? That is another metric that needs looking into. From: J-P Methot Sent: 18 January 2023 12:42 To: ceph-users Subject: [ceph-users] F

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Anthony D'Atri
This was my first thought as well, especially if the OSDs log something like “wrongly marked down”. It’s one of the reasons why I favor not having a replication network. > On Jan 18, 2023, at 8:28 AM, Danny Webb wrote: > Do you have any network congestion or packet loss on the replication n

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
At the network level we're using bonds (802.3ad). There are 2 NICs, each with two 25Gbps ports. One port per NIC is used for the public network, the other for the replication network. That gives a theoretical bandwidth of 50Gbps for each network. The network graph is showing me loa

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Frank Schilder
Hi Ilya, thanks a lot for the information. Yes, I was talking about the exclusive lock feature and was under the impression that only one rbd client can get write access on connect and will keep it until disconnect. The problem we are facing with multi-VM write access is that this will inevita

[ceph-users] Re: Ceph-ansible: add a new HDD to an already provisioned WAL device

2023-01-18 Thread Len Kimms
Hi Guillaume, thank you very much for the quick clarification and elaborate workaround. We’ll check if manual migration is feasible with our setup with respect to the time needed. Alternatively, we’re looking into completely redeploying all affected OSDs (i.e. shrinking the cluster with ceph-an

[ceph-users] Re: ceph orch osd spec questions

2023-01-18 Thread Wyll Ingersoll
In case anyone was wondering, I figured out the problem... This nasty bug in Pacific 16.2.10 https://tracker.ceph.com/issues/56031 - I think it is fixed in the upcoming .11 release and in Quincy. This bug causes the computed bluestore DB partition to be much smaller than it shoul

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Frank Schilder
Do you have CPU soft lock-ups around these times? We had these timeouts due to using the cfq/bfq disk schedulers with SSDs. The osd_op_tp thread timeout is typical when CPU lockups happen. Could be a sporadic problem with the disk IO path. Best regards, = Frank Schilder AIT Risø

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
There's nothing in the CPU graph that suggests soft lock-ups at these times. However, thank you for pointing out that the disk I/O scheduler could have an impact. Ubuntu seems to use mq-deadline by default, so we just switched to none, as I believe it fits our workload best. I don't know if th
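A minimal sketch of checking and switching the scheduler as described; the device name is a placeholder, and the change does not persist across reboots unless set via udev or the kernel command line:
```
# Show the current scheduler (the active one is in brackets), then switch.
cat /sys/block/sdX/queue/scheduler        # e.g. [mq-deadline] none
echo none > /sys/block/sdX/queue/scheduler
```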

[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Frank Schilder
I'm not sure what you're looking for in the CPU graph. If it's load or a similar metric, you will not see these lock-ups. You need to look into the syslog and search for them. If these warnings are there, it might give a clue as to what hardware component is causing them. They look something like "BUG:
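For example, the kernel warnings referred to above can be searched for like this (log paths depend on the distribution):
```
# Look for kernel soft-lockup messages around the flapping events.
journalctl -k | grep -i "soft lockup"
grep -i "soft lockup" /var/log/syslog
```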

[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Ilya Dryomov
On Wed, Jan 18, 2023 at 3:25 PM Frank Schilder wrote: > Hi Ilya, thanks a lot for the information. Yes, I was talking about the exclusive lock feature and was under the impression that only one rbd client can get write access on connect and will keep it until disconnect. The problem we

[ceph-users] MDS crash in "inotablev == mds->inotable->get_version()"

2023-01-18 Thread Kenny Van Alstyne
Hey all! I’ve run into an MDS crash on a cluster recently upgraded from Ceph 16.2.7 to 16.2.10. I’m hitting an assert nearly identical to this one gathered by the telemetry module: https://tracker.ceph.com/issues/54747 I have a new build compiling to test whether https://github.com/ce

[ceph-users] ceph quincy rgw openstack howto

2023-01-18 Thread Shashi Dahal
Hi, how can I set values for rgw_keystone_url and other related fields that are not possible to change via the GUI under cluster configuration? Ceph Quincy is deployed using cephadm. -- Cheers, Shashi
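A hedged sketch of setting such options from the CLI on a cephadm deployment; the URL, values and service name are placeholders:
```
# Options not exposed in the dashboard can still be set in the config database.
ceph config set client.rgw rgw_keystone_url https://keystone.example.com:5000
ceph config set client.rgw rgw_keystone_api_version 3
# Restart the RGW service so the daemons pick up the change
# (placeholder service name).
ceph orch restart rgw.<service-name>
```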

[ceph-users] Re: 17.2.5 ceph fs status: AssertionError

2023-01-18 Thread Robert Sander
On 18.01.23 at 10:12, Robert Sander wrote: root@cephtest20:~# ceph fs status Error EINVAL: Traceback (most recent call last):   File "/usr/share/ceph/mgr/mgr_module.py", line 1757, in _handle_command     return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)   File "/usr/share/ceph