[ceph-users] Re: Data loss on appends, prod outage

2021-09-08 Thread Frank Schilder
Hi Nathan, thanks for the update. This seems to be a different and worse instance than the CentOS 7 case. We are using CentOS 8 Stream for a few clients. I will check if they are affected. Thanks and best regards, Frank Schilder, AIT Risø Campus, Bygning 109, rum S14

[ceph-users] Re: Data loss on appends, prod outage

2021-09-08 Thread Frank Schilder
Can you make the devs aware of the regression? Best regards, Frank Schilder, AIT Risø Campus, Bygning 109, rum S14 From: Nathan Fish Sent: 08 September 2021 19:33 To: ceph-users Subject: [ceph-users] Re: Data loss on appends, prod outage

[ceph-users] ceph fs re-export with or without NFS async option

2021-09-08 Thread Frank Schilder
Hi all, I have a question about a ceph fs re-export via nfsd. For NFS v4 mounts the exports option sync is now the default instead of async. I have just found that using async gives more than a factor of 10 performance improvement. I couldn't find any advice within the Ceph community informat
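
For context, the sync/async choice is made per export in /etc/exports on the re-exporting nfsd host. A minimal sketch, assuming a kernel-mounted CephFS at /mnt/cephfs and an illustrative client subnet (neither is taken from the message):

    # /etc/exports on the nfsd host re-exporting CephFS
    # "sync" (the current default) makes nfsd commit writes before replying;
    # replacing it with "async" replies before data is stable, which is much
    # faster but can lose acknowledged writes if the server crashes.
    /mnt/cephfs  192.168.0.0/24(rw,sync,no_subtree_check)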

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Konstantin Shalygin
I think it's just a compat with legacy (v1) clusters. In the kernel it is the same. Your cluster already has msgr2 enabled, you don't need any compats k Sent from my iPhone > On 8 Sep 2021, at 22:53, Tony Liu wrote: > > Good to know. Thank you Konstantin! > Will test it out. > Is this some kno

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Tony Liu
Good to know. Thank you Konstantin! Will test it out. Is this some known issue? Any tracker or fix? Thanks! Tony From: Konstantin Shalygin Sent: September 8, 2021 12:47 PM To: Tony Liu Cc: ceph-users@ceph.io; d...@ceph.io Subject: Re: [ceph-users] debug RB

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Konstantin Shalygin
Try to simplify it to [global] fsid = 35d050c0-77c0-11eb-9242-2cea7ff9d07c mon_host = 10.250.50.80:3300,10.250.50.81:3300,10.250.50.82:3300 and try again. We have found that with msgr2-only enabled clusters, clients with mon_host settings without a hardcoded 3300 port may time out from time to
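
Formatted, the simplified client ceph.conf suggested in the message would look like this (values copied from the message):

    [global]
    fsid     = 35d050c0-77c0-11eb-9242-2cea7ff9d07c
    mon_host = 10.250.50.80:3300,10.250.50.81:3300,10.250.50.82:3300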

[ceph-users] Re: tcmu-runner crashing on 16.2.5

2021-09-08 Thread Paul Giralt (pgiralt)
Thank you Xiubo. confirm=true worked and I was able to update via gwcli and then get everything reset back to normal again. I’m stable for now but still hoping that this fix can get in soon to make sure the crash doesn’t happen again. Appreciate all your help on this. -Paul On Sep 6, 2021, a

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Tony Liu
Here it is. [global] fsid = 35d050c0-77c0-11eb-9242-2cea7ff9d07c mon_host = [v2:10.250.50.80:3300/0,v1:10.250.50.80:6789/0] [v2:10.250.50.81:3300/0,v1:10.250.50.81:6789/0] [v2:10.250.50.82:3300/0,v1:10.250.50.82:6789/0] Thanks! Tony From: Konstantin Sha

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Konstantin Shalygin
In the previous email I asked you to show your ceph.conf... k > On 8 Sep 2021, at 22:20, Tony Liu wrote: > > Sorry Konstantin, I didn't get it. Could you elaborate a bit?

[ceph-users] Re: Kworker 100% with ceph-msgr (after upgrade to 14.2.6?)

2021-09-08 Thread Marc
Hi Samuel, I am not really that fortunate to be in an environment that would allow me to trace and resolve incidents with a Ceph production cluster quickly before clients start complaining. So I am more or less forced to choose a path that is least likely to fail. RBD has been around longer than c

[ceph-users] Re: Data loss on appends, prod outage

2021-09-08 Thread Nathan Fish
The bug appears to have already been reported: https://tracker.ceph.com/issues/51948 Also, it should be noted that the write append bug does sometimes occur when writing from a single client, so controlling write patterns is not sufficient to stop data loss. On Wed, Sep 8, 2021 at 1:39 PM Frank S

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Konstantin Shalygin
This may be just a connection string problem k > On 8 Sep 2021, at 19:59, Tony Liu wrote: > > That's what I am trying to figure out, "what exactly could cause a timeout". > User creates 10 VMs (boot on volume and an attached volume) by Terraform, > then destroy them. Repeat the same, it works

[ceph-users] Re: Ceph dashboard pointing to the wrong grafana server address in iframe

2021-09-08 Thread Paul Giralt (pgiralt)
Thanks Ernesto. ceph dashboard set-grafana-api-url fixed the problem. I’m not sure how it got set to the wrong server (I am using cephadm and I’m the only administrator) but at least it’s fixed now, so appreciate the help. -Paul On Sep 8, 2021, at 1:45 PM, Ernesto Puerta epuer...@redh

[ceph-users] Re: Ceph dashboard pointing to the wrong grafana server address in iframe

2021-09-08 Thread Ernesto Puerta
Hi Paul, You can check the currently set value with: [1] $ ceph mgr dashboard get-dashboard-api-url In some set-ups (multi-homed, proxied, ...), you might also need to set the user-facing IP: [2] $ ceph dashboard set-grafana-frontend-api-url If you're running a Cephadm-deployed cl
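
For reference, the Grafana-related dashboard commands look roughly like this; the URLs are placeholders and the exact get/set sub-command names can vary slightly between releases:

    # URL the mgr dashboard uses to reach the Grafana API
    ceph dashboard set-grafana-api-url https://<grafana-host>:3000
    # optional user-facing URL for proxied / multi-homed setups
    ceph dashboard set-grafana-frontend-api-url https://<public-grafana-host>:3000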

[ceph-users] Re: Data loss on appends, prod outage

2021-09-08 Thread Nathan Fish
Rolling back to kernel 5.4 has resolved the issue. On Tue, Sep 7, 2021 at 3:51 PM Frank Schilder wrote: > > Hi Nathan, > > > Is this the bug you are referring to? https://tracker.ceph.com/issues/37713 > > yes, it's one of them. I believe there were more such reports. > > > The main prod filesystem

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Tony Liu
That's what I am trying to figure out, "what exactly could cause a timeout". User creates 10 VMs (boot on volume and an attached volume) by Terraform, then destroy them. Repeat the same, it works fine most times, timeout happens sometimes at different places, volume creation or volume deletion. Sin

[ceph-users] Ceph dashboard pointing to the wrong grafana server address in iframe

2021-09-08 Thread Paul Giralt (pgiralt)
For some reason, the Grafana dashboards in the Ceph dashboard are all pointing to a node that does not and has never run the Grafana / Prometheus services. I’m not sure where this value is kept and how to change it back. My two manager nodes are 10.122.242.196 and 10.122.242.198. For some reason, t

[ceph-users] Re: Bucket deletion is very slow.

2021-09-08 Thread mhnx
Hello again. I'm back with a different question. If a bucket has "fill_status": "OVER 100.00%", do I need to use the --inconsistent-index parameter? --inconsistent-index: when specified with bucket deletion and bypass-gc set to true, ignores bucket index consistency. mhnx
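
For reference, the flags in question belong to radosgw-admin bucket removal; a hedged sketch (the bucket name is a placeholder, and --inconsistent-index is only safe when nothing else is operating on the bucket):

    radosgw-admin bucket rm --bucket=<bucket-name> \
        --purge-objects --bypass-gc --inconsistent-index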

[ceph-users] Re: Edit crush rule

2021-09-08 Thread Konstantin Shalygin
Just create a new one with your failure domain and switch the pool rule. Then delete the old rule. k Sent from my iPhone > On 8 Sep 2021, at 01:11, Budai Laszlo wrote: > > Thank you for your answers. Yes, I'm aware of this option, but this is not > changing the failure domain of an existing rule.
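
A minimal sketch of that sequence; the rule names, pool name, and failure domain are illustrative assumptions:

    # create a replacement rule with the desired failure domain
    ceph osd crush rule create-replicated new-rule default host
    # switch the pool to the new rule
    ceph osd pool set <pool-name> crush_rule new-rule
    # delete the old rule once no pool references it
    ceph osd crush rule rm old-rule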

[ceph-users] Re: ceph jobs

2021-09-08 Thread Janne Johansson
On Wed, 8 Sep 2021 at 16:32, Sage Weil wrote: > Hi everyone, > We set up a pad to collect Ceph-related job listings. If you're > looking for a job, or have a Ceph-related position to advertise, take > a look: > https://pad.ceph.com/p/jobs Thanks. One position added in Scandinavia. -- May the

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Konstantin Shalygin
What is the ceph.conf for this RBD client? k Sent from my iPhone > On 7 Sep 2021, at 19:54, Tony Liu wrote: > > > I have OpenStack Ussuri and Ceph Octopus. Sometimes, I see timeouts when creating > or deleting volumes. I can see RBD timeouts from cinder-volume. Has anyone seen such > an issue? I'd lik

[ceph-users] ceph jobs

2021-09-08 Thread Sage Weil
Hi everyone, We set up a pad to collect Ceph-related job listings. If you're looking for a job, or have a Ceph-related position to advertise, take a look: https://pad.ceph.com/p/jobs sage

[ceph-users] Re: Cephadm not properly adding / removing iscsi services anymore

2021-09-08 Thread Paul Giralt (pgiralt)
Thanks for the tip. I’ve just been using ‘docker exec -it /bin/bash’ to get into the containers, but those commands sound useful. I think I’ll install cephadm on all nodes just for this. Thanks again, -Paul > On Sep 8, 2021, at 10:11 AM, Eugen Block wrote: > > Okay, I'm glad it worked! >

[ceph-users] Re: Cephadm not properly adding / removing iscsi services anymore

2021-09-08 Thread Eugen Block
Okay, I'm glad it worked! At first I tried cephadm rm-daemon on the bootstrap node that I usually do all management from and it indicated that it could not remove the daemon: [root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b

[ceph-users] Re: radosgw manual deployment

2021-09-08 Thread Eugen Block
Hi, I checked our environment (Nautilus) where I enabled the RGW dashboard integration. Please note that we don't use RGW ourselves heavily and I don't have access to our customer's RGWs, so this might look different for an actual prod environment. Anyway, to get it up and running it co

[ceph-users] Re: Cephadm not properly adding / removing iscsi services anymore

2021-09-08 Thread Paul Giralt (pgiralt)
Thanks Eugen. At first I tried cephadm rm-daemon on the bootstrap node that I usually do all management from and it indicated that it could not remove the daemon: [root@cxcto-c240-j27-01 ~]# cephadm rm-daemon --name iscsi.cxcto-c240-j27-04.lgqtxo --fsid 4a29e724-c4a6-11eb-b14a-5c838f8013a5 ER

[ceph-users] Re: [Ceph Upgrade] - Rollback Support during Upgrade failure

2021-09-08 Thread Matthew Vernon
Hi, On 06/09/2021 08:37, Lokendra Rathour wrote: Thanks, Matthew, for the update. The upgrade failed for some random weird reasons. Checking further, Ceph's status shows that "Ceph health is OK" and at times it gives certain warnings, but I think that is ok. OK... but what if we see the Versio

[ceph-users] Re: cephfs_metadata pool unexpected space utilization

2021-09-08 Thread Eugen Block
I assume the cluster is used in roughly the same way as before the upgrade and the load has not increased since, correct? What is the usual load? Can you share some 'ceph daemonperf mds.' output? It might be unrelated, but have you tried to compact the OSDs belonging to this pool, online or
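
For reference, the commands being suggested look roughly like this (daemon names are placeholders):

    # watch live performance counters of the MDS
    ceph daemonperf mds.<name>
    # trigger an online compaction of a specific OSD's RocksDB
    ceph tell osd.<id> compact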

[ceph-users] Re: [EXTERNAL] Re: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

2021-09-08 Thread Dave Piper
We've started hitting this issue again, despite having the bitmap allocator configured. The logs just before the crash look similar to before (pasted below). So perhaps this isn't a hybrid allocator issue after all? I'm still struggling to collect the full set of diags / run ceph-bluestore-tool c
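
For context, "having the bitmap allocator configured" usually means something along these lines (a sketch; it takes effect when the OSDs restart):

    # switch BlueStore away from the hybrid allocator for all OSDs
    ceph config set osd bluestore_allocator bitmap
    # or equivalently in ceph.conf:
    # [osd]
    # bluestore_allocator = bitmap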

[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-08 Thread David Orman
I forgot to mention, the progress not updating is a separate bug; you can fail the mgr (ceph mgr fail ceph1a.guidwn in your example) to resolve that. On the monitor side, I assume you deployed using labels? If so, just remove the label from the host where the monitor did not start, let it fully un
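
A sketch of the two suggested steps; the mgr name is the one from the example above, while the hostname and the "mon" label are assumptions based on a typical cephadm label-based deployment:

    # fail over the active mgr so the progress module refreshes
    ceph mgr fail ceph1a.guidwn
    # remove the mon label so cephadm tears down the stuck monitor on that host
    ceph orch host label rm <hostname> mon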

[ceph-users] Re: ceph progress bar stuck and 3rd manager not deploying

2021-09-08 Thread David Orman
This sounds a lot like: https://tracker.ceph.com/issues/51027 which is fixed in https://github.com/ceph/ceph/pull/42690 David On Tue, Sep 7, 2021 at 7:31 AM mabi wrote: > > Hello > > I have a test ceph octopus 16.2.5 cluster with cephadm out of 7 nodes on > Ubuntu 20.04 LTS bare metal. I just u

[ceph-users] Re: Cephadm not properly adding / removing iscsi services anymore

2021-09-08 Thread Eugen Block
If you only configured one iSCSI gateway but you see three running, have you tried to destroy them with 'cephadm rm-daemon --name ...'? On the active MGR host run 'journalctl -f' and you'll see plenty of information; it should also contain information about the iSCSI deployment. Or run 'cephadm logs -
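
For reference, the commands mentioned here look roughly like this; the daemon name and fsid are placeholders, and rm-daemon has to be run on the host that actually carries the daemon:

    # remove a stray iSCSI gateway daemon
    cephadm rm-daemon --name iscsi.<host>.<id> --fsid <cluster-fsid>
    # inspect its logs
    cephadm logs --name iscsi.<host>.<id>
    # on the active MGR host, follow the orchestrator's activity
    journalctl -f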

[ceph-users] Re: debug RBD timeout issue

2021-09-08 Thread Eugen Block
Hi, from an older cloud version I remember having to increase these settings: [DEFAULT] block_device_allocate_retries = 300 block_device_allocate_retries_interval = 10 block_device_creation_timeout = 300 The question is what exactly could cause a timeout. You write that you only see these ti
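
Formatted, the settings from the message look like this; they were recalled from an older OpenStack release, so the exact file and section (typically nova.conf on the compute side) should be double-checked against the deployed version:

    [DEFAULT]
    block_device_allocate_retries          = 300
    block_device_allocate_retries_interval = 10
    block_device_creation_timeout          = 300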

[ceph-users] Re: Kworker 100% with ceph-msgr (after upgrade to 14.2.6?)

2021-09-08 Thread huxia...@horebdata.cn
Dear Marc, Is there a specific reason for "not to use the cephfs for important things"? What are the major concerns then? Thanks, samuel huxia...@horebdata.cn From: Marc Date: 2021-09-07 20:37 To: Frank Schilder CC: ceph-users Subject: [ceph-users] Re: Kworker 100% with ceph-msgr (after upgrad