[ceph-users] Re: [EXTERNAL] Re: Renaming a ceph node

2023-02-16 Thread Eugen Block
I'm glad it worked for you. We used the 'swap-bucket' command when we
needed to replace an OSD node without waiting for the draining of the
old one to finish and then for the backfilling of the new one. I
created a temporary bucket and moved the old host into it (ceph osd
crush move <host> <temporary bucket>); the data was still available.
Then I issued 'swap-bucket', which started to drain the old host directly to
the new host. This saved a lot of time and network traffic.
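
For illustration, a minimal sketch of that workflow (bucket and host
names are placeholders):

   ceph osd crush add-bucket temp root                 # temporary bucket outside the data root
   ceph osd crush move <host> root=temp                # park one of the hosts under it
   ceph osd crush swap-bucket <old-host> <new-host>    # swap their CRUSH positions; the old host drains to the new one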


Zitat von "Rice, Christian" :

Hi all, so I used the rename-bucket option this morning for OSD node  
renames, and it was a success.  Works great even on Luminous.


I looked at the swap-bucket command and felt it was geared toward
real data migration from old OSDs to new OSDs, and I was a bit timid
because there wasn’t a second host, just a name change.  When I
looked at rename-bucket, it seemed too simple not to try first, so I
did, and it worked.  I renamed two host buckets (they housed
discrete storage classes, so no dangerous loss of data redundancy),
and even some rack buckets.


sudo ceph osd crush rename-bucket <old-name> <new-name>

and no data moved.  I first thought I’d wait until the hosts were
shut down, but after I stopped the OSDs on the nodes it seemed safe
enough, and it was.


In my particular case, I was migrating nodes to a new
datacenter, with just new names and IPs.  I also moved a mon/mgr/rgw;
I merely had to delete the mon first, then reprovision it later.
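
For the mon part, that boils down to something like this before the move
(the name is a placeholder; the mon gets redeployed under its new name later):

   ceph mon remove <old-mon-name>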


The rgw and mgr worked fine.  I pre-edited ceph.conf to add the new
networks, remove the old mon name, and add the new mon name, so on
startup it worked.


I’m not a ceph admin but I play one on the tele.

From: Eugen Block 
Date: Wednesday, February 15, 2023 at 12:44 AM
To: ceph-users@ceph.io 
Subject: [EXTERNAL] [ceph-users] Re: Renaming a ceph node
Hi,

I haven't done this in a production cluster yet, only in small test
clusters without data. But there's a rename-bucket command:

ceph osd crush rename-bucket <srcname> <dstname>
  rename bucket <srcname> to <dstname>

It should do exactly that, just rename the bucket within the crushmap
without changing the ID. That command also exists in Luminous, I
believe. To get an impression of the impact I'd recommend testing in
a test cluster first.

Regards,
Eugen


Zitat von Manuel Lausch :


Hi,

yes, you can rename a node without massive rebalancing.

I tested the following with Pacific, but I think it should work with
older versions as well.
You need to rename the node in the crushmap between shutting down the
node with the old name and starting it with the new name.
You just have to keep the node's ID in the crushmap!
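
As an illustration, this can be done either with 'ceph osd crush
rename-bucket' or by editing the decompiled CRUSH map offline, keeping
the host bucket's id line unchanged:

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   # edit crush.txt: change the host bucket's name, keep its "id" line
   crushtool -c crush.txt -o crush.new
   ceph osd setcrushmap -i crush.new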

Regards
Manuel


On Mon, 13 Feb 2023 22:22:35 +
"Rice, Christian"  wrote:


Can anyone please point me at a doc that explains the most
efficient procedure to rename a ceph node WITHOUT causing a massive
misplaced objects churn?

When my node came up with a new name, it properly joined the
cluster and owned the OSDs, but the original node with no devices
remained.  I expect this affected the crush map such that a large
quantity of objects got reshuffled.  I want no object movement, if
possible.

BTW this old cluster is on luminous. ☹

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] how to sync data on two site CephFS

2023-02-16 Thread zxcs
Hi, Experts,

we already have a CephFS cluster, called A, and now we want to set up another 
CephFS cluster (called B) at another site.
We need to synchronize data between them for some directories (if all 
directories can be synchronized, even better). That means when we write a file 
in cluster A, it should automatically sync to cluster B, and when we create a 
file or directory in cluster B, it should automatically sync to cluster A.

Our question is: are there any best practices for doing that on CephFS?

Thanks in advance!


Thanks,
zx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how to sync data on two site CephFS

2023-02-16 Thread Eugen Block

Hi,

if you have Pacific or later running you might want to look into CephFS
mirroring [1]. Basically, it's about (asynchronous) snapshot mirroring:


For a given snapshot pair in a directory, cephfs-mirror daemon will  
rely on readdir diff to identify changes in a directory tree. The  
diffs are applied to directory in the remote file system thereby  
only synchronizing files that have changed between two snapshots.  
This feature is tracked here: https://tracker.ceph.com/issues/47034.
Currently, snapshot data is synchronized by bulk copying to the  
remote filesystem.
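
As a rough sketch of the setup (see [1] for the details; <fs_name>, the
site name and the bootstrap token are placeholders):

   # on the target (secondary) cluster
   ceph mgr module enable mirroring
   ceph fs snapshot mirror peer_bootstrap create <fs_name> client.mirror_remote <site-name>

   # on the source (primary) cluster
   ceph mgr module enable mirroring
   ceph orch apply cephfs-mirror                        # deploy the cephfs-mirror daemon
   ceph fs snapshot mirror enable <fs_name>
   ceph fs snapshot mirror peer_bootstrap import <fs_name> <token>
   ceph fs snapshot mirror add <fs_name> /path/to/dir   # directory to mirror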


Regards,
Eugen

[1] https://docs.ceph.com/en/latest/dev/cephfs-mirroring/

Zitat von zxcs :


Hi, Experts,

we already have a CephFS cluster, called A, and now we want to
set up another CephFS cluster (called B) at another site.
We need to synchronize data between them for some
directories (if all directories can be synchronized, even better).
That means when we write a file in cluster A, it should automatically sync to
cluster B, and when we create a file or directory in cluster B, it
should automatically sync to cluster A.


Our question is: are there any best practices for doing that on CephFS?

Thanks in advance!


Thanks,
zx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: how to sync data on two site CephFS

2023-02-16 Thread Robert Sander

Hi,

On 16.02.23 12:53, zxcs wrote:


we already have a CephFS cluster, called A, and now we want to set up another 
CephFS cluster (called B) at another site.
We need to synchronize data between them for some directories (if all 
directories can be synchronized, even better). That means when we write a file 
in cluster A, it should automatically sync to cluster B, and when we create a 
file or directory in cluster B, it should automatically sync to cluster A.



Ceph has CephFS snapshot mirroring: 
https://docs.ceph.com/en/latest/cephfs/cephfs-mirroring/


But this is a one-way mirror. It only supports A -> B.

You need a two-way sync. There is software like unison available for 
that task: https://en.wikipedia.org/wiki/Unison_(software)


If you do not have too many or too large directories you could let 
unison run regularly. But it will bail on conflicts, meaning it has to 
ask what to do if a file has been changed on both sides.
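
For illustration, a minimal unison invocation between the two mounted
file systems might look like this (paths and host are placeholders;
-batch propagates non-conflicting changes without prompting):

   unison /mnt/cephfs-a ssh://site-b-host//mnt/cephfs-b -batch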


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW archive zone lifecycle

2023-02-16 Thread J. Eric Ivancich
> On Feb 7, 2023, at 6:07 AM, ond...@kuuk.la wrote:
> 
> Hi,
> 
> I have two Ceph clusters in a multi-zone setup. The first one (master zone) 
> would be accessible to users for their interaction using RGW.
> The second one is set to sync from the master zone with the tier type of the 
> zone set as an archive (to version all files).
> 
> My question here is: is there an option to set a lifecycle for the version 
> files saved on the archive zone? For example, keep only 5 versions per file 
> or delete version files older than one year?
> 
> Thanks a lot.

It appears that this feature is in main, and I’m guessing it will likely be 
included in reef. There is a tracker for a backport to quincy, but no one has 
done it yet. See:

https://tracker.ceph.com/issues/53361
https://tracker.ceph.com/issues/56440
https://github.com/ceph/ceph/pull/46928

In the PR linked immediately above, looking at the additions to the file 
src/test/rgw/test_rgw_lc.cc, you can find this XML snippet:


   
  
  
[XML snippet mangled by the list archive: the element names were stripped,
leaving only the values "spongebob" and "squarepants" from the test fixture.]
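
For reference, the S3 lifecycle elements involved are
NoncurrentVersionExpiration with NewerNoncurrentVersions (and optionally
NoncurrentDays); an illustrative rule, not the exact snippet from the test
file, could look like:

   <LifecycleConfiguration>
     <Rule>
       <ID>keep-five-noncurrent</ID>
       <Filter><Prefix></Prefix></Filter>
       <Status>Enabled</Status>
       <NoncurrentVersionExpiration>
         <NewerNoncurrentVersions>5</NewerNoncurrentVersions>
         <NoncurrentDays>365</NoncurrentDays>
       </NoncurrentVersionExpiration>
     </Rule>
   </LifecycleConfiguration>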
  



Eric
(he/him)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: User + Dev monthly meeting happening tomorrow, Feb. 16th!

2023-02-16 Thread Laura Flores
There are no topics on the agenda, so I'm cancelling the meeting.

On Wed, Feb 15, 2023 at 11:55 AM Laura Flores  wrote:

> Hi Ceph Users,
>
> The User + Dev monthly meeting is coming up tomorrow, Thursday, Feb. 16th
> at 3:00 PM UTC.
>
> Please add any topics you'd like to discuss to the agenda:
> https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
> 
>
> See you there,
> Laura Flores
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage
>
> Red Hat Inc. 
>
> Chicago, IL
>
> lflo...@redhat.com
> M: +17087388804
>
>

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage

Red Hat Inc. 

Chicago, IL

lflo...@redhat.com
M: +17087388804
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: clt meeting summary [15/02/2023]

2023-02-16 Thread Nizamudeen A
Maybe an etherpad and pinning that to #sepia channel.


On Wed, Feb 15, 2023, 23:32 Laura Flores  wrote:

> I would be interested in helping catalogue errors and fixes we experience
> in the lab. Do we have a preferred platform for this cheatsheet?
>
> On Wed, Feb 15, 2023 at 11:54 AM Nizamudeen A  wrote:
>
>> Hi all,
>>
>> today's topics were:
>>
>>- Labs:
>>   - Keeping a catalog
>>   - Have a dedicated group to debug/work through the issues.
>>   - Looking for interested parties that would like to contribute in
>>   the lab maintenance tasks
>>   - Poll for meeting time, looking for a central person to follow up
>>   / organize
>>   - No one's been actively coordinating on the lab issues apart from
>>   Laura. David Orman volunteered if we need help coordinating the lab 
>> issues
>>- Reef release
>>   - [casey] things aren't looking good for end-of-february freeze
>>   - Since the whole thing depends on test-infra, can't really
>>   estimate the time frame.
>>   - The freeze may be delayed
>>- Dev Summit in Amsterdam: estimate how many would attend in person,
>>remote
>>- 50/50 of those present would attend (as per the voting)
>>   - Ad hoc virtual could work
>>- Need to update the component leads page:
>>https://ceph.io/en/community/team/
>>- Vikhyath volunteered before, so Josh will check with him.
>>
>>
>> Regards,
>> --
>>
>> Nizamudeen A
>>
>> Software Engineer
>>
>> Red Hat 
>> 
>> ___
>> Dev mailing list -- d...@ceph.io
>> To unsubscribe send an email to dev-le...@ceph.io
>>
>
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage
>
> Red Hat Inc. 
>
> Chicago, IL
>
> lflo...@redhat.com
> M: +17087388804
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW Service SSL HAProxy.cfg

2023-02-16 Thread Jimmy Spets

Hi

I am trying to set up the "High availability service for RGW" using SSL, 
both to the HAProxy and from the HAProxy to the RGW backend.
The SSL certificate gets applied to both HAProxy and the RGW. If I use 
the RGW instances directly they work as expected.


The RGW config is as follows:

service_type: rgw
service_id: rgw
service_name: rgw.rgw
placement:
  label: rgw
  count_per_host: 2
spec:
  ssl: true
  rgw_frontend_port: 6443
  rgw_frontend_ssl_certificate: |
    -----BEGIN CERTIFICATE-----
    -----END PRIVATE KEY-----

Ingress as follows:

service_type: ingress
service_id: rgw.rgw
placement:
  hosts:
  - cephrgw01
  - cephrgw02
  - cephrgw03
spec:
  backend_service: rgw.rgw
  virtual_ip: 172.16.1.130/16
  frontend_port: 443
  monitor_port: 1967
  ssl_cert: |
    -----BEGIN CERTIFICATE-----
    -----END CERTIFICATE-----

The issue is that the haproxy.cfg gets generated like this, without SSL 
enabled on the backends:


# This file is generated by cephadm.
global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/lib/haproxy/haproxy.pid
    maxconn     8000
    daemon
    stats socket /var/lib/haproxy/stats

defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout queue           20s
    timeout connect         5s
    timeout http-request    1s
    timeout http-keep-alive 5s
    timeout client          1s
    timeout server          1s
    timeout check           5s
    maxconn                 8000

frontend stats
    mode http
    bind 172.16.1.130:1967
    bind localhost:1967
    stats enable
    stats uri /stats
    stats refresh 10s
    stats auth admin:abcdefg
    http-request use-service prometheus-exporter if { path /metrics }
    monitor-uri /health

frontend frontend
    bind 172.16.1.130:443 ssl crt /var/lib/haproxy/haproxy.pem
    default_backend backend

backend backend
    option forwardfor
    balance static-rr
    option httpchk HEAD / HTTP/1.0
    server rgw.rgw.cephrgw01.euvqmd 172.16.1.131:6443 check weight 100
    server rgw.rgw.cephrgw01.aphsnx 172.16.1.131:6444 check weight 100
    server rgw.rgw.cephrgw02.ovckaw 172.16.1.132:6443 check weight 100
    server rgw.rgw.cephrgw02.jevtrb 172.16.1.132:6444 check weight 100
    server rgw.rgw.cephrgw03.gzdame 172.16.1.133:6443 check weight 100
    server rgw.rgw.cephrgw03.bchspq 172.16.1.133:6444 check weight 100


This of course does not work, as the backends use SSL.

Is there some configuration that I have missed or should I file a bug 
report?


/Jimmy
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW Service SSL HAProxy.cfg

2023-02-16 Thread Marc
> 
> # This file is generated by cephadm.
> global
> log127.0.0.1 local2
> chroot/var/lib/haproxy
> pidfile/var/lib/haproxy/haproxy.pid
> maxconn8000
> daemon
> stats socket /var/lib/haproxy/stats
> 
> defaults
> modehttp
> logglobal
> optionhttplog
> optiondontlognull
> option http-server-close
> option forwardforexcept 127.0.0.0/8
> optionredispatch
>      retries3
> timeout queue20s
> timeout connect5s
> timeout http-request1s
> timeout http-keep-alive 5s
> timeout client1s
> timeout server1s
> timeout check5s
> maxconn8000
> 
> frontend stats
> mode http
> bind 172.16.1.130:1967
> bind localhost:1967
> stats enable
> stats uri /stats
> stats refresh 10s
> stats auth admin:abcdefg
> http-request use-service prometheus-exporter if { path /metrics }
> monitor-uri /health
> 
> frontend frontend
> bind 172.16.1.130:443 ssl crt /var/lib/haproxy/haproxy.pem
> default_backend backend
> 
> backend backend
> option forwardfor
> balance static-rr
> option httpchk HEAD / HTTP/1.0
> server rgw.rgw.cephrgw01.euvqmd 172.16.1.131:6443 check weight 100
> server rgw.rgw.cephrgw01.aphsnx 172.16.1.131:6444 check weight 100
> server rgw.rgw.cephrgw02.ovckaw 172.16.1.132:6443 check weight 100
> server rgw.rgw.cephrgw02.jevtrb 172.16.1.132:6444 check weight 100
> server rgw.rgw.cephrgw03.gzdame 172.16.1.133:6443 check weight 100
> server rgw.rgw.cephrgw03.bchspq 172.16.1.133:6444 check weight 100
> 
> 
> This of course does not work as the backend use SSL.
> 
> Is there some configuration that I have missed or should I file a bug
> report?

Can this be because of your http check on https? Maybe you have to add ssl at 
the server lines as well? I have this:


option httpchk GET /swift/healthcheck
  ..
  server-template rgw2 1 _https._rgw2.prod.xxx ssl
  server-template rgw1 1 _https._rgw1.prod.xxx ssl
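
By analogy, the generated backend server lines would presumably need ssl
(and a verify option) appended, e.g. (illustrative only, not what cephadm
currently emits):

  server rgw.rgw.cephrgw01.euvqmd 172.16.1.131:6443 check weight 100 ssl verify none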


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] forever stuck "slow ops" osd

2023-02-16 Thread Arvid Picciani
Hi,

Today our entire cluster froze, or to be specific, anything that uses librbd.
ceph version 16.2.10

The message that saved me was "256 slow ops, oldest one blocked for
2893 sec, osd.7 has slow ops" , because it makes it immediately clear
that this osd is the issue.

I stopped the osd, which made the cluster available again. Restarting
the osd makes it stuck again, although that osd has nothing in the
error log, and the underlying ssd is healthy. It's just that one out
of 27. There's nothing unique about it. We use the same disk product
in other osds, and the host is also running other osds just fine.

How does this happen, and why can the cluster not recover from this
automatically? For example by stopping the affected osd or at least
having a timeout for ops.

Thanks



-- 
+4916093821054
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: forever stuck "slow ops" osd

2023-02-16 Thread Eugen Block
Have you tried to dump the stuck ops from that OSD? It could point to
a misbehaving client; I believe there was a thread about that recently
on this list. I don't have the exact command right now, but check
(within cephadm shell) 'ceph daemon osd.7 help' for the 'dump' options.
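
For example, something along these lines (assuming these admin socket
commands are available in that release):

   ceph daemon osd.7 dump_ops_in_flight
   ceph daemon osd.7 dump_blocked_ops
   ceph daemon osd.7 dump_historic_slow_ops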


Zitat von Arvid Picciani :


Hi,

today our entire cluster froze. or anything that uses librbd to be specific.
ceph version 16.2.10

The message that saved me was "256 slow ops, oldest one blocked for
2893 sec, osd.7 has slow ops" , because it makes it immediately clear
that this osd is the issue.

I stopped the osd, which made the cluster available again. Restarting
the osd makes it stuck again, although that osd has nothing in the
error log, and the underlying ssd is healthy. It's just that one out
of 27. There's nothing unique about it. We use the same disk product
in other osds, and the host is also running other osds just fine.

How does this happen, and why can the cluster not recover from this
automatically? For example by stopping the affected osd or at least
having a timeout for ops.

Thanks



--
+4916093821054
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] RGW cannot list or create openidconnect providers

2023-02-16 Thread mat
Hello,

I'm attempting to set up an OpenID Connect provider with RGW. I'm doing this 
using the boto3 API & Python. However, it seems that the APIs are failing in 
some unexpected ways because radosgw was not set up correctly. There is sample 
code below, and yes, I know there are "secrets" in it - but this is an offline 
test lab so I am fine with this.

The first error shows this in the logs.

2023-02-16T00:45:26.860-0500 7fe19fef7700  1 == starting new request 
req=0x7fe2ccb54680 =
2023-02-16T00:45:26.904-0500 7fe19def3700  0 req 17562030806519127926 
0.044000439s ERROR: listing filtered objects failed: OIDC pool: 
default.rgw.meta: oidc_url.: (2) No such file or directory
2023-02-16T00:45:26.904-0500 7fe19aeed700  1 == req done req=0x7fe2ccb54680 
op status=-2 http_status=404 latency=0.044000439s ==
2023-02-16T00:45:26.904-0500 7fe19aeed700  1 beast: 0x7fe2ccb54680: 
10.20.104.178 - authentik [16/Feb/2023:00:45:26.860 -0500] "POST / HTTP/1.1" 
404 189 - "Boto3/1.26.71 Python/3.11.1 Linux/6.0.6-76060006-generic 
Botocore/1.29.72" - latency=0.044000439s

So the object "oidc_url" is missing from the "default.rgw.meta" pool?

rados --pool default.rgw.meta ls --all
users.uid       root.buckets
users.uid       authentik.buckets
root            test4
root            .bucket.meta.test2:3866fac0-854b-48b5-b3b7-bf84a166a404.1165645.1
users.keys      ZVBTLTYRRPY7JU39WOR9
users.uid       authentik
users.uid       cephadmin
users.keys      NIVIV0JSKD9D2LDC3IH4
users.uid       root
users.email     tes...@lab.dev
users.keys      L70QT3LN71SQXWHS97Y4
root            .bucket.meta.test:3866fac0-854b-48b5-b3b7-bf84a166a404.1204730.1
root            .bucket.meta.test4:3866fac0-854b-48b5-b3b7-bf84a166a404.1204730.2
root            test
root            test2

Well the object is clearly not there and I do not know how to fix this.

The second error produces this error in the log:

2023-02-16T01:11:29.304-0500 7fe1976e6700  1 == starting new request 
req=0x7fe2ccb54680 =
2023-02-16T01:11:29.312-0500 7fe18c6d0700  1 == req done req=0x7fe2ccb54680 
op status=-22 http_status=400 latency=0.00883s ==
2023-02-16T01:11:29.312-0500 7fe18c6d0700  1 beast: 0x7fe2ccb54680: 
10.20.104.178 - authentik [16/Feb/2023:01:11:29.304 -0500] "POST / HTTP/1.1" 
400 189 - "Boto3/1.26.71 Python/3.11.1 Linux/6.0.6-76060006-generic 
Botocore/1.29.72" - latency=0.00883s

It's much less clear what is going on here, it just returns 400. Boto raises 
this exception: "botocore.exceptions.ClientError: An error occurred (Unknown) 
when calling the CreateOpenIDConnectProvider operation: Unknown".

Has anyone seen this before, and does anyone know how to set up the correct 
objects for OpenID Connect?

Version info
==
ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy (stable)


Examples below
==

# creating the client works fine - I can see my user authenticate in the 
radosgw logs
access_key_id = 'L70QT3LN71SQXWHS97Y4'
secret_access_key = 'QEXLa5V0Zm38068n3goDtm8V6WlaDwxVmAq9W2XV'
iam = boto3.client('iam',
  aws_access_key_id=access_key_id,
  aws_secret_access_key=secret_access_key,
  region_name="default",
  endpoint_url="https://s3.lab";)

# First error
providers_response = iam.list_open_id_connect_providers()

# Second Error
oidc_response = iam.create_open_id_connect_provider(
  # Issuer URL
  Url="https://login.lab/application/o/d7d64496e26c156ca9ea0802c5d7ed1c/";,
  ClientIDList=['authentik'],
  
ThumbprintList=['BDCC44F40254E7E1258DA4698833FFE2E8AECA3D3799044D8A1F97F7DFF20511'])
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Extremally need help. Openshift cluster is down :c

2023-02-16 Thread kreept . sama
Here we enabled MDS debug logging to stdout:
ceph tell mds.gml-okd-cephfs-a config set debug_mds 20/0 

...
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 mds.0.server reply to stat 
on client_request(client.66426408:170 lookup 
#0x101/csi-vol-91510028-3e45-11ec-9461-0a580a82014a 
2023-02-16T09:49:56.266338+ caller_uid=0, caller_gid=0{}) v5
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.server 
respond_to_request batch head request(client.66426408:170 nref=3 
cr=0x558f9a150580)
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 respond: responding to batch 
ops with result=0: [batch front=request(client.66426408:170 nref=3 
cr=0x558f9a150580)]
debug 2023-02-16T09:49:56.265+ 7f0462329700  7 mds.0.server 
reply_client_request 0 ((0) Success) client_request(client.66426408:170 lookup 
#0x101/csi-vol-91510028-3e45-11ec-9461-0a580a82014a 
2023-02-16T09:49:56.266338+ caller_uid=0, caller_gid=0{}) v5
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 mds.0.server 
apply_allocated_inos 0x0 / [] / 0x0
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.server lat 0.000551
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.server set_trace_dist 
snapid head
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 mds.0.server set_trace_dist 
snaprealm snaprealm(0x10001be180e seq 1 lc 0 cr 1 cps 2 snaps={} 
past_parent_snaps= 0x558f9982a200) len=96
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 
mds.0.cache.ino(0x101)  pfile 0 pauth 0 plink 0 pxattr 0 plocal 0 ctime 
2022-12-26T14:29:05.859667+ valid=1
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 
mds.0.cache.ino(0x101) encode_inodestat issuing pAsLsXsFs seq 56
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 
mds.0.cache.ino(0x101) encode_inodestat caps pAsLsXsFs seq 56 mseq 0 
xattrv 0
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.server set_trace_dist 
added diri [inode 0x101 [...c,head] /volumes/csi/ auth v119902405 f(v0 
m2022-12-26T14:29:05.859667+ 125=3+122) n(v9165118 
rc2023-01-03T22:32:29.158670+ b264347773624 1549126=1354219+194907) 
old_inodes=1 (isnap sync r=1) (iversion lock) caps={66426408=pAsLsXsFs/-@56} | 
request=0 lock=1 dirfrag=1 caps=1 dirty=1 authpin=0 0x558f99801600]
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.server set_trace_dist 
added dir  [dir 0x101 /volumes/csi/ [2,head] auth v=400540078 
cv=400540078/400540078 ap=0+2 state=1074003969|complete f(v0 
m2022-12-26T14:29:05.859667+ 125=3+122) n(v9165118 
rc2023-01-03T22:32:29.158670+ b264347773624 1549125=1354219+194906) 
hs=125+0,ss=0+0 | child=1 waiter=0 authpin=0 0x558f9aa22d80]
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.locker 
issue_client_lease no/null lease on [dentry 
#0x1/volumes/csi/csi-vol-91510028-3e45-11ec-9461-0a580a82014a [2,head] auth (dn 
sync r=1) (dversion lock) pv=0 v=400540078 ap=1 ino=0x10001be180e 
state=1073741824 | request=1 lock=1 inodepin=1 authpin=1 0x558f99828780]
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.server set_trace_dist 
added dn   head [dentry 
#0x1/volumes/csi/csi-vol-91510028-3e45-11ec-9461-0a580a82014a [2,head] auth (dn 
sync r=1) (dversion lock) pv=0 v=400540078 ap=1 ino=0x10001be180e 
state=1073741824 | request=1 lock=1 inodepin=1 authpin=1 0x558f99828780]
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 
mds.0.cache.ino(0x10001be180e)  pfile 0 pauth 0 plink 0 pxattr 0 plocal 0 ctime 
2021-11-05T14:35:05.441183+ valid=1
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 
mds.0.cache.ino(0x10001be180e) add_client_cap first cap, joining realm 
snaprealm(0x10001be180e seq 1 lc 0 cr 1 cps 2 snaps={} past_parent_snaps= 
0x558f9982a200)
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 
mds.0.cache.ino(0x10001be180e) encode_inodestat issuing pAsLsXsFs seq 1
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 
mds.0.cache.ino(0x10001be180e) encode_inodestat caps pAsLsXsFs seq 1 mseq 0 
xattrv 1
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 
mds.0.cache.ino(0x10001be180e) including xattrs version 1
debug 2023-02-16T09:49:56.265+ 7f0462329700 20 mds.0.server set_trace_dist 
added in   [inode 0x10001be180e [...2,head] 
/volumes/csi/csi-vol-91510028-3e45-11ec-9461-0a580a82014a/ auth v388627147 ap=1 
snaprealm=0x558f9982a200 f(v0 m2021-11-05T14:35:05.441183+ 2=1+1) n(v29796 
rc2022-12-21T19:28:27.124662+ b57179 36=24+12) (iauth sync r=1) (ilink sync 
r=1) (isnap sync r=1) (ifile sync r=1) (ixattr sync r=1) (iversion lock) 
caps={66426408=pAsLsXsFs/-@1} | request=1 lock=5 dirfrag=1 caps=1 
openingsnapparents=0 authpin=1 0x558f99825b80]
debug 2023-02-16T09:49:56.265+ 7f0462329700 10 mds.0.350627 
send_message_client client.66426408 10.25.1.17:0/1669277120 
client_reply(???:170 = 0 (0) Success) v1
debug 2023-02-16T09:49:56.265+ 7f0462329700  7 mds.0.cache request_finish 
request(client.66426408:170 nref=3 cr=0x558f9a150580)
debug 2023-02-16T09:49:56.265+ 7f0462

[ceph-users] Re: Extremally need help. Openshift cluster is down :c

2023-02-16 Thread kreept . sama
And one more, for the in-memory log level:
ceph tell mds.gml-okd-cephfs-a config set debug_mds 0/20
These are the logs from the active MDS:
...
debug 2023-02-16T09:54:39.906+ 7f0460b26700 10 mds.0.cache |__ 0auth 
[dir 0x100 ~mds0/ [2,head] auth v=1619006913 cv=1619006913/1619006913 
dir_auth=0 state=1073741825|complete f(v0 10=0+10) n(v5122951 
rc2023-01-03T22:31:26.260887+ b8320360 1131=1097+34)/n(v5122951 
rc2023-01-03T22:31:26.236886+ b8122960 1092=1058+34) hs=10+0,ss=0+0 | 
child=1 subtree=1 subtreetemp=0 waiter=0 authpin=0 0x558f9aa22480]
debug 2023-02-16T09:54:39.906+ 7f0460b26700 10 mds.0.cache |__ 0auth 
[dir 0x1 / [2,head] auth v=94872547 cv=0/0 dir_auth=0 state=1610874881|complete 
f(v1 m2021-07-31T21:13:24.403917+ 3=0+3) n(v3 
rc2023-01-03T22:32:29.158670+ b264347801454 1549250=1354337+194913) 
hs=3+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=0 
authpin=0 0x558f9aa22000]
debug 2023-02-16T09:54:39.906+ 7f0460b26700 10 mds.0.cache 
find_stale_fragment_freeze
debug 2023-02-16T09:54:39.906+ 7f0460b26700 10 mds.0.snap check_osd_map - 
version unchanged
debug 2023-02-16T09:54:39.906+ 7f0460b26700 20 mds.0.350627 updating export 
targets, currently 0 ranks are targets
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 mds.0.350627 get_session 
have 0x558f98893900 client.66407209 10.25.1.17:0/1432178834 state open
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_client_metrics: 
mds.metrics: session=0x558f98893900
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=READ_LATENCY, session=0x558f98893900, latency=0.098249
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=WRITE_LATENCY, session=0x558f98893900, latency=0.00
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=METADATA_LATENCY, session=0x558f98893900, latenc]y=0.399851
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=CAP_INFO, session=0x558f98893900, hits=9580, misses=36
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=DENTRY_LEASE, session=0x558f98893900, hits=0, misses=2063
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=OPENED_FILES, session=0x558f98893900, opened_files=0, total_inodes=114
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=PINNED_ICAPS, session=0x558f98893900, pinned_icaps=114, total_inodes=114
debug 2023-02-16T09:54:40.130+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=OPENED_INODES, session=0x558f98893900, opened_inodes=18446744073709551435, 
total_inodes=114
debug 2023-02-16T09:54:40.191+ 7f0460b26700 20 mds.0.350627 get_task_status
debug 2023-02-16T09:54:40.191+ 7f0460b26700 20 mds.0.350627 
schedule_update_timer_task
debug 2023-02-16T09:54:40.619+ 7f045e321700 20 mds.0.cache upkeep thread 
trimming cache; last trim 1.001014352s ago
debug 2023-02-16T09:54:40.619+ 7f045e321700 10 mds.0.cache 
trim_client_leases
debug 2023-02-16T09:54:40.619+ 7f045e321700  7 mds.0.cache trim 
bytes_used=4MB limit=4GB reservation=0.05% count=0
debug 2023-02-16T09:54:40.619+ 7f045e321700  7 mds.0.cache trim_lru 
trimming 0 items from LRU size=1670 mid=1081 pintail=0 pinned=125
debug 2023-02-16T09:54:40.619+ 7f045e321700  7 mds.0.cache trim_lru trimmed 
0 items
debug 2023-02-16T09:54:40.619+ 7f045e321700  2 mds.0.cache Memory usage:  
total 510776, rss 79592, heap 356604, baseline 356604, 114 / 1673 inodes have 
caps, 114 caps, 0.0681411 caps per inode
debug 2023-02-16T09:54:40.619+ 7f045e321700  7 mds.0.server 
recall_client_state: min=100 max=1048576 total=114 flags=0xa
debug 2023-02-16T09:54:40.619+ 7f045e321700  7 mds.0.server recalled 0 
client caps.
debug 2023-02-16T09:54:40.619+ 7f045e321700 20 mds.0.cache upkeep thread 
waiting interval 1.0s
debug 2023-02-16T09:54:41.130+ 7f0464b2e700 20 mds.0.350627 get_session 
have 0x558f98893900 client.66407209 10.25.1.17:0/1432178834 state open
debug 2023-02-16T09:54:41.131+ 7f0464b2e700 20 handle_client_metrics: 
mds.metrics: session=0x558f98893900
debug 2023-02-16T09:54:41.131+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=READ_LATENCY, session=0x558f98893900, latency=0.098249
debug 2023-02-16T09:54:41.131+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=WRITE_LATENCY, session=0x558f98893900, latency=0.00
debug 2023-02-16T09:54:41.131+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=METADATA_LATENCY, session=0x558f98893900, latenc]y=0.399851
debug 2023-02-16T09:54:41.131+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=CAP_INFO, session=0x558f98893900, hits=9580, misses=36
debug 2023-02-16T09:54:41.131+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=DENTRY_LEASE, session=0x558f98893900, hits=0, misses=2063
debug 2023-02-16T09:54:41.131+ 7f0464b2e700 20 handle_payload: mds.metrics: 
type=

[ceph-users] Re: Extremally need help. Openshift cluster is down :c

2023-02-16 Thread kreept . sama
And we found this when the active MDS starts booting.
conf: 
[mds]
debug_mds = 0/20
debug_mds_balancer = 1

debug 2023-02-16T10:25:15.393+ 7fd58cbc6780  0 set uid:gid to 167:167 
(ceph:ceph)
debug 2023-02-16T10:25:15.393+ 7fd58cbc6780  0 ceph version 16.2.4 
(3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable), process ceph-mds, 
pid 1
debug 2023-02-16T10:25:15.395+ 7fd58cbc6780  0 pidfile_write: ignore empty 
--pid-file
starting mds.gml-okd-cephfs-a at
debug 2023-02-16T10:28:02.642+ 7fd575aef700  0 mds.0.journaler.pq(ro) 
_finish_read got error -2
debug 2023-02-16T10:28:02.642+ 7fd575aef700 -1 mds.0.purge_queue _recover: 
Error -2 recovering write_pos
debug 2023-02-16T10:28:02.671+ 7fd575aef700 -1 mds.0.350650 unhandled write 
error (2) No such file or directory, force readonly...
debug 2023-02-16T10:28:02.671+ 7fd575aef700  0 log_channel(cluster) log 
[WRN] : force file system read-only
debug 2023-02-16T10:28:02.671+ 7fd575aef700  0 mds.0.journaler.pq(ro) 
_finish_read got error -2
debug 2023-02-16T10:28:02.671+ 7fd5742ec700  0 mds.0.cache creating system 
inode with ino:0x100
debug 2023-02-16T10:28:02.672+ 7fd5742ec700  0 mds.0.cache creating system 
inode with ino:0x1
debug 2023-02-16T10:28:02.780+ 7fd5732ea700  0 mds.0.350650 boot error 
forcing transition to read-only; MDS will try to continue
debug 2023-02-16T10:28:08.265+ 7fd5782f4700 -1 mds.pinger is_rank_lagging: 
rank=0 was never sent ping request.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW Service SSL HAProxy.cfg

2023-02-16 Thread Jimmy Spets
I forgot to add that the Ceph version is 17.2.5, managed with cephadm.

/Jimmy
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph noout vs ceph norebalance, which is better for minor maintenance

2023-02-16 Thread Konstantin Shalygin
Hi Will,

All our clusters have had the noout flag set by default since cluster birth. The reasons:

* if a rebalance were to start due to EDAC or SFP degradation, it is faster to have 
DC engineers fix the issue and put the node back to work

* noout prevents unwanted OSD fills and running out of space => outage of 
services

* without noout, the 'OSD down' (broken disk) Prometheus alert would resolve itself 
once the OSD is marked out, because the UP state of an OSD in the metrics world is an 
expression of (in + up). We need the alert to keep firing for humans, for disk replacement 🙂
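
As a minimal sketch of the maintenance flow around a single OSD (the id is
a placeholder):

   ceph osd add-noout osd.<id>        # or cluster-wide: ceph osd set noout
   systemctl stop ceph-osd@<id>       # do the maintenance / firmware update
   systemctl start ceph-osd@<id>
   ceph osd rm-noout osd.<id>         # or: ceph osd unset noout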


Also hope this helps!

k
Sent from my iPhone

> On 16 Feb 2023, at 06:30, William Konitzer  wrote:
> Hi Dan,
> 
> I appreciate the quick response. In that case, would something like this be 
> better, or is it overkill!?
> 
> 1. ceph osd add-noout osd.x #mark out for recovery operations
> 2. ceph osd add-noin osd.x #prevent rebalancing onto the OSD
> 3. kubectl -n rook-ceph scale deployment rook-ceph-osd--* --replicas=0 
> #disable OSD
> 4. ceph osd down osd.x #prevent it from data placement and recovery operations
> 5. Upgrade the firmware on OSD
> 6. ceph osd up osd.x
> 7. kubectl -n rook-ceph scale deployment rook-ceph-osd--* --replicas=1
> 8. ceph osd rm-noin osd.x
> 9. ceph osd rm-noout osd.x
> 
> Thanks,
> Will
> 
> 
>> On Feb 15, 2023, at 5:05 PM, Dan van der Ster  wrote:
>> 
>> Sorry -- Let me rewrite that second paragraph without overloading the
>> term "rebalancing", which I recognize is confusing.
>> 
>> ...
>> 
>> In your case, where you want to perform a quick firmware update on the
>> drive, you should just use noout.
>> 
>> Without noout, the OSD will be marked out after 5 minutes and objects
>> will be re-replicated to other OSDs -- those degraded PGs will move to
>> "backfilling" state and copy the objects on new OSDs.
>> 
>> With noout, the cluster won't start backfilling/recovering, but don't
>> worry -- this won't block IO. What happens is the disk that is having
>> its firmware upgraded will be marked "down", and IO will be accepted
>> and logged by its peers, so that when the disk is back "up" it can
>> replay ("recover") those writes to catch up.
>> 
>> 
>> The norebalance flag only impacts data movement for PGs that are not
>> degraded -- no OSDs are down. This can be useful to pause backfilling
>> e.g. when you are adding or removing hosts to a cluster.
>> 
>> -- dan
>> 
>> On Wed, Feb 15, 2023 at 2:58 PM Dan van der Ster  wrote:
>>> Hi Will,
>>> There are some misconceptions in your mail.
>>> 1. "noout" is a flag used to prevent the down -> out transition after
>>> an osd is down for several minutes. (Default 5 minutes).
>>> 2. "norebalance" is a flag used to prevent objects from being
>>> backfilling to a different OSD *if the PG is not degraded*.
>>> In your case, where you want to perform a quick firmware update on the
>>> drive, you should just use noout.
>>> Without noout, the OSD will be marked out after 5 minutes and data
>>> will start rebalancing to other OSDs.
>>> With noout, the cluster won't start rebalancing. But this won't block
>>> IO -- the disk being repaired will be "down" and IO will be accepted
>>> and logged by its peers, so that when the disk is back "up" it can
>>> replay those writes to catch up.
>>> Hope that helps!
>>> Dan
>>> On Wed, Feb 15, 2023 at 1:12 PM  wrote:
 Hi,
 We have a discussion going on about which is the correct flag to use for 
 some maintenance on an OSD, should it be "noout" or "norebalance"? This 
 was sparked because we need to take an OSD out of service for a short 
 while to upgrade the firmware.
 One school of thought is:
 - "ceph norebalance" prevents automatic rebalancing of data between OSDs, 
 which Ceph does to ensure all OSDs have roughly the same amount of data.
 - "ceph noout" on the other hand prevents OSDs from being marked as 
 out-of-service during maintenance, which helps maintain cluster 
 performance and availability.
 - Additionally, if another OSD fails while the "norebalance" flag is set, 
 the data redundancy and fault tolerance of the Ceph cluster may be 
 compromised.
 - So if we're going to maintain the performance and reliability we need to 
 set the "ceph noout" flag to prevent the OSD from being marked as OOS 
 during maintenance and allow the automatic data redistribution feature of 
 Ceph to work as intended.
 The other opinion is:
 - With the noout flag set, Ceph clients are forced to think that OSD 
 exists and is accessible - so they continue sending requests to such OSD. 
 The OSD also remains in the crush map without any signs that it is 
 actually out. If an additional OSD fails in the cluster with the noout 
 flag set, Ceph is forced to continue thinking that this new failed OSD is 
 OK. It leads to stalled or delayed response from the OSD side to clients.
 - Norebalance instead takes into account the in/out OSD status, but 
 prevents data rebalance. Clients are 

[ceph-users] ceph-osd@86.service crashed at a random time.

2023-02-16 Thread luckydog xf
Hello, lists.

 I have a 108-OSD Ceph cluster. All OSDs work fine except one, OSD-86.
 ceph-osd@86.service stopped working at a random time.
 The disk looks normal when checked with `smartctl -a`.
 It can be fine for a few days after I restart it, then it goes wrong
again.

 I paste  the related log here. It stopped at 05:26 UTC.
---
2023-02-17T05:26:37.795+ 7ff525846700  0 log_channel(cluster) log [DBG]
: 17.df scrub starts
2023-02-17T05:26:37.799+ 7ff525846700  0 log_channel(cluster) log [DBG]
: 17.df scrub ok
2023-02-17T05:26:38.779+ 7ff527049700  0 log_channel(cluster) log [DBG]
: 2.64 scrub starts
2023-02-17T05:26:38.783+ 7ff527049700  0 log_channel(cluster) log [DBG]
: 2.64 scrub ok
2023-02-17T05:26:38.871+ 7ff526848700  1 osd.86 pg_epoch: 113734
pg[20.115( v 113733'56242916 (113711'56240668,113733'56242916]
local-lis/les=113726/113727 n=1113 ec=440/440 lis/c=113726/113726
les/c/f=113727/113727/0 sis=113734) [105,86,97] r=1 lpr=113734
pi=[113726,113734)/1 luod=0'0 lua=113730'56242903 crt=113733'56242916 lcod
113733'56242915 mlcod 0'0 active mbc={}] start_peering_interval up
[105,86,97] -> [105,86,97], acting [105,97] -> [105,86,97], acting_primary
105 -> 105, up_primary 105 -> 105, role -1 -> 1, features acting
4540138292840890367 upacting 4540138292840890367
2023-02-17T05:26:38.871+ 7ff526848700  1 osd.86 pg_epoch: 113734
pg[20.115( v 113733'56242916 (113711'56240668,113733'56242916]
local-lis/les=113726/113727 n=1113 ec=440/440 lis/c=113726/113726
les/c/f=113727/113727/0 sis=113734) [105,86,97] r=1 lpr=113734
pi=[113726,113734)/1 crt=113733'56242916 lcod 113733'56242915 mlcod 0'0
unknown NOTIFY mbc={}] state: transitioning to Stray
2023-02-17T05:26:55.075+ 7ff52784a700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7ff52784a700 thread_name:tp_osd_tp

 ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus
(stable)
 1: (()+0x14420) [0x7ff54448a420]
 2: (BlueStore::ExtentMap::decode_some(ceph::buffer::v15_2_0::list&)+0x31d)
[0x561eeca36ebd]
 3: (BlueStore::ExtentMap::fault_range(KeyValueDB*, unsigned int, unsigned
int)+0x241) [0x561eeca3de21]
 4: (BlueStore::_do_read(BlueStore::Collection*,
boost::intrusive_ptr, unsigned long, unsigned long,
ceph::buffer::v15_2_0::list&, unsigned int, unsigned long)+0x153)
[0x561eeca4ae53]
 5: (BlueStore::read(boost::intrusive_ptr&,
ghobject_t const&, unsigned long, unsigned long,
ceph::buffer::v15_2_0::list&, unsigned int)+0x233) [0x561eeca4bf63]
 6: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap&,
ScrubMapBuilder&, ScrubMap::object&)+0x2b5) [0x561eec873235]
 7: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x35f)
[0x561eec6f2b6f]
 8: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t,
hobject_t, bool, ThreadPool::TPHandle&)+0x8b) [0x561eec5aa00b]
 9: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x14c8) [0x561eec5bc648]
 10: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0x31b) [0x561eec5be67b]
 11: (ceph::osd::scheduler::PGScrub::run(OSD*, OSDShard*,
boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x16) [0x561eec7876b6]
 12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x4db) [0x561eec51724b]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403)
[0x561eecbd5353]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x561eecbd8154]
 15: (()+0x8609) [0x7ff54447e609]
 16: (clone()+0x43) [0x7ff5443a3133]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- begin dump of recent events ---
 -7193> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command assert hook 0x561ef68ea610
 -7192> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command abort hook 0x561ef68ea610
 -7191> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command leak_some_memory hook 0x561ef68ea610
 -7190> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command perfcounters_dump hook 0x561ef68ea610
 -7189> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command 1 hook 0x561ef68ea610
 -7188> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command perf dump hook 0x561ef68ea610
 -7187> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command perfcounters_schema hook 0x561ef68ea610
 -7186> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command perf histogram dump hook 0x561ef68ea610
 -7185> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command 2 hook 0x561ef68ea610
 -7184> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command perf schema hook 0x561ef68ea610
 -7183> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command perf histogram schema hook 0x561ef68ea610
 -7182> 2023-02-17T05:26:23.928+ 7ff5440ded80  5 asok(0x561ef699)
register_command p

[ceph-users] ceph-iscsi-cli: cannot remove duplicated gateways.

2023-02-16 Thread luckydog xf
Hi, please see the output below.
ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is the one that got
messed up with a wrong hostname; I want to delete it.

/iscsi-target...-igw/gateways> ls
o- gateways 
..
[Up: 2/3, Portals: 3]
  o- ceph-iscsi-gw-1.ipa.pthl.hk
.
[172.16.202.251 (UP)]
  o- ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
.. [172.16.202.251
(UNAUTHORIZED)]
  o- ceph-iscsi-gw-2.ipa.pthl.hk
.
[172.16.202.252 (UP)]

/iscsi-target...-igw/gateways> delete
gateway_name=ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
confirm=true
Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
Could not contact ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain. If
the gateway is permanently down. Use confirm=true to force removal.
WARNING: Forcing removal of a gateway that can still be reached by an
initiator may result in data corruption.
/iscsi-target...-igw/gateways>
/iscsi-target...-igw/gateways> delete
gateway_name=ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
confirm=true
Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
Failed : Unhandled exception: list.remove(x): x not in list

However  ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is still there.
Version info is ceph-iscsi-3.5-1.el8cp.noarch on RHEL 8.4.

/iscsi-target...-igw/gateways> ls
o- gateways 
..
[Up: 2/3, Portals: 3]
  o- ceph-iscsi-gw-1.ipa.pthl.hk
.
[172.16.202.251 (UP)]
  o- ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
... [172.16.202.251
(UNKNOWN)]
  o- ceph-iscsi-gw-2.ipa.pthl.hk
.
[172.16.202.252 (UP)]
/iscsi-target...-igw/gateways> delete
ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain confirm=true
Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
Failed : Unhandled exception: list.remove(x): x not in list

However  ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is still there.


Please help, thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW cannot list or create openidconnect providers

2023-02-16 Thread Pritha Srivastava
Hi,

Have you added oidc-provider caps to the user that is trying to create/list
OpenID Connect providers, in your case the user with the access key
'L70QT3LN71SQXWHS97Y4'? (https://docs.ceph.com/en/quincy/radosgw/oidc/)
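
If not, a minimal sketch (the uid is a placeholder):

   radosgw-admin caps add --uid=<uid> --caps="oidc-provider=*"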

Thanks,
Pritha

On Fri, Feb 17, 2023 at 4:54 AM  wrote:

> Hello,
>
> I'm attempting to setup an OpenIDConnect provider with RGW. I'm doing this
> using the boto3 API & Python. However it seems that the APIs are failing in
> some unexpected ways because radosgw was not setup correctly. There is
> sample code below, and yes, I know there are "secrets" in it - but this is
> an offline test lab so I am fine with this.
>
> The first error shows this in the logs.
>
> 2023-02-16T00:45:26.860-0500 7fe19fef7700  1 == starting new request
> req=0x7fe2ccb54680 =
> 2023-02-16T00:45:26.904-0500 7fe19def3700  0 req 17562030806519127926
> 0.044000439s ERROR: listing filtered objects failed: OIDC pool:
> default.rgw.meta: oidc_url.: (2) No such file or directory
> 2023-02-16T00:45:26.904-0500 7fe19aeed700  1 == req done
> req=0x7fe2ccb54680 op status=-2 http_status=404 latency=0.044000439s ==
> 2023-02-16T00:45:26.904-0500 7fe19aeed700  1 beast: 0x7fe2ccb54680:
> 10.20.104.178 - authentik [16/Feb/2023:00:45:26.860 -0500] "POST /
> HTTP/1.1" 404 189 - "Boto3/1.26.71 Python/3.11.1
> Linux/6.0.6-76060006-generic Botocore/1.29.72" - latency=0.044000439s
>
> So the object "oidc_url" is missing from the "default.rgw.meta" pool?
>
> rados --pool default.rgw.meta ls --all
> users.uid   root.buckets
> users.uid   authentik.buckets
> roottest4
> root.bucket.meta.test2:3866fac0-854b-48b5-b3b7-bf84a166a404.1165645.1
> users.keys  ZVBTLTYRRPY7JU39WOR9
> users.uid   authentik
> users.uid   cephadmin
> users.keys  NIVIV0JSKD9D2LDC3IH4
> users.uid   root
> users.email tes...@lab.dev
> users.keys  L70QT3LN71SQXWHS97Y4
> root.bucket.meta.test:3866fac0-854b-48b5-b3b7-bf84a166a404.1204730.1
> root.bucket.meta.test4:3866fac0-854b-48b5-b3b7-bf84a166a404.1204730.2
> roottest
> roottest2
>
> Well the object is clearly not there and I do not know how to fix this.
>
> The second error produces this error in the log:
>
> 2023-02-16T01:11:29.304-0500 7fe1976e6700  1 == starting new request
> req=0x7fe2ccb54680 =
> 2023-02-16T01:11:29.312-0500 7fe18c6d0700  1 == req done
> req=0x7fe2ccb54680 op status=-22 http_status=400 latency=0.00883s ==
> 2023-02-16T01:11:29.312-0500 7fe18c6d0700  1 beast: 0x7fe2ccb54680:
> 10.20.104.178 - authentik [16/Feb/2023:01:11:29.304 -0500] "POST /
> HTTP/1.1" 400 189 - "Boto3/1.26.71 Python/3.11.1
> Linux/6.0.6-76060006-generic Botocore/1.29.72" - latency=0.00883s
>
> Its much less clear what is going on here, it just returns 400. Boto
> raises this exception, "botocore.exceptions.ClientError: An error occurred
> (Unknown) when calling the CreateOpenIDConnectProvider operation: Unknown".
>
> Has anyone seen this before and know how to setup the correct objects for
> OpenidConnect?
>
> Version info
> ==
> ceph version 17.2.5 (e04241aa9b639588fa6c864845287d2824cb6b55) quincy
> (stable)
>
>
> Examples below
> ==
>
> # creating the client works fine - I can see my user authenticate in the
> radosgw logs
> access_key_id = 'L70QT3LN71SQXWHS97Y4'
> secret_access_key = 'QEXLa5V0Zm38068n3goDtm8V6WlaDwxVmAq9W2XV'
> iam = boto3.client('iam',
>   aws_access_key_id=access_key_id,
>   aws_secret_access_key=secret_access_key,
>   region_name="default",
>   endpoint_url="https://s3.lab";)
>
> # First error
> providers_response = iam.list_open_id_connect_providers()
>
> # Second Error
> oidc_response = iam.create_open_id_connect_provider(
>   # Issuer URL
>   Url="https://login.lab/application/o/d7d64496e26c156ca9ea0802c5d7ed1c/";,
>   ClientIDList=['authentik'],
>
> ThumbprintList=['BDCC44F40254E7E1258DA4698833FFE2E8AECA3D3799044D8A1F97F7DFF20511'])
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io