[ceph-users] Re: Pull failed on cluster upgrade

2024-08-06 Thread Nicola Mori
I think I found the problem. Setting the cephadm log level to debug and 
then watching the logs during the upgrade:


  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph -W cephadm --watch-debug

I found this line just before the error:

  ceph: stderr Fatal glibc error: CPU does not support x86-64-v2

The same error comes out if I try to launch the container manually on 
the culprit machine, so I'd say it's a dead end. What I don't understand 
is how such a big change, like dropping support for an old CPU 
architecture, was introduced in a point release (18.2.2 works fine), but 
maybe I'm just missing some basic info about Ceph's versioning scheme.
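
For anyone else hitting this: x86-64-v2 essentially requires the cx16, 
lahf_lm, pni (SSE3), popcnt, sse4_1, sse4_2 and ssse3 CPU flags, so a quick 
check on a host looks something like this (just a sketch; the second command 
assumes a reasonably recent glibc):

   grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | \
     grep -E -x 'cx16|lahf_lm|pni|popcnt|sse4_1|sse4_2|ssse3' | sort -u

   /lib64/ld-linux-x86-64.so.2 --help | grep x86-64-v2   # glibc >= 2.33 only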


Anyway, now I'm stuck with 3 daemons running 18.2.4 and the others still 
on 18.2.2. The cluster looks happy and I see no malfunction. Can I leave 
it in this state with no risk? If not, is it safe to roll back the 3 
upgraded daemons to the previous version?
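
If rolling back turns out to be the answer, these are the commands I would 
try (just a sketch; I'd appreciate confirmation that a per-daemon downgrade 
like this is actually safe):

   ceph orch upgrade stop
   ceph orch ps                  # note the daemons already on 18.2.4
   ceph orch daemon redeploy <daemon.name> --image quay.io/ceph/ceph:v18.2.2
   ceph versions                 # confirm everything reports 18.2.2 again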

Thanks again,

Nicola


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW sync gets stuck every day

2024-08-06 Thread Olaf Seibert

Hi all,

we have some Ceph clusters with RGW replication between them. For at 
least the last month, replication gets stuck at around the same time 
roughly every day. Not 100% the same time, and not every single day, but 
in recent days it seems to happen more often, and for longer.


With "stuck" I mean that the "oldest incremental change not applied" is 
getting 5 or more minutes old, and not changing. In the past this seemed 
to resolve itself in a short time, but recently it didn't. It remained 
stuck at the same place for several hours. Also, on several different 
occasions I noticed that the shard number in question was the same.


We are using Ceph 18.2.2, image id 719d4c40e096.

The output on one end looks like this (I redacted out some of the data 
because I don't know how much of the naming would be sensitive information):


root@zone2:/# radosgw-admin sync status --rgw-realm backup
  realm ----8ddf4576ebab (backup)
  zonegroup ----58af9051e063 (backup)
   zone ----e1223ae425a4 (zone2-backup)
   current time 2024-08-04T10:22:00Z
zonegroup features enabled: resharding
   disabled: compress-encrypted
  metadata sync no sync (zone is master)
  data sync source: ----e8db1c51b705 (zone1-backup)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 3 shards
behind shards: [30,90,95]
oldest incremental change not applied: 
2024-08-04T10:05:54.015403+ [30]


while on the other side it looks ok (not more than half a minute behind):

root@zone1:/# radosgw-admin sync status --rgw-realm backup
  realm ----8ddf4576ebab (backup)
  zonegroup ----58af9051e063 (backup)
   zone ----e8db1c51b705 (zone1-backup)
   current time 2024-08-04T10:23:05Z
zonegroup features enabled: resharding
   disabled: compress-encrypted
  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: ----e1223ae425a4 (zone2-backup)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 4 shards
behind shards: [89,92,95,98]
oldest incremental change not applied: 
2024-08-04T10:22:53.175975+ [95]



With some experimenting, we found that redeploying the RGWs on this side 
resolves the situation: "ceph orch redeploy rgw.zone1-backup". The 
shards go into "Recovering" state and after a short time it is "caught 
up with source" as well.


Redeploying stuff seems like a much too big hammer to get things going 
again. Surely there must be something more reasonable?


Also, any ideas about how we can find out what is causing this? It may 
be that some customer has some job running every 24 hours, but that 
shouldn't cause the replication to get stuck.
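
In case it helps, these are the things I was planning to look at next on 
the stuck side (a sketch; realm name as above, and I'm not 100% sure of 
the exact flags on our version):

   radosgw-admin sync error list --rgw-realm backup
   radosgw-admin datalog status --rgw-realm backup
   radosgw-admin data sync status --source-zone <source-zone> --rgw-realm backup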


Thanks in advance,

--
Olaf Seibert
Site Reliability Engineer

SysEleven GmbH
Boxhagener Straße 80
10245 Berlin

T +49 30 233 2012 0
F +49 30 616 7555 0

https://www.syseleven.de
https://www.linkedin.com/company/syseleven-gmbh/

Current system status always at:
https://www.syseleven-status.net/

Company headquarters: Berlin
Registered court: AG Berlin Charlottenburg, HRB 108571 Berlin
Managing directors: Andreas Hermann, Jens Ihlenfeld, Norbert Müller, 
Jens Plogsties

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Resize RBD - New size not compatible with object map

2024-08-06 Thread Torkil Svensgaard

Hi

[ceph: root@ceph-flash1 /]# rbd info rbd_ec/projects
rbd image 'projects':
size 750 TiB in 196608000 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 15a979db61dda7
data_pool: rbd_ec_data
block_name_prefix: rbd_data.10.15a979db61dda7
format: 2
features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten, data-pool

op_features:
flags:
create_timestamp: Thu Jul  7 10:57:13 2022
access_timestamp: Thu Jul  7 10:57:13 2022
modify_timestamp: Thu Jul  7 10:57:13 2022

We wanted to resize it to 1PB but that failed:

[ceph: root@ceph-flash1 /]# rbd resize rbd_ec/projects --size 1024T
Resizing image: 0% complete...failed.
rbd: shrinking an image is only allowed with the --allow-shrink flag
2024-08-06T08:42:01.053+ 7fc996492580 -1 librbd::Operations: New 
size not compatible with object map


We can do 800T though:

[ceph: root@ceph-flash1 /]# rbd resize rbd_ec/projects --size 800T
Resizing image: 100% complete...done.

A problem with the --size 1024T notation? Or are we hitting some sort of 
size limit for RBD?


Mvh.

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Resize RBD - New size not compatible with object map

2024-08-06 Thread Ilya Dryomov
On Tue, Aug 6, 2024 at 11:55 AM Torkil Svensgaard  wrote:
>
> Hi
>
> [ceph: root@ceph-flash1 /]# rbd info rbd_ec/projects
> rbd image 'projects':
>  size 750 TiB in 196608000 objects
>  order 22 (4 MiB objects)
>  snapshot_count: 0
>  id: 15a979db61dda7
>  data_pool: rbd_ec_data
>  block_name_prefix: rbd_data.10.15a979db61dda7
>  format: 2
>  features: layering, exclusive-lock, object-map, fast-diff,
> deep-flatten, data-pool
>  op_features:
>  flags:
>  create_timestamp: Thu Jul  7 10:57:13 2022
>  access_timestamp: Thu Jul  7 10:57:13 2022
>  modify_timestamp: Thu Jul  7 10:57:13 2022
>
> We wanted to resize it to 1PB but that failed:
>
> [ceph: root@ceph-flash1 /]# rbd resize rbd_ec/projects --size 1024T
> Resizing image: 0% complete...failed.
> rbd: shrinking an image is only allowed with the --allow-shrink flag
> 2024-08-06T08:42:01.053+ 7fc996492580 -1 librbd::Operations: New
> size not compatible with object map
>
> We can do 800T though:
>
> [ceph: root@ceph-flash1 /]# rbd resize rbd_ec/projects --size 800T
> Resizing image: 100% complete...done.
>
> A problem with the --1024T notation? Or we hitting some sort of size
> limit for RBD?

Hi Torkil,

The latter -- the object-map feature is limited to ~256 million objects,
which with 4M objects works out to be ~976T.  For anything larger, the
object-map feature can be disabled with the "rbd feature disable" command
(you would need to temporarily unmap the image first if it's mapped).
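
For reference, the arithmetic behind that limit, plus the concrete disable 
step for this image (fast-diff depends on object-map, so the two are 
disabled together); treat this as a sketch:

   # 256,000,000 objects * 4 MiB per object, expressed in TiB
   echo $(( 256000000 * 4 / 1024 / 1024 ))   # -> 976

   rbd feature disable rbd_ec/projects object-map fast-diff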

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Resize RBD - New size not compatible with object map

2024-08-06 Thread Torkil Svensgaard



On 06/08/2024 12:37, Ilya Dryomov wrote:

On Tue, Aug 6, 2024 at 11:55 AM Torkil Svensgaard  wrote:


Hi

[ceph: root@ceph-flash1 /]# rbd info rbd_ec/projects
rbd image 'projects':
  size 750 TiB in 196608000 objects
  order 22 (4 MiB objects)
  snapshot_count: 0
  id: 15a979db61dda7
  data_pool: rbd_ec_data
  block_name_prefix: rbd_data.10.15a979db61dda7
  format: 2
  features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten, data-pool
  op_features:
  flags:
  create_timestamp: Thu Jul  7 10:57:13 2022
  access_timestamp: Thu Jul  7 10:57:13 2022
  modify_timestamp: Thu Jul  7 10:57:13 2022

We wanted to resize it to 1PB but that failed:

[ceph: root@ceph-flash1 /]# rbd resize rbd_ec/projects --size 1024T
Resizing image: 0% complete...failed.
rbd: shrinking an image is only allowed with the --allow-shrink flag
2024-08-06T08:42:01.053+ 7fc996492580 -1 librbd::Operations: New
size not compatible with object map

We can do 800T though:

[ceph: root@ceph-flash1 /]# rbd resize rbd_ec/projects --size 800T
Resizing image: 100% complete...done.

A problem with the --1024T notation? Or we hitting some sort of size
limit for RBD?


Hi Torkil,


Hi Ilya


The latter -- the object-map feature is limited to ~256 million objects,
which with 4M objects works out to be ~976T.  For anything larger, the
object-map feature can be disabled with the "rbd feature disable" command
(would need to temporarily unmap the image if it's mapped).


Thanks, we'll do that.

Mvh.

Torkil


Thanks,

 Ilya


--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recovering from total mon loss and backing up lockbox secrets

2024-08-06 Thread Boris
Hi,

I am in the process of creating disaster recovery documentation and I have
two topics where I am not sure how to do it or even if it is possible.

Is it possible to recover from a 100% mon data loss? Like all mons fail and
the actual mon data is not recoverable.

In my head I would think that I can just create new mons with the same
cluster ID and then start everything. The OSDs still have their PGs and
data, and after some period of time everything will be OK again.

But then I thought that we use dmcrypt in Ceph, and I would need to somehow
back up all the keys to some offsite location.

So here are my questions:
- How do I backup the lockbox secrets?
- Do I need to backup the whole mon data, and if so how can I do it?

Cheers
 Boris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Osds going down/flapping after Luminous to Nautilus upgrade part 1

2024-08-06 Thread Eugen Block

Hi,

the upgrade notes for Nautilus [0] contain this section:

Running nautilus OSDs will not bind to their v2 address  
automatically. They must be restarted for that to happen.
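
In practice that's something like this (a sketch, assuming msgr2 is enabled 
cluster-wide):

   ceph mon enable-msgr2              # if not already done during the upgrade
   systemctl restart ceph-osd.target  # per host, one host at a time
   ceph osd dump | grep -c 'v2:'      # OSDs should now advertise a v2 address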


Regards,
Eugen

[0] https://docs.ceph.com/en/latest/releases/nautilus/#instructions

Zitat von Mark Kirkwood :

We have upgraded one of our Ceph clusters to Nautilus. We have run  
into 2 issues that are causing osds to flap. I'll cover the 1st one  
here; we solved this one, but it raises an interesting question that  
might bear on the 2nd one (I will post that next).


After upgrading deb packages to Nautilus and restarting the mons and  
mgrs we worked through restarting the osds. We started to see some  
of them flap and saw this in the osd log (many times):


2024-07-31 11:03:33.264 7f22ab6e0700  0 --1-  
[v2:[2404:130:8020:5::73]:6820/220732,v1:[2404:130:8020:5::73]:6821/220732]  
>> v1:[2404:130:8020:5::103]:6909/2987374 conn(0x555a5180  
0x555a4172c800 :-1 s=OPENED pgs=144 cs=3 l=0).fault initiating  
reconnect


And later (usually a single line):

2404:130:8020:5::135]:6903/2933993 conn(0x555ab7b03200  
0x555a099fa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=154 cs=4  
l=0).handle_connect_reply_2 connect got BADAUTHORIZER


Examining the code showed:

markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l  
"initiating reconnect" {} \;

./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc
./src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ vi src/msg/async/ProtocolV2.cc
markir@zmori:/download/ceph/src/ceph$ find . -type f -exec grep -l  
"got BADAUTHORIZER" {} \;

./src/msg/simple/Pipe.cc
./src/msg/async/ProtocolV1.cc

This led us to suspect that the osds were using the v1 msgr  
protocol (ceph osd dump seems to validate this). We hoped that once  
we enabled the v2 msgr this error would vanish, and that appears to  
have happened.


So my question is this: looks like there is something wrong with  
communications via v1 protocol post upgrade - is that expected?


Regards

Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recovering from total mon loss and backing up lockbox secrets

2024-08-06 Thread Christian Rohmann

On 06.08.24 1:19 PM, Boris wrote:

I am in the process of creating disaster recovery documentation and I have
two topics where I am not sure how to do it or even if it is possible.

Is it possible to recover from a 100% mon data loss? Like all mons fail and
the actual mon data is not recoverable.

In my head I would thing that I can just create new mons with the same
cluster ID and then start everything. The OSDs still have their PGs and
data and after some period of time everything will be ok again.

But then I thought that we use dmcrypt in ceph and I would need to somehow
backup all the keys to some offsite location.

So here are my questions:
- How do I backup the lockbox secrets?
- Do I need to backup the whole mon data, and if so how can I do it?


You are indeed correct - the keys need to be backed up outside of Ceph!

See:

 * Issue: https://tracker.ceph.com/issues/63801
 * PR by poelzl to add automatic backups: 
https://github.com/ceph/ceph/pull/56772
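
Until something like that is merged, a minimal sketch for pulling the 
per-OSD dm-crypt keys out of the mon config-key store (the key names below 
are what ceph-volume uses today, so verify with the ls first):

   ceph config-key ls | grep dm-crypt

   mkdir -p keys
   for k in $(ceph config-key ls | grep -o '"dm-crypt[^"]*"' | tr -d '"'); do
       ceph config-key get "$k" > "keys/${k//\//_}"
   done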




Regards


Christian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Developer Summit (Tentacle) Aug 12-19

2024-08-06 Thread Noah Lehman
Hi Ceph users,

The next Ceph Developer Summit is happening virtually from August 12 – 19,
2024 and we want to see you there. The focus of the summit will include
planning around our next release, Tentacle, and everyone in our community
is welcome to participate!

Learn more and RSVP here:
https://ceph.io/en/community/events/2024/ceph-developer-summit-tentacle/


Best,
Noah
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] RGW bucket notifications stop working after a while and blocking requests

2024-08-06 Thread Florian Schwab
Looks like the issue was fixed in the latest reef release (18.2.4).

I found the following commit that seems to fix it:
https://github.com/ceph/ceph/commit/26f1d6614bbc45a0079608718f191f94bd4eebb6

After upgrading we also haven’t encountered the problem again.


Cheers,
Florian

> On 5. Aug 2024, at 14:38, Florian Schwab  wrote:
> 
> Hi Alex,
> 
> thank you for the script. We will monitor how the queue fills up to see if 
> this is the issue or not.
> 
> 
> Cheers,
> Florian
> 
>> On 5. Aug 2024, at 14:01, Alex Hussein-Kershaw (HE/HIM) 
>>  wrote:
>> 
>> Hi Florian,
>> 
>> We are also gearing up to use persistent bucket notifications, but have not 
>> got as far as you yet so quite interested in this. As I understand it, a 
>> bunch of new function is coming in Squid on the radosgw-admin command to 
>> allow gathering metrics from the queues, but they are not available yet in 
>> Reef.
>> 
>> I've used this: parse-notifications.py (github.com) 
>>  to parse 
>> all the objects in the queue, hopefully it helps you (credit to Yuval who 
>> wrote it). The reservation failure to me does look like the queue is full. 
>> It would surely be interesting to see what is in the queue. 
>> 
>> Best wishes,
>> Alex
>> 
>> From: Florian Schwab > >
>> Sent: Monday, August 5, 2024 11:02 AM
>> To: ceph-users@ceph.io  > >
>> Subject: [EXTERNAL] [ceph-users] RGW bucket notifications stop working after 
>> a while and blocking requests
>>  
>> 
>> Hi,
>> 
>> we just set up 2 new ceph clusters (using rook). To do some processing of 
>> the user activity we configured a topic that sends events to Kafka.
>> 
>> After 5-12 hours this stops working with a 503 SlowDown response:
>> debug 2024-08-02T09:17:58.205+ 7ff4359ad700 1 req 13681579273117692719 
>> 0.00519s ERROR: failed to reserve notification on queue: private.rgw. 
>> error: -28
>> 
>> First thought would be that the queue is full but up to this point see 
>> messages coming into Kafka and without much activity on the RGW itself (only 
>> a few requests against the S3 API) so it can’t be a load issue.
>> 
>> What helps is to remove the notification configuration on the buckets 
>> (put-bucket-notification-configuration). If we directly re-add the previous 
>> notification configuration it also continuous working for a few hours before 
>> failing again with the same error/behaviour.
>> 
>> We haven’t been able to reproduce this if we disable persistence for the 
>> topic so it looks like it is related to the persistence option - otherwise 
>> there would be also no queuing of the event for sending to Kafka.
>> This also suggests that the issue is not with Kafka - this is also what we 
>> suspected first e.g. it can’t handle the amount of messages etc.
>> 
>> Does anyone else have or had this issue and found the cause or a suggestion 
>> on how to best continue debugging? Are there detailed metrics etc. on the 
>> size and usage of the event queue?
>> 
>> 
>> Here is the configuration for the topic and for a bucket:
>> 
>> $ radosgw-admin topic list
>> {
>>"topics": [
>>{
>>"user": "",
>>"name": "private.rgw",
>>"dest": {
>>"push_endpoint": 
>> "kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512",
>>"push_endpoint_args": 
>> "OpaqueData=&Version=2010-03-31&kafka-ack-level=broker&persistent=false&push-endpoint=kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512&use-ssl=true&verify-ssl=true",
>>"push_endpoint_topic": "private.rgw",
>>"stored_secret": true,
>>"persistent": true
>>},
>>"arn": "arn:aws:sns:ceph-objectstore::private.rgw",
>>"opaqueData": ""
>>}
>>]
>> }
>> 
>> $ aws s3api get-bucket-notification-configuration --bucket=XXX
>> {
>>"TopicConfigurations": [
>>{
>>"Id": “my-id",
>>"TopicArn": "arn:aws:sns:ceph-objectstore::private.rgw",
>>"Events": [
>>"s3:ObjectCreated:*",
>>"s3:ObjectRemoved:*"
>>]
>>}
>>]
>> }
>> 
>> 
>> Thank you for any input to solve this!
>> 
>> 
>> Cheers,
>> Florian
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io 
>> To unsubscribe send an email to ceph-users-le...@ceph.io 
>> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] What's the best way to add numerous OSDs?

2024-08-06 Thread Fabien Sirjean

Hello everyone,

We need to add 180 20TB OSDs to our Ceph cluster, which currently 
consists of 540 OSDs of identical size (replicated size 3).


I'm not sure, though: is it a good idea to add all the OSDs at once? Or 
is it better to add them gradually?


The idea is to minimize the impact of rebalancing on the performance of 
CephFS, which is used in production.


Thanks in advance for your opinions and feedback 🙂

Wishing you a great summer,

Fabien
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pull failed on cluster upgrade

2024-08-06 Thread David Orman
What operating system/distribution are you running? What hardware?

David

On Tue, Aug 6, 2024, at 02:20, Nicola Mori wrote:
> I think I found the problem. Setting the cephadm log level to debug and 
> then watching the logs during the upgrade:
>
>ceph config set mgr mgr/cephadm/log_to_cluster_level debug
>ceph -W cephadm --watch-debug
>
> I found this line just before the error:
>
>ceph: stderr Fatal glibc error: CPU does not support x86-64-v2
>
> The same error comes out if I try to launch the container manually on 
> the culprit machine, so I'd say it's a dead end. What I don't understand 
> is how such a big change like dismissing support for an old architecture 
> has been introduced in a point release (18.2.2 works fine), but maybe 
> it's just me that's missing some basic info about the versioning scheme 
> of Ceph.
>
> Anyway, now I'm stuck with 3 daemons running 18.2.4 and the others still 
> on 18.2.2. The cluster looks happy and I see no malfunctioning, can I 
> leave it in this state with no risk? If not, is it safe to rollback the 
> 3 upgraded daemons to the previous version?
> Thanks again,
>
> Nicola
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> Attachments:
> * smime.p7s
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pull failed on cluster upgrade

2024-08-06 Thread Adam King
If you're using VMs,
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/6X6QIEMWDYSA6XOKEYH5OJ4TIQSBD5BL/
might be relevant

On Tue, Aug 6, 2024 at 3:21 AM Nicola Mori  wrote:

> I think I found the problem. Setting the cephadm log level to debug and
> then watching the logs during the upgrade:
>
>ceph config set mgr mgr/cephadm/log_to_cluster_level debug
>ceph -W cephadm --watch-debug
>
> I found this line just before the error:
>
>ceph: stderr Fatal glibc error: CPU does not support x86-64-v2
>
> The same error comes out if I try to launch the container manually on
> the culprit machine, so I'd say it's a dead end. What I don't understand
> is how such a big change like dismissing support for an old architecture
> has been introduced in a point release (18.2.2 works fine), but maybe
> it's just me that's missing some basic info about the versioning scheme
> of Ceph.
>
> Anyway, now I'm stuck with 3 daemons running 18.2.4 and the others still
> on 18.2.2. The cluster looks happy and I see no malfunctioning, can I
> leave it in this state with no risk? If not, is it safe to rollback the
> 3 upgraded daemons to the previous version?
> Thanks again,
>
> Nicola
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm: unable to copy ceph.conf.new

2024-08-06 Thread Magnus Larsen
Hi Ceph-users!

Ceph version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) 
quincy (stable)
Using cephadm to orchestrate the Ceph cluster

I’m running into https://tracker.ceph.com/issues/59189, which is fixed in the 
next version, quincy 17.2.7, via
https://github.com/ceph/ceph/pull/50906

But I am unable to upgrade to the fixed version because of that bug

When I try to upgrade (using "ceph orch upgrade start --image 
internal_mirror/ceph:v17.2.7"), we see the same error message:
executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028', 'dkcphhpcmgt029',
'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034', 'dkcphhpcosd035',
'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039',
'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042', 'dkcphhpcosd043',
'dkcphhpcosd044'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    conn = await self._remote_connection(host, addr)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/etc/ceph/ceph.conf.new: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
    self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
    self.mgr.wait_async(self._write_remote_file(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249, in _write_remote_file
    logger.exception(msg)
orchestrator._interface.OrchestratorError: Unable to write
dkcphhpcmgt028:/etc/ceph/ceph.conf: scp: /tmp/etc/ceph/ceph.conf.new:
Permission denied

We were thinking about removing the keyring from the Ceph orchestrator 
(https://docs.ceph.com/en/latest/cephadm/operations/#putting-a-keyring-under-management),
which would then make Ceph not try to copy over a new ceph.conf, alleviating 
the problem 
(https://docs.ceph.com/en/latest/cephadm/operations/#client-keyrings-and-configs),
but in doing so, Ceph will kindly remove the key from all nodes 
(https://docs.ceph.com/en/latest/cephadm/operations/#disabling-management-of-a-keyring-file)
leaving us without the admin keyring. So that doesn’t sound like a path we want 
to take :S

Does anybody know how to get around this issue, so I can get to the 
version where it is fixed for good?
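
One thing I still want to rule out (pure guess on my side): a stale /tmp/etc 
on that host, left over from an earlier run or owned by a different user, 
that the scp then trips over:

   # on dkcphhpcmgt028
   ls -ld /tmp/etc /tmp/etc/ceph
   # if it exists and belongs to someone else, move it aside and retry
   mv /tmp/etc /tmp/etc.bak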

Thanks,
Magnus
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS failing because of corrupted dentries in lost+found after update from 17.2.7 to 18.2.0

2024-08-06 Thread Justin Lee
Hi Dhairya,

Thanks for the response! We tried removing it as you suggested with `rm
-rf` but the command just hangs indefinitely with no output. We are also
unable to `ls` lost+found, or otherwise interact with the directory's
contents.

Best,
Justin lee

On Fri, Aug 2, 2024 at 8:24 AM Dhairya Parmar  wrote:

> Hi Justin,
>
> You should be able to delete inodes from the lost+found dirs just by simply
> `sudo rm -rf lost+found/`
>
> What do you get when you try to delete? Do you get `EROFS`?
>
> On Fri, Aug 2, 2024 at 8:42 AM Justin Lee 
> wrote:
>
>> After we updated our ceph cluster from 17.2.7 to 18.2.0 the MDS kept being
>> marked as damaged and stuck in up:standby with these errors in the log.
>>
>> debug-12> 2024-07-14T21:22:19.962+ 7f020cf3a700  1
>> mds.0.cache.den(0x4 1000b3bcfea) loaded already corrupt dentry:
>> [dentry #0x1/lost+found/1000b3bcfea [head,head] rep@0.0 NULL (dversion
>> lock) pv=0 v=2 ino=(nil) state=0 0x558ca63b6500]
>> debug-11> 2024-07-14T21:22:19.962+ 7f020cf3a700 10
>> mds.0.cache.dir(0x4) go_bad_dentry 1000b3bcfea
>>
>> these log lines are repeated a bunch of times in our MDS logs, all on
>> dentries that are within the lost+found directory. After reading this
>> mailing
>> list post , we
>> tried setting ceph config set mds mds_go_bad_corrupt_dentry false. This
>> seemed to successfully circumvent the issue, however, after a few seconds
>> our MDS crashes. Our 3 MDS are now stuck in a cycle of active -> crash ->
>> standby -> back to active. Because of this our actual ceph fs is extremely
>> laggy.
>>
>> We read here  that
>> reef now makes it possible to delete the lost+found directory, which might
>> solve our problem, but it is inaccessible, to cd, ls, rm, etc.
>>
>> Has anyone seen this type of issue or know how to solve it? Thanks!
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW bucket notifications stop working after a while and blocking requests

2024-08-06 Thread Florian Schwab
Hi,

we just set up 2 new ceph clusters (using rook). To do some processing of the 
user activity we configured a topic that sends events to Kafka.

After 5-12 hours this stops working with a 503 SlowDown response:
debug 2024-08-02T09:17:58.205+ 7ff4359ad700 1 req 13681579273117692719 
0.00519s ERROR: failed to reserve notification on queue: private.rgw. 
error: -28

My first thought would be that the queue is full, but up to this point we see 
messages coming into Kafka and there is not much activity on the RGW itself 
(only a few requests against the S3 API), so it can’t be a load issue.

What helps is to remove the notification configuration on the buckets 
(put-bucket-notification-configuration). If we directly re-add the previous 
notification configuration, it also continues working for a few hours before 
failing again with the same error/behaviour.

We haven’t been able to reproduce this when we disable persistence for the 
topic, so it looks like it is related to the persistence option - otherwise 
there would be no queuing of events for sending to Kafka.
This also suggests that the issue is not with Kafka itself - that was our 
first suspicion, e.g. that it couldn’t handle the amount of messages.

Has anyone else had this issue and found the cause, or does anyone have a 
suggestion on how best to continue debugging? Are there detailed metrics on 
the size and usage of the event queue?
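
One more data point: error -28 should be ENOSPC, which would fit the 
persistent queue being full despite the low traffic. As far as I understand 
it, the queue is kept as objects in the zone's log pool, so I was planning to 
check whether it keeps growing with something like this (pool name is a guess 
for our setup, adjust to your zone):

   rados -p <zone>.rgw.log ls | grep private.rgw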


Here is the configuration for the topic and for a bucket:

$ radosgw-admin topic list
{
"topics": [
{
"user": "",
"name": "private.rgw",
"dest": {
"push_endpoint": 
"kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512",
"push_endpoint_args": 
"OpaqueData=&Version=2010-03-31&kafka-ack-level=broker&persistent=false&push-endpoint=kafka://rgw-sasl-kafka-user:x...@kafka-kafka-bootstrap.kafka.svc:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512&use-ssl=true&verify-ssl=true",
"push_endpoint_topic": "private.rgw",
"stored_secret": true,
"persistent": true
},
"arn": "arn:aws:sns:ceph-objectstore::private.rgw",
"opaqueData": ""
}
]
}

$ aws s3api get-bucket-notification-configuration --bucket=XXX
{
"TopicConfigurations": [
{
"Id": “my-id",
"TopicArn": "arn:aws:sns:ceph-objectstore::private.rgw",
"Events": [
"s3:ObjectCreated:*",
"s3:ObjectRemoved:*"
]
}
]
}


Thank you for any input to solve this!


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can you return orphaned objects to a bucket?

2024-08-06 Thread vuphung69
Hi,
Currently I see it only supports the latest version; is there any way to 
support older versions like Pacific or Quincy?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph MDS failing because of corrupted dentries in lost+found after update from 17.2.7 to 18.2.0

2024-08-06 Thread Justin Lee
The actual mount command doesn't hang; we just can't interact with any of
the directory's contents once mounted. I couldn't find anything unusual in
the logs.

Best,
Justin Lee

On Fri, Aug 2, 2024 at 10:38 AM Dhairya Parmar  wrote:

> So the mount hung? Can you see anything suspicious in the logs?
>
> On Fri, Aug 2, 2024 at 7:17 PM Justin Lee 
> wrote:
>
>> Hi Dhairya,
>>
>> Thanks for the response! We tried removing it as you suggested with `rm
>> -rf` but the command just hangs indefinitely with no output. We are also
>> unable to `ls lost_found`, or otherwise interact with the directory's
>> contents.
>>
>> Best,
>> Justin lee
>>
>> On Fri, Aug 2, 2024 at 8:24 AM Dhairya Parmar  wrote:
>>
>>> Hi Justin,
>>>
>>> You should able to delete inodes from the lost+found dirs just by simply
>>> `sudo rm -rf lost+found/`
>>>
>>> What do you get when you try to delete? Do you get `EROFS`?
>>>
>>> On Fri, Aug 2, 2024 at 8:42 AM Justin Lee 
>>> wrote:
>>>
 After we updated our ceph cluster from 17.2.7 to 18.2.0 the MDS kept
 being
 marked as damaged and stuck in up:standby with these errors in the log.

 debug-12> 2024-07-14T21:22:19.962+ 7f020cf3a700  1
 mds.0.cache.den(0x4 1000b3bcfea) loaded already corrupt dentry:
 [dentry #0x1/lost+found/1000b3bcfea [head,head] rep@0.0 NULL (dversion
 lock) pv=0 v=2 ino=(nil) state=0 0x558ca63b6500]
 debug-11> 2024-07-14T21:22:19.962+ 7f020cf3a700 10
 mds.0.cache.dir(0x4) go_bad_dentry 1000b3bcfea

 these log lines are repeated a bunch of times in our MDS logs, all on
 dentries that are within the lost+found directory. After reading this
 mailing
 list post , we
 tried setting ceph config set mds mds_go_bad_corrupt_dentry false. This
 seemed to successfully circumvent the issue, however, after a few
 seconds
 our MDS crashes. Our 3 MDS are now stuck in a cycle of active -> crash
 ->
 standby -> back to active. Because of this our actual ceph fs is
 extremely
 laggy.

 We read here 
 that
 reef now makes it possible to delete the lost+found directory, which
 might
 solve our problem, but it is inaccessible, to cd, ls, rm, etc.

 Has anyone seen this type of issue or know how to solve it? Thanks!
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What's the best way to add numerous OSDs?

2024-08-06 Thread Anthony D'Atri
Since they’re 20TB, I’m going to assume that these are HDDs.

There are a number of approaches.  One common theme is to avoid rebalancing 
until after all have been added to the cluster and are up / in; otherwise you 
can end up with a storm of map updates and superfluous rebalancing.


One strategy is to set osd_crush_initial_weight = 0 temporarily, so that the 
OSDs when added won’t take any data yet.  Then when you’re ready you can set 
their CRUSH weights up to where they otherwise would be, and unset 
osd_crush_initial_weight so you don’t wonder what the heck is going on six 
months down the road.
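
Roughly, as a sketch (adjust the target weight to your drive size):

   ceph config set osd osd_crush_initial_weight 0
   # ... create the 180 OSDs ...
   ceph osd crush reweight osd.<id> <target-crush-weight>   # script this per new OSD
   ceph config rm osd osd_crush_initial_weight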

Another is to add a staging CRUSH root.  If the new OSDs are all on new hosts, 
you can create CRUSH host buckets for them in advance so that when you create 
the OSDs they go there and again won’t immediately take data.  Then you can 
move the host buckets into the production root in quick succession.

Either way if you do want to add them to the cluster all at once, with HDDs 
you’ll want to limit the rate of backfill so you don’t DoS your clients.  One 
strategy is to leverage pg-upmap with a tool like 
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

Note that to use pg-upmap safely, you will need to ensure that your clients are 
all at Luminous or later, in the case of CephFS I *think* that means kernel 
4.13 or later.  `ceph features` will I think give you that information.
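
The usual pattern with that script looks something like this (a sketch; read 
the script and test it before trusting it with production data):

   ceph osd set norebalance
   # add the OSDs / move the new host buckets into the production root
   ./upmap-remapped.py | sh      # pins remapped PGs back to their current OSDs
   ceph osd unset norebalance
   # the balancer in upmap mode then removes those pins at a controlled pace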

An older method of spreading out the backfill thundering herd was to use a for 
loop to weight up the OSDs in increments of, say, 0.1 at a time, let the 
cluster settle, then repeat.  This strategy results in at least some data 
moving twice, so it’s less efficient.  Similarly you might add, say, one OSD 
per host at a time and let the cluster settle between iterations, which would 
also be less than ideally efficient.

— aad

> On Aug 6, 2024, at 11:08 AM, Fabien Sirjean  wrote:
> 
> Hello everyone,
> 
> We need to add 180 20TB OSDs to our Ceph cluster, which currently consists of 
> 540 OSDs of identical size (replicated size 3).
> 
> I'm not sure, though: is it a good idea to add all the OSDs at once? Or is it 
> better to add them gradually?
> 
> The idea is to minimize the impact of rebalancing on the performance of 
> CephFS, which is used in production.
> 
> Thanks in advance for your opinions and feedback 🙂
> 
> Wishing you a great summer,
> 
> Fabien
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What's the best way to add numerous OSDs?

2024-08-06 Thread Fox, Kevin M
Some kernels (el7?) lie about being jewel until after they are blocked from 
connecting at jewel; then they report newer. Just FYI.


From: Anthony D'Atri 
Sent: Tuesday, August 6, 2024 5:08 PM
To: Fabien Sirjean
Cc: ceph-users
Subject: [ceph-users] Re: What's the best way to add numerous OSDs?


Since they’re 20TB, I’m going to assume that these are HDDs.

There are a number of approaches.  One common theme is to avoid rebalancing 
until after all have been added to the cluster and are up / in, otherwise you 
can end up with a storm of map updates and superfluous rebalancing.


One strategy is to set osd_crush_initial_weight = 0 temporarily, so that the 
OSDs when added won’t take any data yet.  Then when you’re ready you can set 
their CRUSH weights up to where they otherwise would be, and unset 
osd_crush_initial_weight so you don’t wonder what the heck is going on six 
months down the road.

Another is to add a staging CRUSH root.  If the new OSDs are all on new hosts, 
you can create CRUSH host buckets for them in advance so that when you create 
the OSDs they go there and again won’t immediately take data.  Then you can 
move the host buckets into the production root in quick succession.

Either way if you do want to add them to the cluster all at once, with HDDs 
you’ll want to limit the rate of backfill so you don’t DoS your clients.  One 
strategy is to leverage pg-upmap with a tool like 
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

Note that to use pg-upmap safely, you will need to ensure that your clients are 
all at Luminous or later, in the case of CephFS I *think* that means kernel 
4.13 or later.  `ceph features` will I think give you that information.

An older method of spreading out the backfill thundering herd was to use a for 
loop to weight up the OSDs in increments of, say, 0.1 at a time, let the 
cluster settle, then repeat.  This strategy results in at least some data 
moving twice, so it’s less efficient.  Similarly you might add, say, one OSD 
per host at a time and let the cluster settle between iterations, which would 
also be less than ideally efficient.

— aad

> On Aug 6, 2024, at 11:08 AM, Fabien Sirjean  wrote:
>
> Hello everyone,
>
> We need to add 180 20TB OSDs to our Ceph cluster, which currently consists of 
> 540 OSDs of identical size (replicated size 3).
>
> I'm not sure, though: is it a good idea to add all the OSDs at once? Or is it 
> better to add them gradually?
>
> The idea is to minimize the impact of rebalancing on the performance of 
> CephFS, which is used in production.
>
> Thanks in advance for your opinions and feedback 🙂
>
> Wishing you a great summer,
>
> Fabien
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: What's the best way to add numerous OSDs?

2024-08-06 Thread Boris
Hi Fabien,

additional to what Anthony said you could do the following:

- `ceph osd set nobackfill` to disable initial backfilling
- `ceph config set osd osd_mclock_override_recovery_settings true` to
override the mclock scheduler backfill settings
- Let the orchestrator add one host each time. I would wait between each
host until all the peering and stuff is done and only the backfilling is
left over. From my experience adding a whole host is not a problem, unless
you are hit by the pglog_dup bug (was fixed in pacific IIRC)
- `ceph tell 'osd.*' injectargs '--osd-max-backfills 1'` to limit the
backfilling as much as possible
- `ceph osd unset nobackfill` to start the actual backfill process
- `ceph config set osd osd_mclock_override_recovery_settings false` after
backfilling is done. I would restart all OSDs after that to make sure the
OSDs got the correct backfilling values :)

Make sure your mons have enough oomph to handle the workload.

At least, that would be my approach when adding that amount of disks.
Usually I only add 36 disks at a time, when capacity gets a little low :)



Am Di., 6. Aug. 2024 um 17:10 Uhr schrieb Fabien Sirjean <
fsirj...@eddie.fdn.fr>:

> Hello everyone,
>
> We need to add 180 20TB OSDs to our Ceph cluster, which currently
> consists of 540 OSDs of identical size (replicated size 3).
>
> I'm not sure, though: is it a good idea to add all the OSDs at once? Or
> is it better to add them gradually?
>
> The idea is to minimize the impact of rebalancing on the performance of
> CephFS, which is used in production.
>
> Thanks in advance for your opinions and feedback 🙂
>
> Wishing you a great summer,
>
> Fabien
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io