[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Florian Haas

On 25/09/2024 09:05, Eugen Block wrote:

Hi,

for me this worked in a 17.2.7 cluster just fine


Huh, interesting!


(except for erasure-coded pools).


Okay, *that* bit is expected. 
https://docs.ceph.com/en/quincy/rados/configuration/pool-pg-config-ref/#confval-osd_pool_default_crush_rule 
does say that the option sets the "default CRUSH rule to use when 
creating a replicated pool".



quincy-1:~ # ceph osd crush rule create-replicated new-rule default osd hdd


Mine was a rule created with "create-simple"; would that make a difference?

Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block

Still works:

quincy-1:~ # ceph osd crush rule create-simple simple-rule default osd
quincy-1:~ # ceph osd crush rule dump simple-rule
{
"rule_id": 4,
...

quincy-1:~ # ceph config set mon osd_pool_default_crush_rule 4
quincy-1:~ # ceph osd pool create test-pool6
pool 'test-pool6' created
quincy-1:~ # ceph osd pool ls detail | grep test-pool
pool 24 'test-pool6' replicated size 2 min_size 1 crush_rule 4  
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change  
2615 flags hashpspool stripe_width 0
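
Florian, it might also be worth checking what the running mons actually
report for that option, e.g. (the mon name is just an example from my
test cluster):

quincy-1:~ # ceph config get mon osd_pool_default_crush_rule
quincy-1:~ # ceph daemon mon.quincy-1 config get osd_pool_default_crush_rule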




Zitat von Florian Haas :


On 25/09/2024 09:05, Eugen Block wrote:

Hi,

for me this worked in a 17.2.7 cluster just fine


Huh, interesting!


(except for erasure-coded pools).


Okay, *that* bit is expected.  
https://docs.ceph.com/en/quincy/rados/configuration/pool-pg-config-ref/#confval-osd_pool_default_crush_rule does say that the option sets the "default CRUSH rule to use when creating a replicated  
pool".



quincy-1:~ # ceph osd crush rule create-replicated new-rule default osd hdd


Mine was a rule created with "create-simple"; would that make a difference?

Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs +inotify = caps problem?

2024-09-25 Thread Burkhard Linke

Hi,


we are currently trying to debug and understand a problem with cephfs 
and inotify watchers. A user is running Visual Studio Code with a 
workspace on a cephfs mount. VSC uses inotify for monitoring files and 
directories in the workspace:



root@cli:~# ./inotify-info
--
INotify Limits:
  max_queued_events    16,384
  max_user_instances   128
  max_user_watches 1,048,576
--
        Pid   Uid   App                        Watches  Instances
    3599940  1236   node                         1,681          1
          1     0   systemd                        106          5
    3600170  1236   node                            54          1
     874797     0   udevadm                         17          1
    3599118     0   systemd                          7          3
    3599707  1236   systemd                          7          3
    3599918  1236   node                             6          1
       2047   100   dbus-daemon                      3          1
       2054     0   sssd                             2          1
       2139     0   systemd-logind (deleted)         1          1
       2446     0   agetty                           1          1
        361  1236   node                             1          1
--
Total inotify Watches:   1886
Total inotify Instances: 20
--
root@cli:~# cat /sys/kernel/debug/ceph/XYZ.client354064780/caps  | wc -l
1773083

root@cli:~# uname -a
Linux cli 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 
(2024-07-15) x86_64 GNU/Linux



So roughly 1,700 watchers result in over 1.7 million caps (some of the 
watchers might be for files on different filesystems). I've also checked 
this on the MDS side; it also reports a very high number of caps for 
that client. Running tools like lsof on the host as root only reports 
very few open files (<50), so inotify seems to be responsible for the 
massive caps build-up. Terminating VSC results in a sharp drop of the 
caps (just a few open files / directories left afterwards).
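
In case it helps anyone else reproduce this, we looked at the MDS side 
roughly like this (the MDS name and the jq filter are just examples):

ceph tell mds.<mds-name> session ls | jq '.[] | {id, num_caps}'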



Is this a known problem?


Best regards,

Burkhard Linke

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Backup strategies for rgw s3

2024-09-25 Thread Burkhard Linke

Hi,

On 9/25/24 16:57, Adam Prycki wrote:

Hi,

I'm currently working on a project which requires us to backup 2 
separate s3 zones/realms and retain it for few months. Requirements 
were written by someone who doesn't know ceph rgw capabilities.
We have to do incremental and full backups. Each type of backup has 
separate retention period.


Is there a way to accomplish this with in a sensible way?

My fist idea would be to create multisite replication to archive-zone. 
But I cannot really enforce data retention on archive zone. It would 
require us to overwrite lifecycle policies created by our users.
As far as I know it's not possible to create zone level lifecycle 
policy. Users get their accounts are provisioned via openstack swift.


Second idea would be to create custom backup script and copy all the 
buckets in the cluster to different s3 zone. Destination buckets could 
be all versioned to have desired retention. But this option feels very 
hackish and messy. Backing up 2 separate s3 zones to one could cause 
collision in bucket names. Prefixing bucket names with additional 
information is not safe because buckets have fixed name length. 
Prefixing object key name is also not ideal.


Some backup solutions (e.g. Bareos, https://www.bareos.com) support 
backing up the content of S3 buckets. Testing this is still on our TODO 
list. I'm not sure how well S3-specific stuff like metadata, ACLs, 
versions etc. is handled (probably not at all), but it might be a good 
starting point.



Best regards,

Burkhard Linke

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Backup strategies for rgw s3

2024-09-25 Thread Tim Holloway
Well, using Ceph as its own backup system has its merits, and I've
little doubt something could be cooked up, but another alternative
would be to use a true backup system.

In my particular case, I use the Bacula backup system product. It's not
the most polished thing around, but it is a full-featured
backup/restore solution including aging archives, compressed backups
and even stuff like automated tape library management. I do incremental
daily backups and weekly full backups.

Bacula works by linking clients that can read the filesystem (or
buckets, in your case) with backend storage units, which can be physical
devices or disk directories. In my case, I back up my ceph filesystem to
a directory, with each backup volume size-limited to fit on a DVD in
case I want non-magnetic long-term storage.

The backup volume file format is analogous to a tarball in that it
contains directory and attribute metadata making for a faithful backup
and restore. There are offline utilities that can be used to restore if
the master backup directory is unavailable.

The Bacula solution for backing up from S3 is a plugin for the
Enterprise Edition product. What it actually does is download the
bucket data from the S3 server to a local spool file, download the S3
metadata directly, then transmit them to the linked storage director
via the standard mechanisms.

   Tim

On Wed, 2024-09-25 at 16:57 +0200, Adam Prycki wrote:
> Hi,
> 
> I'm currently working on a project which requires us to backup 2 
> separate s3 zones/realms and retain it for few months. Requirements
> were 
> written by someone who doesn't know ceph rgw capabilities.
> We have to do incremental and full backups. Each type of backup has 
> separate retention period.
> 
> Is there a way to accomplish this with in a sensible way?
> 
> My fist idea would be to create multisite replication to archive-
> zone. 
> But I cannot really enforce data retention on archive zone. It would 
> require us to overwrite lifecycle policies created by our users.
> As far as I know it's not possible to create zone level lifecycle 
> policy. Users get their accounts are provisioned via openstack swift.
> 
> Second idea would be to create custom backup script and copy all the 
> buckets in the cluster to different s3 zone. Destination buckets
> could 
> be all versioned to have desired retention. But this option feels
> very 
> hackish and messy. Backing up 2 separate s3 zones to one could cause 
> collision in bucket names. Prefixing bucket names with additional 
> information is not safe because buckets have fixed name length. 
> Prefixing object key name is also not ideal.
> 
> Best regards
> Adam Prycki
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Backup strategies for rgw s3

2024-09-25 Thread Shilpa Manjrabad Jagannath
starting from quincy, you can define rules for lifecycle to execute on
Archive zone alone by specifying
 flag under 

https://tracker.ceph.com/issues/53361
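
If it helps, such a rule looks roughly like the sketch below. Please
treat this as written from memory and double-check against the tracker
and the docs; the rule ID and the 30-day noncurrent-version expiration
are just examples, and the archive-zone marker goes inside the rule's
Filter:

<LifecycleConfiguration>
  <Rule>
    <ID>archive-zone-only</ID>
    <!-- sketch: exact element name/placement per the tracker above -->
    <Filter>
      <ArchiveZone />
    </Filter>
    <Status>Enabled</Status>
    <NoncurrentVersionExpiration>
      <NoncurrentDays>30</NoncurrentDays>
    </NoncurrentVersionExpiration>
  </Rule>
</LifecycleConfiguration>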


On Wed, Sep 25, 2024 at 7:59 AM Adam Prycki  wrote:

> Hi,
>
> I'm currently working on a project which requires us to backup 2
> separate s3 zones/realms and retain it for few months. Requirements were
> written by someone who doesn't know ceph rgw capabilities.
> We have to do incremental and full backups. Each type of backup has
> separate retention period.
>
> Is there a way to accomplish this with in a sensible way?
>
> My fist idea would be to create multisite replication to archive-zone.
> But I cannot really enforce data retention on archive zone. It would
> require us to overwrite lifecycle policies created by our users.
> As far as I know it's not possible to create zone level lifecycle
> policy. Users get their accounts are provisioned via openstack swift.
>
> Second idea would be to create custom backup script and copy all the
> buckets in the cluster to different s3 zone. Destination buckets could
> be all versioned to have desired retention. But this option feels very
> hackish and messy. Backing up 2 separate s3 zones to one could cause
> collision in bucket names. Prefixing bucket names with additional
> information is not safe because buckets have fixed name length.
> Prefixing object key name is also not ideal.
>
> Best regards
> Adam Prycki
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Backup strategies for rgw s3

2024-09-25 Thread Joachim Kraftmayer
Hi Adam,
we started a github project for s3/SWIFT synchronization, backup, migration
and more use cases.
You also can use it in combination with backup solutions.

https://github.com/clyso/chorus

Joachim

  joachim.kraftma...@clyso.com

  www.clyso.com

  Hohenzollernstr. 27, 80801 Munich

Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306



Am Mi., 25. Sept. 2024 um 19:11 Uhr schrieb Shilpa Manjrabad Jagannath <
smanj...@redhat.com>:

> starting from quincy, you can define rules for lifecycle to execute on
> Archive zone alone by specifying
>  flag under 
>
> https://tracker.ceph.com/issues/53361
>
>
> On Wed, Sep 25, 2024 at 7:59 AM Adam Prycki  wrote:
>
> > Hi,
> >
> > I'm currently working on a project which requires us to backup 2
> > separate s3 zones/realms and retain it for few months. Requirements were
> > written by someone who doesn't know ceph rgw capabilities.
> > We have to do incremental and full backups. Each type of backup has
> > separate retention period.
> >
> > Is there a way to accomplish this with in a sensible way?
> >
> > My fist idea would be to create multisite replication to archive-zone.
> > But I cannot really enforce data retention on archive zone. It would
> > require us to overwrite lifecycle policies created by our users.
> > As far as I know it's not possible to create zone level lifecycle
> > policy. Users get their accounts are provisioned via openstack swift.
> >
> > Second idea would be to create custom backup script and copy all the
> > buckets in the cluster to different s3 zone. Destination buckets could
> > be all versioned to have desired retention. But this option feels very
> > hackish and messy. Backing up 2 separate s3 zones to one could cause
> > collision in bucket names. Prefixing bucket names with additional
> > information is not safe because buckets have fixed name length.
> > Prefixing object key name is also not ideal.
> >
> > Best regards
> > Adam Prycki
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Backup strategies for rgw s3

2024-09-25 Thread Adam Prycki
Yes, I know. It's just that I would need to define a zone-wide default 
lifecycle.
For example: the archive zone stores 30 days of object versions unless 
specified otherwise.

Is there a way to do it?

As far as I know, the lifecycle you linked is configured per bucket.
As a small cloud provider we cannot really configure lifecycle policies 
for our users.


Adam Prycki

On 25.09.2024 19:10, Shilpa Manjrabad Jagannath wrote:

starting from quincy, you can define rules for lifecycle to execute on
Archive zone alone by specifying
 flag under 

https://tracker.ceph.com/issues/53361


On Wed, Sep 25, 2024 at 7:59 AM Adam Prycki  wrote:


Hi,

I'm currently working on a project which requires us to backup 2
separate s3 zones/realms and retain it for few months. Requirements were
written by someone who doesn't know ceph rgw capabilities.
We have to do incremental and full backups. Each type of backup has
separate retention period.

Is there a way to accomplish this with in a sensible way?

My fist idea would be to create multisite replication to archive-zone.
But I cannot really enforce data retention on archive zone. It would
require us to overwrite lifecycle policies created by our users.
As far as I know it's not possible to create zone level lifecycle
policy. Users get their accounts are provisioned via openstack swift.

Second idea would be to create custom backup script and copy all the
buckets in the cluster to different s3 zone. Destination buckets could
be all versioned to have desired retention. But this option feels very
hackish and messy. Backing up 2 separate s3 zones to one could cause
collision in bucket names. Prefixing bucket names with additional
information is not safe because buckets have fixed name length.
Prefixing object key name is also not ideal.

Best regards
Adam Prycki
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Backup strategies for rgw s3

2024-09-25 Thread Adam Prycki

Hi,

I'm currently working on a project which requires us to back up 2 
separate s3 zones/realms and retain the backups for a few months. The 
requirements were written by someone who doesn't know ceph rgw 
capabilities. We have to do incremental and full backups, and each type 
of backup has a separate retention period.


Is there a way to accomplish this in a sensible way?

My first idea would be to create multisite replication to an 
archive zone. But I cannot really enforce data retention on the archive 
zone; it would require us to overwrite lifecycle policies created by 
our users. As far as I know it's not possible to create a zone-level 
lifecycle policy. Our users' accounts are provisioned via openstack swift.


My second idea would be to create a custom backup script and copy all 
the buckets in the cluster to a different s3 zone. The destination 
buckets could all be versioned to provide the desired retention, but 
this option feels very hackish and messy. Backing up 2 separate s3 
zones into one could cause collisions in bucket names. Prefixing bucket 
names with additional information is not safe because bucket names have 
a fixed maximum length. Prefixing object key names is also not ideal.


Best regards
Adam Prycki


smime.p7s
Description: Kryptograficzna sygnatura S/MIME
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mds daemon damaged - assert failed

2024-09-25 Thread Kyriazis, George


> On Sep 25, 2024, at 1:05 AM, Eugen Block  wrote:
> 
> Great that you got your filesystem back.
> 
>> cephfs-journal-tool journal export
>> cephfs-journal-tool event recover_dentries summary
>> 
>> Both failed
> 
> Your export command seems to be missing the output file, or was it not the 
> exact command?

Yes I didn’t include the output file in my snippet.  Sorry for the confusion.  
But the command did in fact complain that the journal was corrupted.
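
For the archives, the full command forms I used were along these lines 
(the filesystem name and backup path are placeholders):

cephfs-journal-tool --rank=<fs_name>:0 journal export /root/mds-journal-backup.bin
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries summary
# the truncation step mentioned below:
cephfs-journal-tool --rank=<fs_name>:0 journal reset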

> 
>> Also, I understand that the metadata itself is sitting on the disk, but it 
>> looks like a single point of failure.  What’s the logic behind having a 
>> simple metadata location, but multiple mds servers?
> 
> I think there's a misunderstanding, the metadata is in the cephfs metadata 
> pool, not on the local disk of your machine.
> 

By “disk” I meant the concept of permanent storage, i.e. Ceph.  Yes, our 
understanding matches.  But the question still remains as to why that assert 
would trigger.  Is it because of a software issue (bug?) that caused the 
journal to be corrupted, or did something else corrupt the journal and cause 
the MDS to throw the assertion?  Basically, I’m trying to find a possible 
root cause.

Thank you!

George


> 
> Zitat von "Kyriazis, George" :
> 
>> I managed to recover my filesystem.
>> 
>> cephfs-journal-tool journal export
>> cephfs-journal-tool event recover_dentries summary
>> 
>> Both failed
>> 
>> But truncating the journal and following some of the instructions in 
>> https://people.redhat.com/bhubbard/nature/default/cephfs/disaster-recovery-experts/
>>  helped me to get the mds up.
>> 
>> Then I scrubbed and repaired the filesystem, and I “believe” I’m back in 
>> business.
>> 
>> What is weird though is that an assert failed as shown in the stack dump 
>> below.  Was that a legitimate assertion that indicates a bigger issue, or 
>> was it a false assertion?
>> 
>> Also, I understand that the metadata itself is sitting on the disk, but it 
>> looks like a single point of failure.  What’s the logic behind having a 
>> simple metadata location, but multiple mds servers?
>> 
>> Thanks!
>> 
>> George
>> 
>> 
>> On Sep 24, 2024, at 5:55 AM, Eugen Block  wrote:
>> 
>> Hi,
>> 
>> I would probably start by inspecting the journal with the 
>> cephfs-journal-tool [0]:
>> 
>> cephfs-journal-tool [--rank=:{mds-rank|all}] journal inspect
>> 
>> And it could be helful to have the logs prior to the assert.
>> 
>> [0] 
>> https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-inspect
>> 
>> Zitat von "Kyriazis, George" :
>> 
>> Hello ceph users,
>> 
>> I am in the unfortunate situation of having a status of “1 mds daemon 
>> damaged”.  Looking at the logs, I see that the daemon died with an assert as 
>> follows:
>> 
>> ./src/osdc/Journaler.cc: 1368: FAILED ceph_assert(trim_to > trimming_pos)
>> 
>> ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x12a) [0x73a83189d7d9]
>> 2: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
>> 3: (Journaler::_trim()+0x671) [0x57235caa70b1]
>> 4: (Journaler::_finish_write_head(int, Journaler::Header&, 
>> C_OnFinisher*)+0x171) [0x57235caaa8f1]
>> 5: (Context::complete(int)+0x9) [0x57235c716849]
>> 6: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
>> 7: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
>> 8: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
>> 
>>0> 2024-09-23T14:10:26.490-0500 73a822c006c0 -1 *** Caught signal 
>> (Aborted) **
>> in thread 73a822c006c0 thread_name:MR_Finisher
>> 
>> ceph version 18.2.2 (e9fe820e7fffd1b7cde143a9f77653b73fcec748) reef (stable)
>> 1: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050) [0x73a83105b050]
>> 2: /lib/x86_64-linux-gnu/libc.so.6(+0x8ae2c) [0x73a8310a9e2c]
>> 3: gsignal()
>> 4: abort()
>> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x185) [0x73a83189d834]
>> 6: /usr/lib/ceph/libceph-common.so.2(+0x29d974) [0x73a83189d974]
>> 7: (Journaler::_trim()+0x671) [0x57235caa70b1]
>> 8: (Journaler::_finish_write_head(int, Journaler::Header&, 
>> C_OnFinisher*)+0x171) [0x57235caaa8f1]
>> 9: (Context::complete(int)+0x9) [0x57235c716849]
>> 10: (Finisher::finisher_thread_entry()+0x16d) [0x73a83194659d]
>> 11: /lib/x86_64-linux-gnu/libc.so.6(+0x89134) [0x73a8310a8134]
>> 12: /lib/x86_64-linux-gnu/libc.so.6(+0x1097dc) [0x73a8311287dc]
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>> 
>> 
>> As listed above, I am running 18.2.2 on a proxmox cluster with a hybrid 
>> hdd/sdd setup.  2 cephfs filesystems.  The mds responsible for the hdd 
>> filesystem is the one that died.
>> 
>> Output of ceph -s follows:
>> 
>> root@vis-mgmt:~/bin# ceph -s
>> cluster:
>>   id: ec2c9542-dc1b-4af6-9f21-0adbcabb9452
>>   health: HEALTH_ERR
>>   1 filesystem is degraded
>>   1 filesystem is offli

[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block

Hi,

for me this worked in a 17.2.7 cluster just fine (except for  
erasure-coded pools).


quincy-1:~ # ceph osd crush rule create-replicated new-rule default osd hdd

quincy-1:~ # ceph config set mon osd_pool_default_crush_rule 1

quincy-1:~ # ceph osd pool create test-pool2
pool 'test-pool2' created

quincy-1:~ # ceph osd pool ls detail | grep test-pool2
pool 20 'test-pool2' replicated size 2 min_size 1 crush_rule 1  
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change  
2593 flags hashpspool stripe_width 0


quincy-1:~ # ceph versions
{
...
"overall": {
"ceph version 17.2.7  
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 11

}
}

Setting the option globally works as well for me.

Regards,
Eugen

Zitat von Florian Haas :


Hello everyone,

my cluster has two CRUSH rules: the default replicated_rule  
(rule_id 0), and another rule named rack-aware (rule_id 1).


Now, if I'm not misreading the config reference, I should be able to  
define that all future-created pools use the rack-aware rule, by  
setting osd_pool_default_crush_rule to 1.


I've verified that this option is defined in  
src/common/options/global.yaml.in, so the "global" configuration  
section should be the applicable one (I did try with "mon" and "osd"  
also, for good measure).


However, setting this option, in Quincy, apparently has no effect:

# ceph config set global osd_pool_default_crush_rule 1
# ceph osd pool create foo
pool 'foo' created
# ceph osd pool ls detail | grep foo
# pool 9 'foo' replicated size 3 min_size 2 crush_rule 0 object_hash  
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 264 flags  
hashpspool stripe_width 0


I am seeing this behaviour in 17.2.7. After an upgrade to Reef  
(18.2.4) it is gone, the option behaves as documented, and new pools  
are created with a crush_rule of 1:


# ceph osd pool create bar
pool 'bar' created
# ceph osd pool ls detail | grep bar
pool 10 'bar' replicated size 3 min_size 2 crush_rule 1 object_hash  
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 302 flags  
hashpspool stripe_width 0 read_balance_score 4.00


However, the documentation at  
https://docs.ceph.com/en/quincy/rados/configuration/pool-pg-config-ref/#confval-osd_pool_default_crush_rule asserts that osd_pool_default_crush_rule should already work in Quincy, and the Reef release notes at https://docs.ceph.com/en/latest/releases/reef/ don't mention a fix covering  
this.


Am I doing something wrong? Is this a documentation bug, and the  
option can't work in Quincy? Was this "accidentally" fixed at some  
point in the Reef cycle?


Thanks in advance for any insight you might be able to share.

Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: Bucket Notifications v2 & Multisite Redundancy

2024-09-25 Thread Alex Hussein-Kershaw (HE/HIM)
Sadly I think I've found another issue here that prevents my use case even with 
notifications_v2 disabled.

My repro scenario:

  *
Deployed a fresh multisite Ceph cluster at 19.1.1 (siteA is the master, siteB 
is non-master)
  *
Immediately disabled notifications_v2 and update/commit the period.
  *
Create an S3 user and bucket.

Then, the working parts:

  *
Create a topic on siteA, siteB doesn't show it with "topic list".
  *
Create a notification (for replication:* events) on siteA, siteB doesn't show 
it with "notification list".
  *
Write an object to the bucket on siteB. The object is replicated and an event 
shows up on the topic with "topic stats". All good so far.

...and the problematic part:

  *
Create a topic on siteB, siteB shows the single topic with "topic list", 
however siteA now shows both topics.
  *
Create a notification on siteB, again siteB shows the single notification with 
"notification list", however siteA shows both notifications.
  *
Write an object to the bucket on siteB. The object is replicated to siteA and 
an event is added to both topics.

So I think that is demonstrating the inability of notification config to remain 
single site on the latest Squid RC.
Given the conversation we had below I think this is a bug. Happy to raise a 
tracker.  Welcome any thoughts. I'll try to repro this on Reef shortly.
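
For reference, the topic and notification in the repro above were created along 
these lines (endpoint URLs, names and the topic ARN are placeholders written 
from memory, so treat this as a sketch rather than the exact commands):

# topic, via the SNS-compatible API on siteA's RGW
aws --endpoint-url http://siteA-rgw:8080 sns create-topic --name siteA-topic \
    --attributes '{"push-endpoint": "http://siteA-consumer:8080/events"}'

# notification on the bucket, pointing at that topic
aws --endpoint-url http://siteA-rgw:8080 s3api put-bucket-notification-configuration \
    --bucket test-bucket --notification-configuration '{"TopicConfigurations":
    [{"Id": "siteA-notif", "TopicArn": "arn:aws:sns:default::siteA-topic",
      "Events": ["s3:Replication:*"]}]}'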

Thanks,
Alex


From: Alex Hussein-Kershaw (HE/HIM) 
Sent: Tuesday, September 17, 2024 11:02 AM
To: Yuval Lifshitz ; Ceph Users 
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

Thanks 🙂

I've raised:
Bug #68102: rgw: "radosgw-admin topic list" may contain duplicated data and 
redundant nesting - rgw - 
Ceph
Enhancement #68104: rgw: Add a "disable replication" flag to bucket 
notification configuration - rgw - Ceph

Also re-including the mailing list as it was dropped.


From: Yuval Lifshitz 
Sent: Tuesday, September 17, 2024 10:36 AM
To: Alex Hussein-Kershaw (HE/HIM) 
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

ok, got it.
not even sure we support attribute filtering for these types of events... i 
think you have to go with v1 for now.

would be great if you also submit a tracker for the v1 topic list

On Tue, Sep 17, 2024 at 12:24 PM Alex Hussein-Kershaw (HE/HIM) 
<alex...@microsoft.com> wrote:
Indeed, I am using the s3:Replication:Create event (which I think is equivalent 
to s3:ObjectSynced:Create). But this does not solve the problem of the event 
being added to both topics on the site that received the replication.


From: Yuval Lifshitz <ylifs...@redhat.com>
Sent: Tuesday, September 17, 2024 10:21 AM
To: Alex Hussein-Kershaw (HE/HIM) 
<alex...@microsoft.com>
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

if you want to get a notification when an object is synced, you should use a 
different type of notification. the fact that an object is uploaded to siteB 
does not mean it is immediately synced to siteA.
would recommend using s3:ObjectSynced:Create event type in this case. you will 
get this event only when an object is synced.

On Tue, Sep 17, 2024 at 12:05 PM Alex Hussein-Kershaw (HE/HIM) 
<alex...@microsoft.com> wrote:
Using the sites sounds sensible, and probably better than using my suggestion 
of application deployment name as they are not subject to changing (and 
actually we do genuinely use siteA and siteB exclusively as the zone names,).

I want the siteA application to be notified of replication changes synced 
across from PUTs made on siteB, so I need the flip of what you are suggesting, 
i.e. create notification on siteA with filter "x-amz-metadata-site" == "siteB".

That's fine I think (and doesn't really complicate the scenario), but requires 
the application to now be aware of the zone it is, which isn't information it 
has without some plumbing (and doesn't seem like something an application would 
typically be aware of?). That's on top of the plumbing to add the metadata 
header into every place where I do an S3 PUT.

I still think I prefer using v1 and attempting to contribute an enhancement 
here to set a flag to disable multisite on a per notification basis, given the 
above; assuming you agree that this is a sensible enhancement.


From: Yuval Lifshitz <ylifs...@redhat.com>
Sent: Monday, September 16, 2024 6:15 PM
To: Alex Hussein-Kershaw (HE/HIM) 
<alex...@microsoft.com>
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

regarding the filter. i don't really follow.
on siteA create a notification (with id "notifA") with filter 
"x-amz-metadata-site" == "siteA" that point to topicA (reachable in site A)
and on siteB cr

[ceph-users] Re: [EXTERNAL] Re: Bucket Notifications v2 & Multisite Redundancy

2024-09-25 Thread Alex Hussein-Kershaw (HE/HIM)
I failed to reproduce this issue in Reef 18.2.4. So I think this is a Squid 
regression of the notification v1 functionality.

In Reef, all the notification and topic config remains on the site it was 
created on.

I raised: Bug #68227: rgw/notifications: notifications and topics appear on 
multisite even with notifications_v2 disabled - rgw - 
Ceph.


From: Alex Hussein-Kershaw (HE/HIM) 
Sent: Wednesday, September 25, 2024 10:42 AM
To: Yuval Lifshitz ; Ceph Users 
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

Sadly I think I've found another issue here that prevents my use case even with 
notifications_v2 disabled.

My repro scenario:

  *
Deployed a fresh multisite Ceph cluster at 19.1.1 (siteA is the master, siteB 
is non-master)
  *
Immediately disabled notifications_v2 and update/commit the period.
  *
Create an S3 user and bucket.

Then, the working parts:

  *
Create a topic on siteA, siteB doesn't show it with "topic list".
  *
Create a notification (for replication:* events) on siteA, siteB doesn't show 
it with "notification list".
  *
Write an object to the bucket on siteB. The object is replicated and an event 
shows up on the topic with "topic stats". All good so far.

...and the problematic part:

  *
Create a topic on siteB, siteB shows the single topic with "topic list", 
however siteA now shows both topics.
  *
Create a notification on siteB, again siteB shows the single notification with 
"notification list", however siteA shows both notifications.
  *
Write an object to the bucket on siteB. The object is replicated to siteA and 
an event is added to both topics.

So I think that is demonstrating the inability of notification config to remain 
single site on the latest Squid RC.
Given the conversation we had below I think this is a bug. Happy to raise a 
tracker.  Welcome any thoughts. I'll try to repro this on Reef shortly.

Thanks,
Alex


From: Alex Hussein-Kershaw (HE/HIM) 
Sent: Tuesday, September 17, 2024 11:02 AM
To: Yuval Lifshitz ; Ceph Users 
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

Thanks 🙂

I've raised:
Bug #68102: rgw: "radosgw-admin topic list" may contain duplicated data and 
redundant nesting - rgw - 
Ceph
Enhancement #68104: rgw: Add a "disable replication" flag to bucket 
notification configuration - rgw - Ceph

Also re-including the mailing list as it was dropped.


From: Yuval Lifshitz 
Sent: Tuesday, September 17, 2024 10:36 AM
To: Alex Hussein-Kershaw (HE/HIM) 
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

ok, got it.
not even sure we support attribute filtering for these types of events... i 
think you have to go with v1 for now.

would be great if you also submit a tracker for the v1 topic list

On Tue, Sep 17, 2024 at 12:24 PM Alex Hussein-Kershaw (HE/HIM) 
<alex...@microsoft.com> wrote:
Indeed, I am using the s3:Replication:Create event (which I think is equivalent 
to s3:ObjectSynced:Create). But this does not solve the problem of the event 
being added to both topics on the site that received the replication.


From: Yuval Lifshitz <ylifs...@redhat.com>
Sent: Tuesday, September 17, 2024 10:21 AM
To: Alex Hussein-Kershaw (HE/HIM) 
<alex...@microsoft.com>
Subject: Re: [EXTERNAL] Re: [ceph-users] Bucket Notifications v2 & Multisite 
Redundancy

if you want to get a notification when an object is synced, you should use a 
different type of notification. the fact that an object is uploaded to siteB 
does not mean it is immediately synced to siteA.
would recommend using s3:ObjectSynced:Create event type in this case. you will 
get this event only when an object is synced.

On Tue, Sep 17, 2024 at 12:05 PM Alex Hussein-Kershaw (HE/HIM) 
<alex...@microsoft.com> wrote:
Using the sites sounds sensible, and probably better than using my suggestion 
of application deployment name as they are not subject to changing (and 
actually we do genuinely use siteA and siteB exclusively as the zone names,).

I want the siteA application to be notified of replication changes synced 
across from PUTs made on siteB, so I need the flip of what you are suggesting, 
i.e. create notification on siteA with filter "x-amz-metadata-site" == "siteB".

That's fine I think (and doesn't really complicate the scenario), but requires 
the application to now be aware of the zone it is, which isn't information it 
has without some plumbing (and doesn't seem like something an application would 
typically be aware of?). That's on top of the plumbing to add the metadata 
header into every place where I do an S3 PUT.

I still think I prefer using v1 and attempting to

[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block
Hm, do you have any local ceph.conf on your client which has an  
override for this option as well? By the way, how do you bootstrap  
your cluster? Is it cephadm based?


Zitat von Florian Haas :


Hi Eugen,

I've just torn down and completely respun my cluster, on 17.2.7.

Recreated my CRUSH rule, set osd_pool_default_crush_rule to its rule_id, 1.

Created a new pool.

That new pool still has crush_rule 0, just as before and contrary to  
what you're seeing.


I'm a bit puzzled, because I'm out of ideas as to what could break  
on my cluster and work fine on yours, to cause this. Odd.


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Eugen Block
I redeployed a different single-node-cluster with quincy 17.2.6 and it  
works there as well.


Zitat von Eugen Block :

Hm, do you have any local ceph.conf on your client which has an  
override for this option as well? By the way, how do you bootstrap  
your cluster? Is it cephadm based?


Zitat von Florian Haas :


Hi Eugen,

I've just torn down and completely respun my cluster, on 17.2.7.

Recreated my CRUSH rule, set osd_pool_default_crush_rule to its rule_id, 1.

Created a new pool.

That new pool still has crush_rule 0, just as before and contrary  
to what you're seeing.


I'm a bit puzzled, because I'm out of ideas as to what could break  
on my cluster and work fine on yours, to cause this. Odd.


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Florian Haas

On 25/09/2024 15:21, Eugen Block wrote:

Hm, do you have any local ceph.conf on your client which has an
override for this option as well?


No.


By the way, how do you bootstrap your cluster? Is it cephadm based?


This one is bootstrapped (on Quincy) with ceph-ansible. And when the 
"ceph config set" change didn't make a difference, I did also make a 
point of cycling all my mons and osds (which shouldn't be necessary, but 
I figured I'd try that, just in case).
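
For what it's worth, this is roughly what I ran to rule out a stray 
override somewhere (the mon name is just an example):

# ceph config dump | grep osd_pool_default_crush_rule
# ceph daemon mon.<mon-id> config get osd_pool_default_crush_rule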


And I also confirmed this same issue, in Quincy, after the cluster was 
adopted into cephadm management. At that point, the behaviour was still 
unchanged.


It was only after I upgraded the cluster to Reef, with 
cephadm/ceph orch, that the problem went away.


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: osd_pool_default_crush_rule being ignored?

2024-09-25 Thread Florian Haas

Hi Eugen,

I've just torn down and completely respun my cluster, on 17.2.7.

Recreated my CRUSH rule, set osd_pool_default_crush_rule to its rule_id, 1.

Created a new pool.

That new pool still has crush_rule 0, just as before and contrary to 
what you're seeing.


I'm a bit puzzled, because I'm out of ideas as to what could break on my 
cluster and work fine on yours, to cause this. Odd.


Cheers,
Florian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph orchestrator not refreshing device list

2024-09-25 Thread Bob Gibson
Hi,

We recently converted a legacy cluster running Quincy v17.2.7 to cephadm. The 
conversion went smoothly and left all osds unmanaged by the orchestrator as 
expected. We’re now in the process of converting the osds to be managed by the 
orchestrator. We successfully converted a few of them, but then the 
orchestrator somehow got confused. `ceph health detail` reports a “stray 
daemon” for the osd we’re trying to convert, and the orchestrator is unable to 
refresh its device list so it doesn’t see any available devices.

From the perspective of the osd node, the osd has been wiped and is ready to be 
reinstalled. We’ve also rebooted the node for good measure. `ceph osd tree` 
shows that the osd has been destroyed, but the orchestrator won’t reinstall it 
because it thinks the device is still active. The orchestrator device 
information is stale, but we’re unable to refresh it. The usual recommended 
workaround of failing over the mgr hasn’t helped. We’ve also tried `ceph orch 
device ls --refresh` to no avail. In fact after running that command subsequent 
runs of `ceph orch device ls` produce no output until the mgr is failed over 
again.

Is there a way to force the orchestrator to refresh its list of devices when in 
this state? If not, can anyone offer any suggestions on how to fix this problem?

Cheers,
/rjg

P.S. Some additional information in case it’s helpful...

We’re using the following command to replace existing devices so that they’re 
managed by the orchestrator:

```
ceph orch osd rm <osd_id> --replace --zap
```

and we’re currently stuck on osd 88.

```
ceph health detail
HEALTH_WARN 1 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
stray daemon osd.88 on host ceph-osd31 not managed by cephadm
```

`ceph osd tree` shows that the osd has been destroyed and is ready to be 
replaced:

```
ceph osd tree-from ceph-osd31
ID   CLASS  WEIGHT    TYPE NAME        STATUS     REWEIGHT  PRI-AFF
-46         34.93088  host ceph-osd31
 84    ssd   3.49309      osd.84       up              1.0  1.0
 85    ssd   3.49309      osd.85       up              1.0  1.0
 86    ssd   3.49309      osd.86       up              1.0  1.0
 87    ssd   3.49309      osd.87       up              1.0  1.0
 88    ssd   3.49309      osd.88       destroyed         0  1.0
 89    ssd   3.49309      osd.89       up              1.0  1.0
 90    ssd   3.49309      osd.90       up              1.0  1.0
 91    ssd   3.49309      osd.91       up              1.0  1.0
 92    ssd   3.49309      osd.92       up              1.0  1.0
 93    ssd   3.49309      osd.93       up              1.0  1.0
```

The cephadm log shows a claim on node `ceph-osd31` for that osd:

```
2024-09-25T14:15:45.699348-0400 mgr.ceph-mon3.qzjgws [INF] Found osd claims -> 
{'ceph-osd31': ['88']}
2024-09-25T14:15:45.699534-0400 mgr.ceph-mon3.qzjgws [INF] Found osd claims for 
drivegroup ceph-osd31 -> {'ceph-osd31': ['88']}
```

`ceph orch device ls` shows that the device list isn’t refreshing:

```
ceph orch device ls ceph-osd31
HOST        PATH      TYPE  DEVICE ID                               SIZE   AVAILABLE  REFRESHED  REJECT REASONS
ceph-osd31  /dev/sdc  ssd   INTEL_SSDSC2KG038T8_PHYG039603PE3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdd  ssd   INTEL_SSDSC2KG038T8_PHYG039600AY3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sde  ssd   INTEL_SSDSC2KG038T8_PHYG039600CW3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdf  ssd   INTEL_SSDSC2KG038T8_PHYG039600CM3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdg  ssd   INTEL_SSDSC2KG038T8_PHYG039600UB3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdh  ssd   INTEL_SSDSC2KG038T8_PHYG039603753P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdi  ssd   INTEL_SSDSC2KG038T8_PHYG039603R63P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdj  ssd   INTEL_SSDSC2KG038TZ_PHYJ4011032M3P8DGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdk  ssd   INTEL_SSDSC2KG038TZ_PHYJ3234010J3P8DGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
ceph-osd31  /dev/sdl  ssd   INTEL_SSDSC2KG038T8_PHYG039603NS3P8EGN  3576G  No         22h ago    Insufficient space (<10 extents) on vgs, LVM detected, locked
```
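
In case it's relevant, these are the commands we've been using to cross-check 
what cephadm thinks is running on that host (the second one is run directly on 
the osd node):

```
ceph orch ps ceph-osd31
cephadm ls
```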

`ceph node ls` thinks the osd still exists

```
ceph node ls osd | jq -r '."ceph-osd31"'
[
  84,
  85,
  86,
  87,
  88, <— this shouldn’t exist
  89,
  90,
  9