[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process
Hi, I would like to second Nico's comment. What happened to the idea that a deployment tool should be idempotent? The most natural option would be:

1) start install -> something fails
2) fix problem
3) repeat the exact same deploy command -> deployment picks up at the current state (including cleaning up failed state markers) and tries to continue until the next issue (go to 2)

I'm not sure (meaning: it's a terrible idea) if it's a good idea to provide a single command to wipe a cluster. Just for the fat-finger syndrome. This seems safe only if it were possible to mark a cluster as production somehow (must be sticky, that is, cannot be unset), which prevents a cluster destroy command (or any too-dangerous command) from executing. I understand the test case in the tracker, but having such test-case utils that can run on a production cluster and destroy everything seems a bit dangerous.

I think destroying a cluster should be a manual and tedious process, and figuring out how to do it should be part of the learning experience. So my answer to "how do I start over" would be "go figure it out, it's an important lesson".

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Nico Schottelius
Sent: Friday, May 26, 2023 10:40 PM
To: Redouane Kachach
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

Hello Redouane,

much appreciated kick-off for improving cephadm. I was wondering why cephadm does not use a similar approach to rook in the sense of "repeat until it is fixed"?

For the background, rook uses a controller that checks the state of the cluster, the state of monitors, whether there are disks to be added, etc. It periodically restarts the checks and when needed shifts monitors, creates OSDs, etc.

My question is, why not have a daemon or checker subcommand of cephadm that a) checks what the current cluster status is (i.e. cephadm verify-cluster) and b) fixes the situation (i.e. cephadm verify-and-fix-cluster)?

I think that option would be much more beneficial than the other two suggested ones.

Best regards,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process
Hey Frank, in regards to destroying a cluster, I'd suggest to reuse the old --yes-i-really-mean-it parameter, as it is already in use by ceph osd destroy [0]. Then it doesn't matter whether it's prod or not, if you really mean it ... ;-) Best regards, Nico [0] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/ Frank Schilder writes: > Hi, I would like to second Nico's comment. What happened to the idea that a > deployment tool should be idempotent? The most natural option would be: > > 1) start install -> something fails > 2) fix problem > 3) repeat exact same deploy command -> deployment picks up at current state > (including cleaning up failed state markers) and tries to continue until next > issue (go to 2) > > I'm not sure (meaning: its a terrible idea) if its a good idea to > provide a single command to wipe a cluster. Just for the fat finger > syndrome. This seems safe only if it would be possible to mark a > cluster as production somehow (must be sticky, that is, cannot be > unset), which prevents a cluster destroy command (or any too dangerous > command) from executing. I understand the test case in the tracker, > but having such test-case utils that can run on a production cluster > and destroy everything seems a bit dangerous. > > I think destroying a cluster should be a manual and tedious process > and figuring out how to do it should be part of the learning > experience. So my answer to "how do I start over" would be "go figure > it out, its an important lesson". > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Nico Schottelius > Sent: Friday, May 26, 2023 10:40 PM > To: Redouane Kachach > Cc: ceph-users@ceph.io > Subject: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap > process > > > Hello Redouane, > > much appreciated kick-off for improving cephadm. I was wondering why > cephadm does not use a similar approach to rook in the sense of "repeat > until it is fixed?" > > For the background, rook uses a controller that checks the state of the > cluster, the state of monitors, whether there are disks to be added, > etc. It periodically restarts the checks and when needed shifts > monitors, creates OSDs, etc. > > My question is, why not have a daemon or checker subcommand of cephadm > that a) checks what the current cluster status is (i.e. cephadm > verify-cluster) and b) fixes the situation (i.e. cephadm > verify-and-fix-cluster)? > > I think that option would be much more beneficial than the other two > suggested ones. > > Best regards, > > Nico -- Sustainable and modern Infrastructures by ungleich.ch ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process
What I have in mind is the case where the command is already in the shell history. A wrong history reference can execute a command with "--yes-i-really-mean-it" even though you really don't mean it. Been there. For an OSD this is maybe tolerable, but for an entire cluster ... not really.

Some things need to be hard, to limit the blast radius of a typo (or attacker). For example, when issuing such a command the first time, the cluster could print a nonce that needs to be included in the command to make it happen, and which is only valid once for this exact command, so one actually needs to type something new every time to destroy stuff. An exception could be if a "safe-to-destroy" query for any daemon (pool etc.) returns true.

I would still not allow an entire cluster to be wiped with a single command. In a single step, only allow destroying what could be recovered in some way (there has to be some form of undo). And there should be notifications to all admins about what is going on, to be able to catch malicious execution of destructive commands.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Nico Schottelius
Sent: Tuesday, May 30, 2023 10:51 AM
To: Frank Schilder
Cc: Nico Schottelius; Redouane Kachach; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

Hey Frank,

in regards to destroying a cluster, I'd suggest to reuse the old --yes-i-really-mean-it parameter, as it is already in use by ceph osd destroy [0]. Then it doesn't matter whether it's prod or not, if you really mean it ... ;-)

Best regards,

Nico

[0] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/

Frank Schilder writes:
> [...]

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
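For readers who want to picture the nonce idea, here is a rough, self-contained bash sketch of a confirmation wrapper. It is purely illustrative - nothing like this exists in ceph or cephadm today, and the wrapper itself is hypothetical:

#!/usr/bin/env bash
# Illustrative sketch only: generate a one-time confirmation code bound to the
# exact command line and refuse to run unless the operator types it back.
# Not a ceph feature; just a mock-up of the idea described above.
set -euo pipefail

cmd="$*"                                            # the destructive command to guard
nonce=$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n') # fresh 8-hex-digit code

echo "About to run: ${cmd}"
echo "Type the one-time code ${nonce} to confirm:"
read -r reply

if [[ "${reply}" != "${nonce}" ]]; then
    echo "Confirmation code mismatch - aborting." >&2
    exit 1
fi

"$@"    # only runs after a confirmation that cannot come from shell history

In Frank's proposal the cluster itself would print the nonce on the first invocation and require it on a second one, so even a recalled history line could never complete a destructive action on its own.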
[ceph-users] Re: Important: RGW multisite bug may silently corrupt encrypted objects on replication
Hello Casey,

Thanks for the information!

Can you please confirm that this is only an issue when using the "rgw_crypt_default_encryption_key" config opt (which the documentation [1] marks as "testing only") to enable encryption, and not when using Barbican or Vault as a KMS, or when using SSE-C with the S3 API?

[1] https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only

> On 26 May 2023, at 22:45, Casey Bodley wrote:
>
> Our downstream QE team recently observed an md5 mismatch of replicated
> objects when testing rgw's server-side encryption in multisite. This
> corruption is specific to s3 multipart uploads, and only affects the
> replicated copy - the original object remains intact. The bug likely
> affects Ceph releases all the way back to Luminous where server-side
> encryption was first introduced.
>
> To expand on the cause of this corruption: Encryption of multipart
> uploads requires special handling around the part boundaries, because
> each part is uploaded and encrypted separately. In multisite, objects
> are replicated in their encrypted form, and multipart uploads are
> replicated as a single part. As a result, the replicated copy loses
> its knowledge about the original part boundaries required to decrypt
> the data correctly.
>
> We don't have a fix yet, but we're tracking it in
> https://tracker.ceph.com/issues/46062. The fix will only modify the
> replication logic, so won't repair any objects that have already
> replicated incorrectly. We'll need to develop a radosgw-admin command
> to search for affected objects and reschedule their replication.
>
> In the meantime, I can only advise multisite users to avoid using
> encryption for multipart uploads. If you'd like to scan your cluster
> for existing encrypted multipart uploads, you can identify them with a
> s3 HeadObject request. The response would include a
> x-amz-server-side-encryption header, and the ETag header value (with
> the quote characters removed) would be longer than 32 characters
> (multipart ETags are in the special form "<hash>-<part count>"). Take
> care not to delete the corrupted replicas, because an active-active
> multisite configuration would go on to delete the original copy.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] slow mds requests with random read test
Hi,

We are running a couple of performance tests on CephFS using fio. fio runs in a k8s pod, and 3 pods are up and running, all mounting the same PVC backed by a CephFS volume. Here is the command line for random read:

fio -direct=1 -iodepth=128 -rw=randread -ioengine=libaio -bs=4k -size=1G -numjobs=5 -runtime=500 -group_reporting -directory=/tmp/cache -name=Rand_Read_Testing_$BUILD_TIMESTAMP

The random read test performs very slowly. Here is the cluster log from the dashboard:

5/30/23 8:13:16 PM [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
5/30/23 8:13:16 PM [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
5/30/23 8:13:16 PM [INF] MDS health message cleared (mds.?): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 33 secs
5/30/23 8:13:16 PM [INF] MDS health message cleared (mds.?): 1 slow requests are blocked > 30 secs
5/30/23 8:13:14 PM [WRN] Health check update: 2 MDSs report slow requests (MDS_SLOW_REQUEST)
5/30/23 8:13:13 PM [INF] MDS health message cleared (mds.?): 1 slow requests are blocked > 30 secs
5/30/23 8:13:08 PM [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
5/30/23 8:13:08 PM [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
5/30/23 8:13:08 PM [WRN] slow request 34.213327 seconds old, received at 2023-05-30T12:12:33.951399+: client_request(client.270564:1406144 getattr pAsLsXsFs #0x70103d0 2023-05-30T12:12:33.947323+ caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting
5/30/23 8:13:08 PM [WRN] 1 slow requests, 1 included below; oldest blocked for > 34.213328 secs
5/30/23 8:13:07 PM [WRN] slow request 33.169703 seconds old, received at 2023-05-30T12:12:33.952078+: peer_request:client.270564:1406144 currently dispatched
5/30/23 8:13:07 PM [WRN] 1 slow requests, 1 included below; oldest blocked for > 33.169704 secs
5/30/23 8:13:04 PM [INF] Cluster is now healthy
5/30/23 8:13:04 PM [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
5/30/23 8:13:04 PM [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
5/30/23 8:13:04 PM [INF] MDS health message cleared (mds.?): 9 slow metadata IOs are blocked > 30 secs, oldest blocked for 45 secs
5/30/23 8:13:04 PM [INF] MDS health message cleared (mds.?): 2 slow requests are blocked > 30 secs
5/30/23 8:12:57 PM [WRN] 2 slow requests, 0 included below; oldest blocked for > 44.954377 secs
5/30/23 8:12:52 PM [WRN] 2 slow requests, 0 included below; oldest blocked for > 39.954313 secs
5/30/23 8:12:48 PM [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
5/30/23 8:12:47 PM [WRN] slow request 34.935921 seconds old, received at 2023-05-30T12:12:12.185614+: client_request(client.270564:1406139 create #0x701045b/atomic7966567911433736706tmp 2023-05-30T12:12:12.182999+ caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
5/30/23 8:12:47 PM [WRN] slow request 34.954254 seconds old, received at 2023-05-30T12:12:12.167281+: client_request(client.270564:1406138 rename #0x7010457/build.xml #0x7010457/atomic6590865221269854506tmp 2023-05-30T12:12:12.162999+ caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
5/30/23 8:12:47 PM [WRN] 2 slow requests, 2 included below; oldest blocked for > 34.954254 secs
5/30/23 8:12:44 PM [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
5/30/23 8:12:41 PM [INF] Cluster is now healthy
5/30/23 8:12:41 PM [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
5/30/23 8:12:41 PM [INF] MDS health message cleared (mds.?): 1 slow requests are blocked > 30 secs
5/30/23 8:12:40 PM [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
5/30/23 8:12:40 PM [INF] MDS health message cleared (mds.?): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 38 secs

However, the random write test performs very well. Any suggestions on the problem?

Thanks,
Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
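A useful companion to dashboard logs like the above: the MDS can report exactly which operations are stuck. A minimal sketch, where "mds.<name>" is a placeholder for the active MDS id shown by "ceph fs status":

ceph health detail                          # summarizes MDS_SLOW_REQUEST / MDS_SLOW_METADATA_IO
ceph daemon mds.<name> dump_ops_in_flight   # run on the host where that MDS daemon/container lives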
[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process
+1

Michel

Le 30/05/2023 à 11:23, Frank Schilder a écrit :
> [...]
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Important: RGW multisite bug may silently corrupt encrypted objects on replication
On Tue, May 30, 2023 at 8:22 AM Tobias Urdin wrote: > > Hello Casey, > > Thanks for the information! > > Can you please confirm that this is only an issue when using > “rgw_crypt_default_encryption_key” > config opt that says “testing only” in the documentation [1] to enable > encryption and not when using > Barbican or Vault as KMS or using SSE-C with the S3 API? unfortunately, all flavors of server-side encryption (SSE-C, SSE-KMS, SSE-S3, and rgw_crypt_default_encryption_key) are affected by this bug, as they share the same encryption logic. the main difference is where they get the key > > [1] > https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only > > > On 26 May 2023, at 22:45, Casey Bodley wrote: > > > > Our downstream QE team recently observed an md5 mismatch of replicated > > objects when testing rgw's server-side encryption in multisite. This > > corruption is specific to s3 multipart uploads, and only affects the > > replicated copy - the original object remains intact. The bug likely > > affects Ceph releases all the way back to Luminous where server-side > > encryption was first introduced. > > > > To expand on the cause of this corruption: Encryption of multipart > > uploads requires special handling around the part boundaries, because > > each part is uploaded and encrypted separately. In multisite, objects > > are replicated in their encrypted form, and multipart uploads are > > replicated as a single part. As a result, the replicated copy loses > > its knowledge about the original part boundaries required to decrypt > > the data correctly. > > > > We don't have a fix yet, but we're tracking it in > > https://tracker.ceph.com/issues/46062. The fix will only modify the > > replication logic, so won't repair any objects that have already > > replicated incorrectly. We'll need to develop a radosgw-admin command > > to search for affected objects and reschedule their replication. > > > > In the meantime, I can only advise multisite users to avoid using > > encryption for multipart uploads. If you'd like to scan your cluster > > for existing encrypted multipart uploads, you can identify them with a > > s3 HeadObject request. The response would include a > > x-amz-server-side-encryption header, and the ETag header value (with > > "s removed) would be longer than 32 characters (multipart ETags are in > > the special form "-"). Take care not to delete the > > corrupted replicas, because an active-active multisite configuration > > would go on to delete the original copy. > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
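For anyone who wants to act on Casey's advice, here is a rough sketch of such a scan for one bucket using the AWS CLI. It is only an illustration of the HeadObject criteria above - BUCKET and ENDPOINT are placeholders, SSE-C objects additionally need the customer key headers on the HEAD request, and large buckets would want something more efficient than one HEAD per object:

BUCKET=my-bucket                    # placeholder
ENDPOINT=https://rgw.example.com    # placeholder for your RGW endpoint

# List every key, HEAD it, and print the key if it is encrypted
# (ServerSideEncryption is set) and multipart (the ETag contains "-").
aws --endpoint-url "$ENDPOINT" s3api list-objects-v2 --bucket "$BUCKET" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
    aws --endpoint-url "$ENDPOINT" s3api head-object \
        --bucket "$BUCKET" --key "$key" \
        --query '[ServerSideEncryption, ETag]' --output text |
    awk -v k="$key" '$1 != "None" && $2 ~ /-/ { print k }'
done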
[ceph-users] RBD image mirroring doubt
Hello guys, What would happen if we set up an RBD mirroring configuration, and in the target system (the system where the RBD image is mirrored) we create snapshots of this image? Would that cause some problems? Also, what happens if we delete the source RBD image? Would that trigger a deletion in the target system RBD image as well? Thanks in advance! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Custom CRUSH maps HOWTO?
Hi folks!

I have a Ceph production 17.2.6 cluster with 6 machines in it - four newer, faster machines with 4x3.84TB NVME drives each, and two with 24x1.68TB SAS disks each.

I know I should have done something smart with the CRUSH maps for this up front, but until now I have shied away from CRUSH maps as they sound really complex.

Right now my cluster's performance, especially write performance, is not what it needs to be, and I am looking for advice:

1. How should I be structuring my crush map, and why?

2. How does one actually edit and manage a CRUSH map? What /commands/ does one use? This isn't clear at all in the documentation. Are there any GUI tools out there for managing CRUSH?

3. Is this going to impact production performance or availability while I'm configuring it? I have tens of thousands of users relying on this thing, so I can't take any risks.

Thanks in advance!

--

Regards,

Thorne Lawler - Senior System Administrator
*DDNS* | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170

_DDNS /_*Please note:* The information contained in this email message and any attached files may be confidential information, and may also be the subject of legal professional privilege. _If you are not the intended recipient any use, disclosure or copying of this email is unauthorised. _If you received this email in error, please notify Discount Domain Name Services Pty Ltd on 03 9815 6868 to report this matter and delete all copies of this transmission together with any attachments. /
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: CEPH Version choice
Hi Marc, I uploaded all scripts and a rudimentary readme to https://github.com/frans42/cephfs-bench . I hope it is sufficient to get started. I'm afraid its very much tailored to our deployment and I can't make it fully configurable anytime soon. I hope it serves a purpose though - at least I discovered a few bugs with it. We actually kept the benchmark running through an upgrade from mimic to octopus. Was quite interesting to see how certain performance properties change with that. This benchmark makes it possible to compare versions with live timings coming in. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Marc Sent: Monday, May 15, 2023 11:28 PM To: Frank Schilder Subject: RE: [ceph-users] Re: CEPH Version choice > I planned to put it on-line. The hold-back is that the main test is un- > taring a nasty archive and this archive might contain personal > information, so I can't just upload it as is. I can try to put together > a similar archive from public sources. Please give me a bit of time. I'm > also a bit under stress right now with our users being hit by an FS meta > data corruption. That's also why I'm a bit trigger happy. > Ok thanks, very nice, no hurry!!! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [EXTERNAL] Custom CRUSH maps HOWTO?
I'm going to start by assuming your pool(s) are deployed with the default 3 replicas and a min_size of 2.

The quickest and safest thing you can do to potentially realize some improvement is to set the primary-affinity for all of your HDD-based OSDs to zero.
https://docs.ceph.com/en/quincy/rados/operations/crush-map/#primary-affinity

Something like:

for osd in $(ceph osd ls-tree SAS-NODE1); do ceph osd primary-affinity $osd 0.0; done

And of course repeat that for the other node. That will have low impact on your users, as ceph will start prioritizing reads from the fast NVMEs, and the slow ones will only have to do writes. However, ceph may already be doing that, and if your SAS-based hosts do not have fast disks for the block DB and WAL (write-ahead log), any time 2 (or more) SAS disks are involved in a PG, your writes will still be as slow as the fastest HDD.

It is best when ceph has identical size and performance OSDs. When you're going to mix very fast disks with relatively slow disks, the next best thing is to have twice as much fast storage as slow. If you have enough capacity available such that the total data STORED (add up from ceph df) is < 3.84*4*2*0.7 = ~21.5TB, I'd suggest creating rack buckets in your crush map, so there are 3 racks, each with 2 hosts, so that each PG will only have one slow disk. The down side to that is, you are basically abandoning ~50TB of HDD capacity, your effective maximum RAW capacity ends up only ~92TB, and you'll start getting near-full warnings between 75 and 80TB RAW, or around 25-27TB stored.

The process for setting that up would be adding 3 rack buckets, and then moving the host buckets into the rack buckets:
https://docs.ceph.com/en/quincy/rados/operations/crush-map/#add-a-bucket

That will cause a lot of data movement, so you should try to do it at a time when client i/o is expected to be low. Ceph will do its best to limit the impact to client i/o caused by this backfill, but if your writes are already poor, they'll definitely be worse during the movement.

If that capacity is going to be an issue, the recommended fixes get more complicated and risky. However, the best thing you can do, even if you do add the suggested racks to your crush map, would be to get 2 NVMEs (or SSDs) for each of your SAS hosts to serve as db_devices for the HDDs. You'll have to remove and recreate those OSDs, but you can do them in smaller batches.
https://docs.ceph.com/en/quincy/cephadm/services/osd/#creating-new-osds

There is a GUI ceph dashboard available.
https://docs.ceph.com/en/quincy/mgr/dashboard/
It is very limited in the changes that can be made, and these types of crush map changes are definitely not for the dashboard. But it may help you get a useful view of the state of your cluster.

Best of luck,
Josh Beaman

From: Thorne Lawler
Date: Tuesday, May 30, 2023 at 9:52 AM
To: ceph-users@ceph.io
Subject: [EXTERNAL] [ceph-users] Custom CRUSH maps HOWTO?

Hi folks!

I have a Ceph production 17.2.6 cluster with 6 machines in it - four newer, faster machines with 4x3.84TB NVME drives each, and two with 24x1.68TB SAS disks each.

I know I should have done something smart with the CRUSH maps for this up front, but until now I have shied away from CRUSH maps as they sound really complex.

Right now my cluster's performance, especially write performance, is not what it needs to be, and I am looking for advice:

1. How should I be structuring my crush map, and why?

2. How does one actually edit and manage a CRUSH map? What /commands/ does one use?
This isn't clear at all in the documentation. Are there any GUI tools out there for managing CRUSH? 3. Is this going to impact production performance or availability while I'm configuring it? I have tens of thousands of users relying on this thing, so I can't take any risks. Thanks in advance! -- Regards, Thorne Lawler - Senior System Administrator *DDNS* | ABN 76 088 607 265 First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172 P +61 499 449 170 _DDNS /_*Please note:* The information contained in this email message and any attached files may be confidential information, and may also be the subject of legal professional privilege. _If you are not the intended recipient any use, disclosure or copying of this email is unauthorised. _If you received this email in error, please notify Discount Domain Name Services Pty Ltd on 03 9815 6868 to report this matter and delete all copies of this transmission together with any attachments. / ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
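For reference, the rack layout Josh describes can be built entirely from the CLI. A sketch under the assumption that the host and pool names below are placeholders for your own, with the final rule change being what actually spreads replicas across the racks:

ceph osd crush add-bucket rack1 rack
ceph osd crush add-bucket rack2 rack
ceph osd crush add-bucket rack3 rack
ceph osd crush move rack1 root=default
ceph osd crush move rack2 root=default
ceph osd crush move rack3 root=default
ceph osd crush move nvme-host1 rack=rack1    # pair each SAS host with an NVME host
ceph osd crush move sas-host1 rack=rack1
ceph osd crush move nvme-host2 rack=rack2
ceph osd crush move sas-host2 rack=rack2
ceph osd crush move nvme-host3 rack=rack3
ceph osd crush move nvme-host4 rack=rack3
ceph osd crush rule create-replicated replicated_rack default rack
ceph osd pool set <pool> crush_rule replicated_rack

As noted above, re-pointing pools at the new rule will trigger significant backfill, so schedule it for a quiet period.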
[ceph-users] reef v18.1.0 QE Validation status
Details of this release are summarized here:

https://tracker.ceph.com/issues/61515#note-1
Release Notes - TBD

Seeking approvals/reviews for:

rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to merge https://github.com/ceph/ceph/pull/51788 for the core)
rgw - Casey
fs - Venky
orch - Adam King
rbd - Ilya
krbd - Ilya
upgrade/octopus-x - deprecated
upgrade/pacific-x - known issues, Ilya, Laura?
upgrade/reef-p2p - N/A
clients upgrades - not run yet
powercycle - Brad
ceph-volume - in progress

Please reply to this email with approval and/or trackers of known issues/PRs to address them.

gibba upgrade was done and will need to be done again this week.
LRC upgrade TBD

TIA
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: slow mds requests with random read test
On Tue, May 30, 2023 at 8:42 AM Ben wrote:
>
> Hi,
>
> We are performing couple performance tests on CephFS using fio. fio is run
> in k8s pod and 3 pods will be up running mounting the same pvc to CephFS
> volume. Here is command line for random read:
> fio -direct=1 -iodepth=128 -rw=randread -ioengine=libaio -bs=4k -size=1G
> -numjobs=5 -runtime=500 -group_reporting -directory=/tmp/cache
> -name=Rand_Read_Testing_$BUILD_TIMESTAMP
> The random read is performed very slow. Here is the cluster log from
> dashboard:
> [...]
> Any suggestions on the problem?

Your random read workload is too extreme for your cluster of OSDs. It's causing slow metadata ops for the MDS. To resolve this we would normally suggest allocating a set of OSDs on SSDs for use by the CephFS metadata pool to isolate the workloads.

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
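A minimal sketch of that suggestion, assuming the fast OSDs report device class "ssd" and the metadata pool is named "cephfs_metadata" (adjust both to your cluster):

ceph osd crush rule create-replicated cephfs-meta-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule cephfs-meta-ssd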
[ceph-users] Re: reef v18.1.0 QE Validation status
On Tue, May 30, 2023 at 6:54 PM Yuri Weinstein wrote: > > Details of this release are summarized here: > > https://tracker.ceph.com/issues/61515#note-1 > Release Notes - TBD > > Seeking approvals/reviews for: > > rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to > merge https://github.com/ceph/ceph/pull/51788 for > the core) > rgw - Casey > fs - Venky > orch - Adam King > rbd - Ilya > krbd - Ilya rbd and krbd approved. Thanks, Ilya ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Custom CRUSH maps HOWTO?
What kind of pool are you using, or do you have different pools for different purposes... Do you have cephfs or rbd only pools etc... describe your setup. It is generally best practice to create new rules and apply them to pools and not to modify existing pools, but that is possible as well. Below is one relatively simple thing to do but it is just a proposal and it may not fit your needs so take it with CAUTION!!! If i did math right you have roughly 51TB SAS and 61TB NVMe, easiest thing to do is what you can do even from webgui create new crush map for replicated or EC pool depending which one you're currently using, set failure domain to HOST, and set device class to NVMe, than repeat the process for HDD only pool. After that you can apply new crush configuration to the existing pool, doing so will cause a lot of data movement which may be short or long depending on your network and hard drive speeds, also depending on your client needs if the cluster is usually under heavy load then clients will definitely notice this action. So doing it that way you would have two sets of disks to be used for different purposes one for fast storage and one for slow storage. Anyway doing any action of this sort I'd test it in at least VM environment if you dont have some test cluster to run it on previously. However if your need is to have large chunky pool there are certain configurations to tell to cluster to place 1 or two replicas on fast drives and remaining replicas on other device type, but don't take this for granted i'm not 100% sure, as far as i know Ceph waits for confirmation of all drives to finish writing process to acknowledge to client that file/object is stored, so i'm not sure that you would benefit from setup like that. Kind regards, Nino On Tue, May 30, 2023 at 4:53 PM Thorne Lawler wrote: > Hi folks! > > I have a Ceph production 17.2.6 cluster with 6 machines in it - four > newer, faster machines with 4x3.84TB NVME drives each, and two with > 24x1.68TB SAS disks each. > > I know I should have done something smart with the CRUSH maps for this > up front, but until now I have shied away from CRUSH maps as they sound > really complex. > > Right now my cluster's performance, especially write performance, is not > what it needs to be, and I am looking for advice: > > 1. How should I be structuring my crush map, and why? > > 2. How does one actually edit and manage a CRUSH map? What /commands/ > does one use? This isn't clear at all in the documentation. Are there > any GUI tools out there for managing CRUSH? > > 3. Is this going to impact production performance or availability while > I'm configuring it? I have tens of thousands of users relying on this > thing, so I can't take any risks. > > Thanks in advance! > > -- > > Regards, > > Thorne Lawler - Senior System Administrator > *DDNS* | ABN 76 088 607 265 > First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172 > P +61 499 449 170 > > _DDNS > > /_*Please note:* The information contained in this email message and any > attached files may be confidential information, and may also be the > subject of legal professional privilege. _If you are not the intended > recipient any use, disclosure or copying of this email is unauthorised. > _If you received this email in error, please notify Discount Domain Name > Services Pty Ltd on 03 9815 6868 to report this matter and delete all > copies of this transmission together with any attachments. 
/ > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
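The CLI equivalent of the per-device-class rules Nino describes would look roughly like this; class names must match what "ceph osd tree" reports (e.g. nvme and hdd), and the pool names are placeholders:

ceph osd crush rule create-replicated fast-nvme default host nvme
ceph osd crush rule create-replicated slow-hdd default host hdd

# Re-pointing an existing pool at a new rule triggers the data movement Nino warns about:
ceph osd pool set fast-pool crush_rule fast-nvme
ceph osd pool set slow-pool crush_rule slow-hdd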
[ceph-users] Re: BlueStore fragmentation woes
Ok, I restarted it May 25th, ~11:30, let it run over the long weekend and just checked on it. Data attached.

May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-21T18:24:34.040+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) allocation stats probe 107: cnt: 17991 frags: 17991 size: 32016760832
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-21T18:24:34.040+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -1: 20267, 20267, 39482425344
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-21T18:24:34.040+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -3: 19737, 19737, 37299027968
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-21T18:24:34.040+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -7: 18498, 18498, 32395558912
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-21T18:24:34.040+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -11: 20373, 20373, 35302801408
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-21T18:24:34.040+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -27: 19072, 19072, 33645854720
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-22T18:24:34.057+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) allocation stats probe 108: cnt: 24594 frags: 24594 size: 56951898112
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-22T18:24:34.057+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -1: 17991, 17991, 32016760832
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-22T18:24:34.057+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -2: 20267, 20267, 39482425344
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-22T18:24:34.057+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -4: 19737, 19737, 37299027968
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-22T18:24:34.057+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -12: 20373, 20373, 35302801408
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-22T18:24:34.057+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -28: 19072, 19072, 33645854720
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-23T18:24:34.095+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) allocation stats probe 109: cnt: 24503 frags: 24503 size: 58141900800
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-23T18:24:34.095+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -1: 24594, 24594, 56951898112
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-23T18:24:34.095+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -3: 20267, 20267, 39482425344
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-23T18:24:34.095+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -5: 19737, 19737, 37299027968
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-23T18:24:34.095+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -13: 20373, 20373, 35302801408
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-23T18:24:34.095+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -29: 19072, 19072, 33645854720
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) allocation stats probe 110: cnt: 27637 frags: 27637 size: 63777406976
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -1: 24503, 24503, 58141900800
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -2: 24594, 24594, 56951898112
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+ 7f53603fc700 0 bluestore(/var/lib/ceph/osd/ceph-183) probe -6: 19737, 19737, 37299027968
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: debug 2023-05-24T18:24:34.105+ 7f53603fc700 0 bluestore(/var/
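A quick way to read those probes, for anyone following along: each "allocation stats probe" line carries an allocation count, a fragment count and the total bytes allocated, so a small awk filter (assuming the log text above is saved to probes.log, a hypothetical file name) can print the average fragment size per probe:

grep 'allocation stats probe' probes.log |
awk '{ for (i = 1; i <= NF; i++) {
         if ($i == "cnt:")   cnt   = $(i+1)
         if ($i == "frags:") frags = $(i+1)
         if ($i == "size:")  size  = $(i+1)
       }
       if (frags > 0) printf "probe avg fragment size: %.0f bytes\n", size / frags }'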
[ceph-users] Re: reef v18.1.0 QE Validation status
Rook daily CI is passing against the image quay.io/ceph/daemon-base:latest-main-devel, which means the Reef release is looking good from Rook's perspective: With the Reef release we need to have the tags soon: quay.io/ceph/daemon-base:latest-reef-devel quay.io/ceph/ceph:v18 Guillaume, will these happen automatically, or do we need some work done in ceph-container? Thanks, Travis On Tue, May 30, 2023 at 10:54 AM Yuri Weinstein wrote: > Details of this release are summarized here: > > https://tracker.ceph.com/issues/61515#note-1 > Release Notes - TBD > > Seeking approvals/reviews for: > > rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to > merge https://github.com/ceph/ceph/pull/51788 for > the core) > rgw - Casey > fs - Venky > orch - Adam King > rbd - Ilya > krbd - Ilya > upgrade/octopus-x - deprecated > upgrade/pacific-x - known issues, Ilya, Laura? > upgrade/reef-p2p - N/A > clients upgrades - not run yet > powercycle - Brad > ceph-volume - in progress > > Please reply to this email with approval and/or trackers of known > issues/PRs to address them. > > gibba upgrade was done and will need to be done again this week. > LRC upgrade TBD > > TIA > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Ceph client version vs server version inter-operability
Hi,

We are running a ceph cluster that is currently on Luminous. At this point most of our clients are also Luminous, but as we provision new client hosts we are using client versions that are more recent (e.g. Octopus, Pacific and, more recently, Quincy).

Is this safe? Is there a known list of which client versions are compatible with which server version?

We are only using RBD and are specifying rbd_default_features (the same) on all server and client hosts.

regards

Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
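One quick data point that may help with questions like this: the cluster can report which release/feature level each connected client advertises, so you can see what is actually talking to it before and after upgrading client hosts. For example:

ceph features    # connected clients and daemons grouped by release and feature bits
ceph versions    # versions of the cluster daemons themselves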
[ceph-users] Re: Custom CRUSH maps HOWTO?
Thanks to Anthony D'Atri, Joshua Beaman and Nino Kotur.

TL;DR - I need to ditch the spinning rust. As long as all my pools are using all the OSDs (currently necessary) this is not really a tuning problem - just a consequence of adding awful old recycled disks to my shiny NVME.

To answer a few questions:

* I've tried KRBD and librbd, also iSCSI, NFS, CephFS on multiple different physical and virtual OSes, both *nix and Windows.
* I've benchtested with fio and rbd bench.
* Yes, I'm using the default replicas and min_size.
* Yes, I already set primary_affinity to zero for the spinning disks.
* No, I can't move disks around or add flash disks to the older machines with the spinning storage in them. Our hardware vendors are kinda butts.

I have gone back to my hardware vendor to see if they can do a much better price on more NVME 12 months later. Fingers crossed.

Thanks again for everyone's quick responses!

On 31/05/2023 12:51 am, Thorne Lawler wrote:
> [...]
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io