[ceph-users] Re: Planning cluster
Never ever use osd pool default min size = 1; it will break your neck and really does not make sense. :-)

On Mon, Jul 10, 2023 at 7:33 PM Dan van der Ster wrote:
>
> Hi Jan,
>
> On Sun, Jul 9, 2023 at 11:17 PM Jan Marek wrote:
> > Hello,
> >
> > I have a cluster which has this configuration:
> >
> > osd pool default size = 3
> > osd pool default min size = 1
>
> Don't use min_size = 1 during regular stable operations. Instead, use min_size = 2 to ensure data safety, and then you can set the pool to min_size = 1 manually in case of an emergency (e.g. in case the 2 copies fail and are not recoverable).
>
> > I have 5 monitor nodes and 7 OSD nodes.
>
> 3 monitors is probably enough. Put 2 in the DC with 2 replicas, and the other in the DC with 1 replica.
>
> > I have changed the CRUSH map to divide the Ceph cluster into two datacenters - in the first one will be the part of the cluster with 2 copies of the data, and in the second one the part with one copy - for emergencies only.
> >
> > I still have this cluster in one location.
> >
> > This cluster has 1 PiB of raw data capacity, thus it is very expensive to add a further 300 TB of capacity to get 2+2 data redundancy.
> >
> > Will it work?
> >
> > If I turn off the 1/3 location, will it be operational?
>
> Yes, the PGs should be active and accept IO. But the cluster will be degraded; it cannot stay in this state permanently. (You will need to recover the 3rd replica or change the CRUSH map.)
>
> > I believe it will, and that would be the better choice. And what if the 2/3 location "dies"?
>
> With min_size = 2 the PGs will be inactive, but the data will be safe. If this happens, then set min_size = 1 to activate the PGs. The mons will not have quorum though -- you need a plan for that. And also plan where you put your MDSs.
>
> -- dan
>
> __
> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>
> > On this cluster there is a pool with CephFS - this is the main part of the cluster.
> >
> > Many thanks for your remarks.
> >
> > Sincerely
> > Jan Marek
> > --
> > Ing. Jan Marek
> > University of South Bohemia
> > Academic Computer Centre
> > Phone: +420389032080
> > http://www.gnu.org/philosophy/no-word-attachments.cs.html
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
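As a rough sketch of the 2+1 layout discussed above (the datacenter bucket names dc1/dc2, the rule name and <pool> are placeholders, not values from Jan's cluster; it assumes pool size = 3):

# Decompile the CRUSH map, add a rule that picks 2 hosts in dc1 and 1 in dc2,
# then recompile and inject it:
ceph osd getcrushmap -o /tmp/cm.bin
crushtool -d /tmp/cm.bin -o /tmp/cm.txt
#   rule dc1_primary {
#       id 1                                 # pick an unused rule id
#       type replicated
#       step take dc1
#       step chooseleaf firstn 2 type host
#       step emit
#       step take dc2
#       step chooseleaf firstn 1 type host
#       step emit
#   }
crushtool -c /tmp/cm.txt -o /tmp/cm.new
ceph osd setcrushmap -i /tmp/cm.new

# Per-pool settings as recommended above:
ceph osd pool set <pool> crush_rule dc1_primary
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2
# Emergency only, e.g. if dc1 is lost and PGs go inactive:
ceph osd pool set <pool> min_size 1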
[ceph-users] Re: RGW dynamic resharding blocks write ops
Hi again,

I got the log excerpt with the RGW error message:

s3:put_obj block_while_resharding ERROR: bucket is still resharding, please retry

Below is the message in context; I don't see a return code though, only 206 for the GET requests. Unfortunately, we only have a recorded PuTTY session, so the information is limited. Does it help anyway to get to the bottom of this?

One more thing: we noticed that rgw_reshard_bucket_lock_duration was changed somewhere in Nautilus from 120 to 360 seconds. They hadn't reported these errors before the upgrade, so I feel like either they were lucky and simply didn't run into resharding while trying to write, or the lock_duration was actually only 120 seconds, which may have been okay for the application. It's all just guessing at the moment; we don't have dumps of all the configs in place from before the upgrade, at least not that I'm aware of. Anyway, I still need to discuss with them whether disabling dynamic resharding is the way to go (and then manually reshard during maintenance) and whether to preshard new buckets if they can tell how many objects are expected in them.

The error message repeats for around three and a half minutes; apparently this is the time it took to reshard the bucket. Maybe reducing the lock_duration to 120 seconds could also help here, but I wonder what the consequence would be. Would it stop resharding after 2 minutes and leave something orphaned behind, or how exactly does that lock_duration impact the process?

One more question: I see these INFO messages "found lock on ", but the error message "bucket is still resharding" doesn't contain the bucket name. I see the INFO message a lot, not only during the application timeout errors, so they don't seem related. How can I tell which bucket is throwing the error during resharding?
Thanks,
Eugen

709+ 7f23a77be700 1 beast: 0x7f24dc1d85d0: - ICAS_nondicom [06/Jul/2023:00:06:58.613 +] "GET /shaprod-lts/20221114193114-20521114-77bfd>
305+ 7f239dfab700 0 req 17231063235781096603 91.367729187s s3:put_obj block_while_resharding ERROR: bucket is still resharding, please retry
313+ 7f23246b8700 0 req 13860563368404093374 91.383728027s s3:put_obj block_while_resharding ERROR: bucket is still resharding, please retry
313+ 7f2382f75700 0 req 17231063235781096603 91.375732422s s3:put_obj NOTICE: resharding operation on bucket index detected, blocking
313+ 7f231669c700 0 req 13860563368404093374 91.383728027s s3:put_obj NOTICE: resharding operation on bucket index detected, blocking
365+ 7f23a0fb1700 0 INFO: RGWReshardLock::lock found lock on jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be held by another RGW p>
365+ 7f22fe66c700 0 INFO: RGWReshardLock::lock found lock on jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be held by another RGW p>
365+ 7f237c768700 0 INFO: RGWReshardLock::lock found lock on jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be held by another RGW p>
365+ 7f2361732700 0 INFO: RGWReshardLock::lock found lock on jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be held by another RGW p>
409+ 7f231669c700 0 INFO: RGWReshardLock::lock found lock on jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be held by another RGW p>
409+ 7f2382f75700 0 INFO: RGWReshardLock::lock found lock on jivex-002-p2s3:d2c448cb-4f31-4f28-ac93-3941982d2f46.284023468.1 to be held by another RGW p>
669+ 7f22e3e37700 0 req 18215535743838894575 91.735725403s s3:put_obj block_while_resharding ERROR: bucket is still resharding, please retry
809+ 7f2326ebd700 0 req 18215535743838894575 91.875732422s s3:put_obj NOTICE: resharding operation on bucket index detected, blocking

Zitat von Eugen Block :

We had quite a small window yesterday to debug. I found the error messages but we didn't collect the logs yet; I will ask them to do that on Monday. I *think* the error was something like this:

resharding operation on bucket index detected, blocking
block_while_resharding ERROR: bucket is still resharding, please retry

But I'll verify and ask them to collect the logs.

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/4XMMPSHW7OQ3NU7IE4QFK6A2QVDQ2CJR/

Zitat von Casey Bodley :

While a bucket is resharding, rgw will retry several times internally to apply the write before returning an error to the client. While most buckets can be resharded within seconds, very large buckets may hit these timeouts. Any other cause of slow osd ops could also have that effect. It can be helpful to pre-shard very large buckets to avoid these resharding delays.

Can you tell which error code was returned to the client there? It should be a retryable error, and many http clients have retry logic to prevent these errors from reaching the app.
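As a rough sketch of the knobs discussed in this thread (bucket name and shard count are placeholders; whether disabling dynamic resharding is appropriate at all is the open question above):

# Current settings
ceph config get client.rgw rgw_dynamic_resharding
ceph config get client.rgw rgw_reshard_bucket_lock_duration
# Optionally disable dynamic resharding and reshard manually in a maintenance window
ceph config set client.rgw rgw_dynamic_resharding false
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<N>
# Inspect ongoing/pending reshard activity
radosgw-admin reshard list
radosgw-admin reshard status --bucket=<bucket>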
[ceph-users] cephadm problem with MON deployment
Hello I'm trying to add MONs in advance of a planned downtime. This has actually ended up removing an existing MON, which isn't helpful. The error I'm seeing is: Invalid argument: /var/lib/ceph/mon/ceph-/store.db: does not exist (create_if_missing is false) error opening mon data directory at '/var/lib/ceph/mon/ceph-': (22) Invalid argument It appears that the fsid is being stripped, because the directory was there. It's now in /var/lib/ceph//removed This appears to be similar to: https://tracker.ceph.com/issues/45167 which was closed for lack of a reproducer. The command I ran was: sudo ceph orch apply mon --placement="comma-separated hostname list" after running that with "--dry-run". Would be grateful for some advice here - I wasn't expecting to reduce the MON count. Best Wishes, Adam ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm problem with MON deployment
Forgot to say we're on Pacific 16.2.13. On Tue, 11 Jul 2023 at 08:55, Adam Huffman wrote: > Hello > > I'm trying to add MONs in advance of a planned downtime. > > This has actually ended up removing an existing MON, which isn't helpful. > > The error I'm seeing is: > > Invalid argument: /var/lib/ceph/mon/ceph-/store.db: does not > exist (create_if_missing is false) > error opening mon data directory at '/var/lib/ceph/mon/ceph-': > (22) Invalid argument > > It appears that the fsid is being stripped, because the directory was > there. > It's now in /var/lib/ceph//removed > > This appears to be similar to: > https://tracker.ceph.com/issues/45167 > which was closed for lack of a reproducer. > > The command I ran was: > > sudo ceph orch apply mon --placement="comma-separated hostname list" > > after running that with "--dry-run". > > Would be grateful for some advice here - I wasn't expecting to reduce the > MON count. > > Best Wishes, > Adam > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm problem with MON deployment
Okay, this turned out to be down to cephadm rejecting the request because the new MON was not in the list of public networks. I had seen that error in the logs, but it looked as though it was a consequence of the store.db error, rather than a cause. After adding the new network, and repeating the request, the new MONs were created. On Tue, 11 Jul 2023 at 08:57, Adam Huffman wrote: > Forgot to say we're on Pacific 16.2.13. > > On Tue, 11 Jul 2023 at 08:55, Adam Huffman > wrote: > >> Hello >> >> I'm trying to add MONs in advance of a planned downtime. >> >> This has actually ended up removing an existing MON, which isn't helpful. >> >> The error I'm seeing is: >> >> Invalid argument: /var/lib/ceph/mon/ceph-/store.db: does not >> exist (create_if_missing is false) >> error opening mon data directory at '/var/lib/ceph/mon/ceph-': >> (22) Invalid argument >> >> It appears that the fsid is being stripped, because the directory was >> there. >> It's now in /var/lib/ceph//removed >> >> This appears to be similar to: >> https://tracker.ceph.com/issues/45167 >> which was closed for lack of a reproducer. >> >> The command I ran was: >> >> sudo ceph orch apply mon --placement="comma-separated hostname list" >> >> after running that with "--dry-run". >> >> Would be grateful for some advice here - I wasn't expecting to reduce the >> MON count. >> >> Best Wishes, >> Adam >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
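To summarize the fix as commands (network ranges and hostnames below are examples, not the actual values used here):

# Make sure every prospective MON host is covered by the mon public network(s)
ceph config get mon public_network
ceph config set mon public_network "192.168.1.0/24,10.0.0.0/24"
# Then re-apply the placement and verify
ceph orch apply mon --placement="host1,host2,host3,host4,host5"
ceph orch ps --daemon-type mon
ceph mon stat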
[ceph-users] Cephadm fails to deploy loki with promtail correctly
I'm not sure if it's a bug with Cephadm, but it looks like it. I've got Loki deployed on one machine and Promtail deployed to all machines. After creating a login, I can only view the logs of the host on which Loki is running.

When inspecting the Promtail configuration, the configured URL for Loki is set to http://host.containers.internal:3100. Shouldn't this be configured by Cephadm and point to the Loki host? This looks a lot like the issues with incorrectly set Grafana or Prometheus URLs; tracker issue 57018 was created for that. Should I create another bug report? And does anyone know a workaround to set the correct URL for the time being?

Best regards,
Sake
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
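A quick way to confirm what cephadm actually rendered on an affected host (the path pattern is an assumption based on the usual /var/lib/ceph/<fsid>/<daemon> layout):

# Look for the Loki client URL in the deployed promtail config
sudo grep -rn "host.containers.internal" /var/lib/ceph/*/promtail.*/ 2>/dev/null
# Note: hand-editing the rendered file is only a stopgap; cephadm rewrites it on the
# next reconfig/redeploy, so the proper fix needs to happen in cephadm itself.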
[ceph-users] OSD memory usage after cephadm adoption
Hi everyone,

We recently migrated a cluster from ceph-ansible to cephadm. Everything went as expected, but now we have some alerts about high memory usage. The cluster is running ceph 16.2.13. Of course, after adoption the OSDs ended up in the <unmanaged> zone:

NAME PORTS RUNNING REFRESHED AGE PLACEMENT
osd 88 7m ago -

But the weirdest thing I observed is that the OSDs seem to use more memory than the mem limit:

NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
osd.0 running (5d) 2m ago 5d 19.7G 6400M 16.2.13 327f301eff51 ca07fe74a0fa
osd.1 running (5d) 2m ago 5d 7068M 6400M 16.2.13 327f301eff51 6223ed8e34e9
osd.10 running (5d) 10m ago 5d 7235M 6400M 16.2.13 327f301eff51 073ddc0d7391
osd.100 running (5d) 2m ago 5d 7118M 6400M 16.2.13 327f301eff51 b7f9238c0c24

Does anybody know why OSDs would use more memory than the limit?

Thanks
Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
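As far as I know, the MEM LIM column in "ceph orch ps" is derived from osd_memory_target, so it is worth checking what is actually in effect (osd.0 is just an example id):

ceph config get osd osd_memory_target
ceph config show osd.0 osd_memory_target
# cephadm can also autotune the target from the host's RAM
ceph config get osd osd_memory_target_autotune
ceph config get mgr mgr/cephadm/autotune_memory_target_ratio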
[ceph-users] Re: MON sync time depends on outage duration
I'm not so sure anymore if that could really help here. The dump-keys output from the mon contains 42 million osd_snap prefix entries, 39 million of them are "purged_snap" keys. I also compared to other clusters as well, those aren't tombstones but expected "history" of purged snapshots. So I don't think removing a couple of hundred trash snapshots will actually reduce the number of osd_snap keys. At least doubling the payload_size seems to have a positive impact. The compaction during the sync has a negative impact, of course, same as not having the mon store on SSDs. I'm currently playing with a test cluster, removing all "purged_snap" entries from the mon db (not finished yet) to see what that will do with the mon and if it will even start correctly. But has anyone done that, removing keys from the mon store? Not sure what to expect yet... Zitat von Dan van der Ster : Oh yes, sounds like purging the rbd trash will be the real fix here! Good luck! __ Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com On Mon, Jul 10, 2023 at 6:10 AM Eugen Block wrote: Hi, I got a customer response with payload size 4096, that made things even worse. The mon startup time was now around 40 minutes. My doubts wrt decreasing the payload size seem confirmed. Then I read Dan's response again which also mentions that the default payload size could be too small. So I asked them to double the default (2M instead of 1M) and am now waiting for a new result. I'm still wondering why this only happens when the mon is down for more than 5 minutes. Does anyone have an explanation for that time factor? Another thing they're going to do is to remove lots of snapshot tombstones (rbd mirroring snapshots in the trash namespace), maybe that will reduce the osd_snap keys in the mon db, which then would increase the startup time. We'll see... Zitat von Eugen Block : > Thanks, Dan! > >> Yes that sounds familiar from the luminous and mimic days. >> The workaround for zillions of snapshot keys at that time was to use: >> ceph config set mon mon_sync_max_payload_size 4096 > > I actually did search for mon_sync_max_payload_keys, not bytes so I > missed your thread, it seems. Thanks for pointing that out. So the > defaults seem to be these in Octopus: > > "mon_sync_max_payload_keys": "2000", > "mon_sync_max_payload_size": "1048576", > >> So it could be in your case that the sync payload is just too small to >> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon >> you should be able to understand what is taking so long, and tune >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly. > > I'm confused, if the payload size is too small, why would decreasing > it help? Or am I misunderstanding something? But it probably won't > hurt to try it with 4096 and see if anything changes. If not we can > still turn on debug logs and take a closer look. > >> And additional to Dan suggestion, the HDD is not a good choices for >> RocksDB, which is most likely the reason for this thread, I think >> that from the 3rd time the database just goes into compaction >> maintenance > > Believe me, I know... but there's not much they can currently do > about it, quite a long story... But I have been telling them that > for months now. Anyway, I will make some suggestions and report back > if it worked in this case as well. > > Thanks! > Eugen > > Zitat von Dan van der Ster : > >> Hi Eugen! >> >> Yes that sounds familiar from the luminous and mimic days. 
>> >> Check this old thread: >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/ >> (that thread is truncated but I can tell you that it worked for Frank). >> Also the even older referenced thread: >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/ >> >> The workaround for zillions of snapshot keys at that time was to use: >> ceph config set mon mon_sync_max_payload_size 4096 >> >> That said, that sync issue was supposed to be fixed by way of adding the >> new option mon_sync_max_payload_keys, which has been around since nautilus. >> >> So it could be in your case that the sync payload is just too small to >> efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon >> you should be able to understand what is taking so long, and tune >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly. >> >> Good luck! >> >> Dan >> >> __ >> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com >> >> >> >> On Thu, Jul 6, 2023 at 1:47 PM Eugen Block wrote: >> >>> Hi *, >>> >>> I'm investigating an interesting issue on two customer clusters (used >>> for mirroring) I've not solved yet, but today we finally made some >>> progress. Maybe someone has an idea where to look next, I'd appreciate >>> any hi
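For anyone who wants to count the keys the same way, a sketch (the mon must be stopped, or run this against a copy of the store; the path and the grep pattern are assumptions on my side):

ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys > /tmp/mon-keys.txt
# key count per prefix
awk '{print $1}' /tmp/mon-keys.txt | sort | uniq -c | sort -rn | head
# how many of the osd_snap keys are purged_snap entries
grep -c '^osd_snap purged_snap' /tmp/mon-keys.txt
# the payload tuning discussed above (2M instead of the 1M default)
ceph config set mon mon_sync_max_payload_size 2097152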
[ceph-users] Re: OSD memory usage after cephadm adoption
Hi Luis, Can you do a "ceph tell osd. perf dump" and "ceph daemon osd. dump_mempools"? Those should help us understand how much memory is being used by different parts of the OSD/bluestore and how much memory the priority cache thinks it has to work with. Mark On 7/11/23 4:57 AM, Luis Domingues wrote: Hi everyone, We recently migrate a cluster from ceph-ansible to cephadm. Everything went as expected. But now we have some alerts on high memory usage. Cluster is running ceph 16.2.13. Of course, after adoption OSDs ended up in the zone: NAME PORTS RUNNING REFRESHED AGE PLACEMENT osd 88 7m ago - But the weirdest thing I observed, is that the OSDs seem to use more memory that the mem limit: NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID osd.0 running (5d) 2m ago 5d 19.7G 6400M 16.2.13 327f301eff51 ca07fe74a0fa osd.1 running (5d) 2m ago 5d 7068M 6400M 16.2.13 327f301eff51 6223ed8e34e9 osd.10 running (5d) 10m ago 5d 7235M 6400M 16.2.13 327f301eff51 073ddc0d7391 osd.100 running (5d) 2m ago 5d 7118M 6400M 16.2.13 327f301eff51 b7f9238c0c24 Does anybody knows why OSDs would use more memory than the limit? Thanks Luis Domingues Proton AG ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io -- Best Regards, Mark Nelson Head of R&D (USA) Clyso GmbH p: +49 89 21552391 12 a: Loristraße 8 | 80335 München | Germany w: https://clyso.com | e: mark.nel...@clyso.com We are hiring: https://www.clyso.com/jobs/ ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] upload-part-copy gets access denied after cluster upgrade
Hello everyone,

We have a Ceph cluster which was recently updated from Octopus (15.2.12) to Pacific (16.2.13). There has been a problem with multipart upload: when doing UPLOAD_PART_COPY from a valid and existing previously uploaded part, it gets 403, ONLY WHEN IT'S CALLED BY THE SERVICE-USER. The same scenario gets a 200 response with a full-access sub-user, and both the sub-user and the service-user get 200 for the same scenario on the Octopus version. The policy for the service-user access is as below:

{
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam:::user/wid:suserid"
      },
      "Action": "*",
      "Resource": [
        "arn:aws:s3:::bucketname",
        "arn:aws:s3:::bucketname/*"
      ]
    }
  ]
}

Note that this very service-user can perform a multipart upload without any problem on both versions; only upload_part_copy, and only on Pacific, gets 403, which makes it unlikely to be an access problem. Has anyone encountered this issue? I performed the multipart upload using boto3, but the same issue occurs with other clients as well.

regards
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
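In case it helps reproduce this outside of boto3, the same call via the AWS CLI would look roughly like the following (endpoint, bucket and key names are made up):

# 1) start a multipart upload as the service user
aws --endpoint-url https://rgw.example.com s3api create-multipart-upload \
    --bucket bucketname --key target-object
# 2) server-side copy an existing object in as part 1 (the call that now returns 403)
aws --endpoint-url https://rgw.example.com s3api upload-part-copy \
    --bucket bucketname --key target-object --part-number 1 \
    --upload-id <UploadId-from-step-1> \
    --copy-source bucketname/existing-object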
[ceph-users] Re: MON sync time depends on outage duration
Out of curiosity, what is your require_osd_release set to? (ceph osd dump | grep require_osd_release) Josh On Tue, Jul 11, 2023 at 5:11 AM Eugen Block wrote: > > I'm not so sure anymore if that could really help here. The dump-keys > output from the mon contains 42 million osd_snap prefix entries, 39 > million of them are "purged_snap" keys. I also compared to other > clusters as well, those aren't tombstones but expected "history" of > purged snapshots. So I don't think removing a couple of hundred trash > snapshots will actually reduce the number of osd_snap keys. At least > doubling the payload_size seems to have a positive impact. The > compaction during the sync has a negative impact, of course, same as > not having the mon store on SSDs. > I'm currently playing with a test cluster, removing all "purged_snap" > entries from the mon db (not finished yet) to see what that will do > with the mon and if it will even start correctly. But has anyone done > that, removing keys from the mon store? Not sure what to expect yet... > > Zitat von Dan van der Ster : > > > Oh yes, sounds like purging the rbd trash will be the real fix here! > > Good luck! > > > > __ > > Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com > > > > > > > > > > On Mon, Jul 10, 2023 at 6:10 AM Eugen Block wrote: > > > >> Hi, > >> I got a customer response with payload size 4096, that made things > >> even worse. The mon startup time was now around 40 minutes. My doubts > >> wrt decreasing the payload size seem confirmed. Then I read Dan's > >> response again which also mentions that the default payload size could > >> be too small. So I asked them to double the default (2M instead of 1M) > >> and am now waiting for a new result. I'm still wondering why this only > >> happens when the mon is down for more than 5 minutes. Does anyone have > >> an explanation for that time factor? > >> Another thing they're going to do is to remove lots of snapshot > >> tombstones (rbd mirroring snapshots in the trash namespace), maybe > >> that will reduce the osd_snap keys in the mon db, which then would > >> increase the startup time. We'll see... > >> > >> Zitat von Eugen Block : > >> > >> > Thanks, Dan! > >> > > >> >> Yes that sounds familiar from the luminous and mimic days. > >> >> The workaround for zillions of snapshot keys at that time was to use: > >> >> ceph config set mon mon_sync_max_payload_size 4096 > >> > > >> > I actually did search for mon_sync_max_payload_keys, not bytes so I > >> > missed your thread, it seems. Thanks for pointing that out. So the > >> > defaults seem to be these in Octopus: > >> > > >> > "mon_sync_max_payload_keys": "2000", > >> > "mon_sync_max_payload_size": "1048576", > >> > > >> >> So it could be in your case that the sync payload is just too small to > >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and > >> debug_mon > >> >> you should be able to understand what is taking so long, and tune > >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly. > >> > > >> > I'm confused, if the payload size is too small, why would decreasing > >> > it help? Or am I misunderstanding something? But it probably won't > >> > hurt to try it with 4096 and see if anything changes. If not we can > >> > still turn on debug logs and take a closer look. 
> >> > > >> >> And additional to Dan suggestion, the HDD is not a good choices for > >> >> RocksDB, which is most likely the reason for this thread, I think > >> >> that from the 3rd time the database just goes into compaction > >> >> maintenance > >> > > >> > Believe me, I know... but there's not much they can currently do > >> > about it, quite a long story... But I have been telling them that > >> > for months now. Anyway, I will make some suggestions and report back > >> > if it worked in this case as well. > >> > > >> > Thanks! > >> > Eugen > >> > > >> > Zitat von Dan van der Ster : > >> > > >> >> Hi Eugen! > >> >> > >> >> Yes that sounds familiar from the luminous and mimic days. > >> >> > >> >> Check this old thread: > >> >> > >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/ > >> >> (that thread is truncated but I can tell you that it worked for Frank). > >> >> Also the even older referenced thread: > >> >> > >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/ > >> >> > >> >> The workaround for zillions of snapshot keys at that time was to use: > >> >> ceph config set mon mon_sync_max_payload_size 4096 > >> >> > >> >> That said, that sync issue was supposed to be fixed by way of adding the > >> >> new option mon_sync_max_payload_keys, which has been around since > >> nautilus. > >> >> > >> >> So it could be in your case that the sync payload is just too small to > >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and > >> debug_mon > >> >> you should
[ceph-users] Re: MON sync time depends on outage duration
It was installed with Octopus and hasn't been upgraded yet: "require_osd_release": "octopus", Zitat von Josh Baergen : Out of curiosity, what is your require_osd_release set to? (ceph osd dump | grep require_osd_release) Josh On Tue, Jul 11, 2023 at 5:11 AM Eugen Block wrote: I'm not so sure anymore if that could really help here. The dump-keys output from the mon contains 42 million osd_snap prefix entries, 39 million of them are "purged_snap" keys. I also compared to other clusters as well, those aren't tombstones but expected "history" of purged snapshots. So I don't think removing a couple of hundred trash snapshots will actually reduce the number of osd_snap keys. At least doubling the payload_size seems to have a positive impact. The compaction during the sync has a negative impact, of course, same as not having the mon store on SSDs. I'm currently playing with a test cluster, removing all "purged_snap" entries from the mon db (not finished yet) to see what that will do with the mon and if it will even start correctly. But has anyone done that, removing keys from the mon store? Not sure what to expect yet... Zitat von Dan van der Ster : > Oh yes, sounds like purging the rbd trash will be the real fix here! > Good luck! > > __ > Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com > > > > > On Mon, Jul 10, 2023 at 6:10 AM Eugen Block wrote: > >> Hi, >> I got a customer response with payload size 4096, that made things >> even worse. The mon startup time was now around 40 minutes. My doubts >> wrt decreasing the payload size seem confirmed. Then I read Dan's >> response again which also mentions that the default payload size could >> be too small. So I asked them to double the default (2M instead of 1M) >> and am now waiting for a new result. I'm still wondering why this only >> happens when the mon is down for more than 5 minutes. Does anyone have >> an explanation for that time factor? >> Another thing they're going to do is to remove lots of snapshot >> tombstones (rbd mirroring snapshots in the trash namespace), maybe >> that will reduce the osd_snap keys in the mon db, which then would >> increase the startup time. We'll see... >> >> Zitat von Eugen Block : >> >> > Thanks, Dan! >> > >> >> Yes that sounds familiar from the luminous and mimic days. >> >> The workaround for zillions of snapshot keys at that time was to use: >> >> ceph config set mon mon_sync_max_payload_size 4096 >> > >> > I actually did search for mon_sync_max_payload_keys, not bytes so I >> > missed your thread, it seems. Thanks for pointing that out. So the >> > defaults seem to be these in Octopus: >> > >> > "mon_sync_max_payload_keys": "2000", >> > "mon_sync_max_payload_size": "1048576", >> > >> >> So it could be in your case that the sync payload is just too small to >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and >> debug_mon >> >> you should be able to understand what is taking so long, and tune >> >> mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly. >> > >> > I'm confused, if the payload size is too small, why would decreasing >> > it help? Or am I misunderstanding something? But it probably won't >> > hurt to try it with 4096 and see if anything changes. If not we can >> > still turn on debug logs and take a closer look. 
>> > >> >> And additional to Dan suggestion, the HDD is not a good choices for >> >> RocksDB, which is most likely the reason for this thread, I think >> >> that from the 3rd time the database just goes into compaction >> >> maintenance >> > >> > Believe me, I know... but there's not much they can currently do >> > about it, quite a long story... But I have been telling them that >> > for months now. Anyway, I will make some suggestions and report back >> > if it worked in this case as well. >> > >> > Thanks! >> > Eugen >> > >> > Zitat von Dan van der Ster : >> > >> >> Hi Eugen! >> >> >> >> Yes that sounds familiar from the luminous and mimic days. >> >> >> >> Check this old thread: >> >> >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/ >> >> (that thread is truncated but I can tell you that it worked for Frank). >> >> Also the even older referenced thread: >> >> >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/ >> >> >> >> The workaround for zillions of snapshot keys at that time was to use: >> >> ceph config set mon mon_sync_max_payload_size 4096 >> >> >> >> That said, that sync issue was supposed to be fixed by way of adding the >> >> new option mon_sync_max_payload_keys, which has been around since >> nautilus. >> >> >> >> So it could be in your case that the sync payload is just too small to >> >> efficiently move 42 million osd_snap keys? Using debug_paxos and >> debug_mon >> >> you should be able to understand what is taking so long, and tune >> >> mo
[ceph-users] Re: OSD memory usage after cephadm adoption
Here you have. Perf dump: { "AsyncMessenger::Worker-0": { "msgr_recv_messages": 12239872, "msgr_send_messages": 12284221, "msgr_recv_bytes": 43759275160, "msgr_send_bytes": 61268769426, "msgr_created_connections": 754, "msgr_active_connections": 100, "msgr_running_total_time": 939.476931816, "msgr_running_send_time": 337.873686715, "msgr_running_recv_time": 360.728238752, "msgr_running_fast_dispatch_time": 183.737116872, "msgr_send_messages_queue_lat": { "avgcount": 12284206, "sum": 1538.989479364, "avgtime": 0.000125281 }, "msgr_handle_ack_lat": { "avgcount": 5258403, "sum": 1.005075918, "avgtime": 0.00191 } }, "AsyncMessenger::Worker-1": { "msgr_recv_messages": 12099771, "msgr_send_messages": 12138795, "msgr_recv_bytes": 56967534605, "msgr_send_bytes": 130548664272, "msgr_created_connections": 647, "msgr_active_connections": 91, "msgr_running_total_time": 977.277996439, "msgr_running_send_time": 362.155959231, "msgr_running_recv_time": 365.376281473, "msgr_running_fast_dispatch_time": 191.186643292, "msgr_send_messages_queue_lat": { "avgcount": 12138818, "sum": 1557.187685700, "avgtime": 0.000128281 }, "msgr_handle_ack_lat": { "avgcount": 6155265, "sum": 1.096270527, "avgtime": 0.00178 } }, "AsyncMessenger::Worker-2": { "msgr_recv_messages": 11858354, "msgr_send_messages": 11960404, "msgr_recv_bytes": 60727084610, "msgr_send_bytes": 168534726650, "msgr_created_connections": 1043, "msgr_active_connections": 103, "msgr_running_total_time": 937.324084772, "msgr_running_send_time": 351.174710644, "msgr_running_recv_time": 2744.276782474, "msgr_running_fast_dispatch_time": 172.960322050, "msgr_send_messages_queue_lat": { "avgcount": 11960392, "sum": 1763.762581924, "avgtime": 0.000147466 }, "msgr_handle_ack_lat": { "avgcount": 2651457, "sum": 0.538495450, "avgtime": 0.00203 } }, "bluefs": { "db_total_bytes": 128005955584, "db_used_bytes": 3271557120, "wal_total_bytes": 0, "wal_used_bytes": 0, "slow_total_bytes": 1810377216, "slow_used_bytes": 0, "num_files": 70, "log_bytes": 13045760, "log_compactions": 58, "logged_bytes": 922333184, "files_written_wal": 2, "files_written_sst": 13, "bytes_written_wal": 1988489216, "bytes_written_sst": 268890112, "bytes_written_slow": 0, "max_bytes_wal": 0, "max_bytes_db": 3271557120, "max_bytes_slow": 0, "read_random_count": 577484, "read_random_bytes": 2879541532, "read_random_disk_count": 284290, "read_random_disk_bytes": 1540394118, "read_random_buffer_count": 319088, "read_random_buffer_bytes": 1339147414, "read_count": 1086625, "read_bytes": 15054317429, "read_prefetch_count": 1069462, "read_prefetch_bytes": 14506469332, "read_zeros_candidate": 0, "read_zeros_errors": 0 }, "bluestore": { "kv_flush_lat": { "avgcount": 225099, "sum": 526.605165277, "avgtime": 0.002339438 }, "kv_commit_lat": { "avgcount": 225099, "sum": 61.412175620, "avgtime": 0.000272822 }, "kv_sync_lat": { "avgcount": 225099, "sum": 588.017340897, "avgtime": 0.002612261 }, "kv_final_lat": { "avgcount": 225096, "sum": 6.516869320, "avgtime": 0.28951 }, "state_prepare_lat": { "avgcount": 241063, "sum": 173.705759592, "avgtime": 0.000720582 }, "state_aio_wait_lat": { "avgcount": 241063, "sum": 1008.936150524, "avgtime": 0.004185362 }, "state_io_done_lat": { "avgcount": 241063, "sum": 2.923457351, "avgtime": 0.12127 }, "state_kv_queued_lat": { "avgcount": 241063, "sum": 560.050193021, "avgtime": 0.002323252 }, "state_kv_commiting_lat": { "avgcount": 241063, "sum": 68.355225981, "avgtime": 0.000283557 }, "state_kv_done_lat": { "avgcount": 241063, "sum": 0.097836444, "avgtime": 0.00405 }, 
"state_deferred_queued_lat": { "avgcount": 47230,
[ceph-users] Re: OSD memory usage after cephadm adoption
On 7/11/23 09:44, Luis Domingues wrote:

"bluestore-pricache": {
    "target_bytes": 6713193267,
    "mapped_bytes": 6718742528,
    "unmapped_bytes": 467025920,
    "heap_bytes": 7185768448,
    "cache_bytes": 4161537138
},

Hi Luis,

Looks like the mapped bytes for this OSD process are very close to (just a little over) the target bytes that was set when you did the perf dump. There is some unmapped memory that can be reclaimed by the kernel, but we can't force the kernel to reclaim it. It could be that the kernel is being a little lazy if there isn't memory pressure.

The way the memory autotuning works in Ceph is that periodically the prioritycache system will look at the mapped memory usage of the process, then grow/shrink the aggregate size of the in-memory caches to try and stay near the target. It's reactive in nature, meaning that it can't completely control for spikes. It also can't shrink the caches below a small minimum size, so if there is a memory leak it will help to an extent but can't completely fix it. Once the aggregate memory size is decided on, it goes through a process of looking at how hot the different caches are and assigns memory based on where it thinks the memory would be most useful. Again, this is all based on mapped memory though; it can't force the kernel to reclaim memory that has already been released.

Thanks,
Mark

--
Best Regards,
Mark Nelson
Head of R&D (USA)
Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com
We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
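A simple way to compare the allocator's view with the kernel's view for one OSD (osd.0 as an example; the jq filter is just a convenience):

# prioritycache view: target vs. mapped vs. unmapped
ceph tell osd.0 perf dump | jq '."bluestore-pricache" | {target_bytes, mapped_bytes, unmapped_bytes, heap_bytes}'
# kernel view: resident set size (kB) of the ceph-osd processes on this host
ps -C ceph-osd -o pid,rss,args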
[ceph-users] Per minor-version view on docs.ceph.com
Hi,

I have a request about docs.ceph.com. Could you provide per-minor-version views on docs.ceph.com?

Currently, we can select the Ceph version through the URL path under https://docs.ceph.com/en/. There we can use the major versions' code names (e.g. "quincy") or "latest". However, we can't use minor version numbers like "v17.2.6". It would be convenient for me (and I guess for many other users, too) to be able to select the documentation for the version we actually use.

In a recent case, I read the quincy mclock documentation because I use v17.2.6. However, that documentation has changed a lot between v17.2.6 and the latest quincy release because of the recent mclock rework.

Thanks,
Satoru
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io