[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing
Hi Niklas,

I am not sure why you are surprised. In a large cluster, you should expect some rebalancing on every CRUSH map or CRUSH rule change. Ceph doesn't just enforce the failure domain, it also wants to have a "perfect" pseudo-random distribution across the cluster based on the CRUSH map hierarchy and the rules you provided.

Rebalancing in itself is not a problem, apart from adding some load to your cluster. As you are running Pacific, you need to properly tune your osd_max_backfills limit so that the rebalancing is fast enough but doesn't stress your cluster too much. Since Quincy there is a new scheduler that doesn't rely on these settings but tries to achieve fairness at the OSD level between the different types of load, and my experience with some large rebalancing on a cluster with 200 OSDs is that the result is pretty good.

If your test cluster is small enough, it may be that the current placement is simply not sensitive to the change of failure domain, as there are not enough different placement options for the placement algorithm to change anything. Take that as an exception rather than the normal behaviour.

Cheers,

Michel

On 15/07/2023 at 20:02, Niklas Hambüchen wrote:

Hi Ceph users,

I have a Ceph 16.2.7 cluster that so far has been replicated over the `host` failure domain. All `hosts` have been chosen to be in different `datacenter`s, so that was sufficient.

Now I wish to add more hosts, including some in already-used data centers, so I'm planning to use CRUSH's `datacenter` failure domain instead.

My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph decides that it should now rebalance the entire cluster. This seems unnecessary, and wrong.

Before, `ceph osd tree` (some OSDs omitted for legibility):

    ID   CLASS  WEIGHT     TYPE NAME        STATUS  REWEIGHT  PRI-AFF
    -1          440.73514  root default
    -3          146.43625      host node-4
     2    hdd    14.61089          osd.2        up   1.00000  1.00000
     3    hdd    14.61089          osd.3        up   1.00000  1.00000
    -7          146.43625      host node-5
    14    hdd    14.61089          osd.14       up   1.00000  1.00000
    15    hdd    14.61089          osd.15       up   1.00000  1.00000
    -10         146.43625      host node-6
    26    hdd    14.61089          osd.26       up   1.00000  1.00000
    27    hdd    14.61089          osd.27       up   1.00000  1.00000

After assigning the `datacenter` CRUSH buckets:

    ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
    -1          440.73514  root default
    -18         146.43625      datacenter FSN-DC16
    -7          146.43625          host node-5
    14    hdd    14.61089              osd.14          up   1.00000  1.00000
    15    hdd    14.61089              osd.15          up   1.00000  1.00000
    -17         146.43625      datacenter FSN-DC18
    -10         146.43625          host node-6
    26    hdd    14.61089              osd.26          up   1.00000  1.00000
    27    hdd    14.61089              osd.27          up   1.00000  1.00000
    -16         146.43625      datacenter FSN-DC4
    -3          146.43625          host node-4
     2    hdd    14.61089              osd.2           up   1.00000  1.00000
     3    hdd    14.61089              osd.3           up   1.00000  1.00000

This shows that the tree is essentially unchanged, it just "gained a level".

In `ceph status` I now get:

    pgs: 1167541260/1595506041 objects misplaced (73.177%)

If I remove the `datacenter` level again, then the misplacement disappears.

On a minimal testing cluster, this misplacement issue did not appear.

Why does Ceph think that these objects are misplaced when I add the datacenter level? Is there a more correct way to do this?

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
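A minimal sketch of the throttling Michel describes for a Pacific cluster; the values below are illustrative placeholders, not recommendations, and should be adjusted to the cluster at hand:

    # Optionally pause data movement while the CRUSH hierarchy is edited
    ceph osd set norebalance
    # ... add the datacenter buckets and move the hosts ...
    ceph osd unset norebalance

    # Throttle how aggressively the cluster backfills once rebalancing starts
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1

Raising osd_max_backfills speeds convergence at the cost of more client-visible load; lowering it does the opposite.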
[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device
On 7/10/23 11:19 AM, Matthew Booth wrote:

On Thu, 6 Jul 2023 at 12:54, Mark Nelson wrote:

On 7/6/23 06:02, Matthew Booth wrote:

On Wed, 5 Jul 2023 at 15:18, Mark Nelson wrote:

I'm sort of amazed that it gave you symbols without the debuginfo packages installed. I'll need to figure out a way to prevent that. Having said that, your new traces look more accurate to me.

The thing that sticks out to me is the (slight?) amount of contention on the PWL m_lock in dispatch_deferred_writes, update_root_scheduled_ops, append_ops, append_sync_point(), etc. I don't know if the contention around the m_lock is enough to cause an increase in 99% tail latency from 1.4ms to 5.2ms, but it's the first thing that jumps out at me. There appears to be a large number of threads (each tp_pwl thread, the io_context_pool threads, the qemu thread, and the bstore_aio thread) that all appear to have the potential to contend on that lock. You could try dropping the number of tp_pwl threads from 4 to 1 and see if that changes anything.

Will do. Any idea how to do that? I don't see an obvious rbd config option.

Thanks for looking into this,
Matt

You thanked me too soon... it appears to be hard-coded in, so you'll have to do a custom build. :D

https://github.com/ceph/ceph/blob/main/src/librbd/cache/pwl/AbstractWriteLog.cc#L55-L56

Just to update: I have managed to test this today and it made no difference :(

Sorry for the late reply, just saw I had written this email but never actually sent it.

So... Nuts. I was hoping for at least a little gain if you dropped it to 1.

In general, though, unless it's something egregious, are we really looking for something CPU-bound? Writes are 2 orders of magnitude slower than the underlying local disk. This has to be caused by something wildly inefficient. In this case I would expect it to be entirely latency bound.

It didn't look like PWL was working particularly hard, but to the extent that it was doing anything, it looked like it was spending a surprising amount of time dealing with that lock. I still suspect that if your goal is to reduce 99% latency, you'll need to figure out what's causing little micro-stalls.

I have had a thought: the guest filesystem has 512 byte blocks, but the PWL filesystem has 4k blocks (on a 4k disk). Given that the test is of small writes, is there any chance that we're multiplying the number of physical writes in some pathological manner?

Matt

--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
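One way to inspect the block-size mismatch Matt raises, as a sketch; the device and mount paths below are placeholders for the PWL cache device and its filesystem, not names from this thread:

    # Logical vs. physical sector size of the cache device on the host
    lsblk -o NAME,LOG-SEC,PHY-SEC,MIN-IO /dev/nvme0n1
    blockdev --getss --getpbsz /dev/nvme0n1

    # Block size of the filesystem holding the PWL cache (XFS shown here)
    xfs_info /mnt/pwl | grep bsize

    # Inside the guest: block size of the filesystem under test
    blockdev --getbsz /dev/vda1

If 512-byte writes land on a device with 4k sectors, each small write can become a read-modify-write at the device level, which could plausibly contribute to the tail latency being discussed.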
[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing
Based on my understanding of CRUSH, it basically works down the hierarchy and then randomly (but deterministically for a given CRUSH map) picks buckets on that level for the object (based on the specific selection rule), and then it does this recursively until it ends up at the leaf nodes. Given that you introduced a whole hierarchy level just below the top, objects will now be distributed differently, since the pseudo-random hash-based selection strategy may now, for example, put an object that used to be in node-4 under FSN-DC16 instead.

So basically, when you fiddle with the hierarchy, you can generally expect lots of data movement everywhere downstream of your change.

On Sun, 16 Jul 2023 at 06:03, Niklas Hambüchen wrote:
> Hi Ceph users,
>
> I have a Ceph 16.2.7 cluster that so far has been replicated over the
> `host` failure domain.
> All `hosts` have been chosen to be in different `datacenter`s, so that was
> sufficient.
>
> Now I wish to add more hosts, including some in already-used data centers,
> so I'm planning to use CRUSH's `datacenter` failure domain instead.
>
> My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph
> decides that it should now rebalance the entire cluster.
> This seems unnecessary, and wrong.
>
> Before, `ceph osd tree` (some OSDs omitted for legibility):
>
> ID   CLASS  WEIGHT     TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> -1          440.73514  root default
> -3          146.43625      host node-4
>  2    hdd    14.61089          osd.2        up   1.00000  1.00000
>  3    hdd    14.61089          osd.3        up   1.00000  1.00000
> -7          146.43625      host node-5
> 14    hdd    14.61089          osd.14       up   1.00000  1.00000
> 15    hdd    14.61089          osd.15       up   1.00000  1.00000
> -10         146.43625      host node-6
> 26    hdd    14.61089          osd.26       up   1.00000  1.00000
> 27    hdd    14.61089          osd.27       up   1.00000  1.00000
>
> After assigning of `datacenter` crush buckets:
>
> ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
> -1          440.73514  root default
> -18         146.43625      datacenter FSN-DC16
> -7          146.43625          host node-5
> 14    hdd    14.61089              osd.14          up   1.00000  1.00000
> 15    hdd    14.61089              osd.15          up   1.00000  1.00000
> -17         146.43625      datacenter FSN-DC18
> -10         146.43625          host node-6
> 26    hdd    14.61089              osd.26          up   1.00000  1.00000
> 27    hdd    14.61089              osd.27          up   1.00000  1.00000
> -16         146.43625      datacenter FSN-DC4
> -3          146.43625          host node-4
>  2    hdd    14.61089              osd.2           up   1.00000  1.00000
>  3    hdd    14.61089              osd.3           up   1.00000  1.00000
>
> This shows that the tree is essentially unchanged, it just "gained a
> level".
>
> In `ceph status` I now get:
>
>     pgs: 1167541260/1595506041 objects misplaced (73.177%)
>
> If I remove the `datacenter` level again, then the misplacement disappears.
>
> On a minimal testing cluster, this misplacement issue did not appear.
>
> Why does Ceph think that these objects are misplaced when I add the
> datacenter level?
> Is there a more correct way to do this?
>
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
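For what it's worth, a hedged sketch of how one could preview the effect of such a CRUSH change offline with crushtool before committing it; the rule id 0 and replica count 3 are assumptions and need to match the pool being checked:

    # Export and decompile the current CRUSH map
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # Edit crush.txt to add the datacenter buckets, then recompile
    crushtool -c crush.txt -o crush.new.bin

    # Compare the mappings the old and new maps would produce
    crushtool -i crush.bin     --test --rule 0 --num-rep 3 --show-mappings > before.txt
    crushtool -i crush.new.bin --test --rule 0 --num-rep 3 --show-mappings > after.txt
    diff before.txt after.txt | grep -c '^>'

A large diff count confirms, before any data moves, that the new hierarchy will remap most placements.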
[ceph-users] Re: OSD memory usage after cephadm adoption
Hi,

Thanks for your hints. I tried to play a little bit with the configs, and now I want to put the 0.7 value as the default. So I configured ceph:

    mgr  advanced  mgr/cephadm/autotune_memory_target_ratio  0.70  *
    osd  advanced  osd_memory_target_autotune                true

And I ended up having these configs:

    osd  host:st10-cbosd-001  basic  osd_memory_target  7219293672
    osd  host:st10-cbosd-002  basic  osd_memory_target  7219293672
    osd  host:st10-cbosd-004  basic  osd_memory_target  7219293672
    osd  host:st10-cbosd-005  basic  osd_memory_target  7219293451
    osd  host:st10-cbosd-006  basic  osd_memory_target  7219293451
    osd  host:st11-cbosd-007  basic  osd_memory_target  7216821484
    osd  host:st11-cbosd-008  basic  osd_memory_target  7216825454

And running `ceph orch ps` gave me:

    osd.0    st11-cbosd-007.plabs.ch  running (2d)   10m ago  10d  25.8G  6882M  16.2.13  327f301eff51  29a075f2f925
    osd.1    st10-cbosd-001.plabs.ch  running (19m)   8m ago  10d  2115M  6884M  16.2.13  327f301eff51  df5067bde5ce
    osd.10   st10-cbosd-005.plabs.ch  running (2d)   10m ago  10d  5524M  6884M  16.2.13  327f301eff51  f7bc0641ee46
    osd.100  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  5234M  6882M  16.2.13  327f301eff51  74efa243b953
    osd.101  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  4741M  6882M  16.2.13  327f301eff51  209671007c65
    osd.102  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  5174M  6882M  16.2.13  327f301eff51  63691d557732

So far so good. But when I took a look at the memory usage of my OSDs, I was below that value, by quite a bit. Looking at the OSDs themselves, I have:

    "bluestore-pricache": {
        "target_bytes": 4294967296,
        "mapped_bytes": 1343455232,
        "unmapped_bytes": 16973824,
        "heap_bytes": 1360429056,
        "cache_bytes": 2845415832
    },

And if I get the running config:

    "osd_memory_target": "4294967296",
    "osd_memory_target_autotune": "true",
    "osd_memory_target_cgroup_limit_ratio": "0.80",

Which is not the value I observe from the config. I have 4294967296 instead of something around 7219293672. Did I miss something?

Luis Domingues
Proton AG

--- Original Message ---
On Tuesday, July 11th, 2023 at 18:10, Mark Nelson wrote:

> On 7/11/23 09:44, Luis Domingues wrote:
>
> > "bluestore-pricache": {
> >     "target_bytes": 6713193267,
> >     "mapped_bytes": 6718742528,
> >     "unmapped_bytes": 467025920,
> >     "heap_bytes": 7185768448,
> >     "cache_bytes": 4161537138
> > },
>
> Hi Luis,
>
> Looks like the mapped bytes for this OSD process is very close to (just
> a little over) the target bytes that had been set when you did the perf
> dump. There is some unmapped memory that can be reclaimed by the kernel,
> but we can't force the kernel to reclaim it. It could be that the
> kernel is being a little lazy if there isn't memory pressure.
>
> The way the memory autotuning works in Ceph is that periodically the
> prioritycache system will look at the mapped memory usage of the
> process, then grow/shrink the aggregate size of the in-memory caches to
> try and stay near the target. It's reactive in nature, meaning that it
> can't completely control for spikes. It also can't shrink the caches
> below a small minimum size, so if there is a memory leak it will help to
> an extent
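A quick, hedged way to compare what the configuration database holds against what a given daemon is actually running with (osd.1 here is just an example id):

    # Value stored in the cluster configuration database for this daemon
    ceph config get osd.1 osd_memory_target

    # Value the running daemon is actually using
    ceph tell osd.1 config get osd_memory_target

    # Or, on the OSD host, via the admin socket
    ceph daemon osd.1 config show | grep osd_memory_target

A mismatch between the stored and running values is exactly the symptom described above, and is what the next reply addresses.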
[ceph-users] Re: OSD memory usage after cephadm adoption
Hello Luis,

Please see my response below:

> But when I took a look at the memory usage of my OSDs, I was below that
> value, by quite a bit. Looking at the OSDs themselves, I have:
>
> "bluestore-pricache": {
>     "target_bytes": 4294967296,
>     "mapped_bytes": 1343455232,
>     "unmapped_bytes": 16973824,
>     "heap_bytes": 1360429056,
>     "cache_bytes": 2845415832
> },
>
> And if I get the running config:
> "osd_memory_target": "4294967296",
> "osd_memory_target_autotune": "true",
> "osd_memory_target_cgroup_limit_ratio": "0.80",
>
> Which is not the value I observe from the config. I have 4294967296
> instead of something around 7219293672. Did I miss something?

This is very likely due to https://tracker.ceph.com/issues/48750. The fix was recently merged into the main branch and should be backported soon, all the way to Pacific. Until then, the workaround would be to set the osd_memory_target on each OSD individually to the desired value.

-Sridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
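A sketch of the per-OSD workaround Sridhar suggests, assuming every OSD should get the same target; the byte value is taken from the config dump earlier in the thread and would normally be computed per host:

    # Pin the memory target on every OSD individually until the fix lands
    for id in $(ceph osd ls); do
        ceph config set "osd.$id" osd_memory_target 7219293672
    done

Once the backport arrives, these per-daemon overrides can be removed again with `ceph config rm osd.<id> osd_memory_target` so the autotuner takes over.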