[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-16 Thread Michel Jouvin

Hi Niklas,

I am not sure why you are surprised. In a large cluster, you should 
expect some rebalancing on every crush map or crush rule change. 
Ceph doesn't just enforce the failure domain, it also wants to have a 
"perfect" pseudo-random distribution across the cluster based on the 
crush map hierarchy and the rules you provided. Rebalancing in itself is 
not a problem, apart from adding some load to your cluster. As you are 
running Pacific, you need to tune your osd_max_backfills limit properly 
so that the rebalancing is fast enough and doesn't stress your cluster 
too much. Since Quincy there is a new scheduler that doesn't rely on 
these settings but tries to achieve fairness at the OSD level between 
the different types of load, and my experience with some large 
rebalancing on a cluster with 200 OSDs is that the result is pretty good.
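
For example, on Pacific something like this is a common starting point 
(just a sketch, to be adapted to your hardware and to how much impact on 
client I/O you can tolerate):

    ceph config set osd osd_max_backfills 2
    ceph config set osd osd_recovery_max_active 2
    # optionally pause data movement while you finish the CRUSH changes:
    ceph osd set norebalance      # "ceph osd unset norebalance" when done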


If your test cluster is small enough, it may be that the current 
placement is not really sensitive to the change of failure domain, as 
there are not enough different placement options for the placement 
algorithm to change anything. Take it as an exception rather than the 
normal behaviour.


Cheers,

Michel

On 15/07/2023 at 20:02, Niklas Hambüchen wrote:

Hi Ceph users,

I have a Ceph 16.2.7 cluster that so far has been replicated over the 
`host` failure domain.
All `hosts` have been chosen to be in different `datacenter`s, so that 
was sufficient.


Now I wish to add more hosts, including some in already-used data 
centers, so I'm planning to use CRUSH's `datacenter` failure domain 
instead.
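
For context, the buckets were added with commands along these lines (a 
sketch; the exact invocations may have differed slightly):

    ceph osd crush add-bucket FSN-DC16 datacenter
    ceph osd crush move FSN-DC16 root=default
    ceph osd crush move node-5 datacenter=FSN-DC16
    # ...likewise for FSN-DC18/node-6 and FSN-DC4/node-4, plus switching
    # the replicated rule's failure domain from host to datacenter.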


My problem is that when I add the `datacenter`s into the CRUSH tree, 
Ceph decides that it should now rebalance the entire cluster.

This seems unnecessary, and wrong.

Before, `ceph osd tree` (some OSDs omitted for legibility):


    ID   CLASS  WEIGHT     TYPE NAME        STATUS  REWEIGHT  PRI-AFF
    -1          440.73514  root default
    -3          146.43625      host node-4
     2    hdd    14.61089          osd.2        up   1.0      1.0
     3    hdd    14.61089          osd.3        up   1.0      1.0
    -7          146.43625      host node-5
    14    hdd    14.61089          osd.14       up   1.0      1.0
    15    hdd    14.61089          osd.15       up   1.0      1.0
   -10          146.43625      host node-6
    26    hdd    14.61089          osd.26       up   1.0      1.0
    27    hdd    14.61089          osd.27       up   1.0      1.0



After assigning the `datacenter` CRUSH buckets:

    ID   CLASS  WEIGHT     TYPE NAME                STATUS  REWEIGHT  PRI-AFF
    -1          440.73514  root default
   -18          146.43625      datacenter FSN-DC16
    -7          146.43625          host node-5
    14    hdd    14.61089              osd.14           up   1.0      1.0
    15    hdd    14.61089              osd.15           up   1.0      1.0
   -17          146.43625      datacenter FSN-DC18
   -10          146.43625          host node-6
    26    hdd    14.61089              osd.26           up   1.0      1.0
    27    hdd    14.61089              osd.27           up   1.0      1.0
   -16          146.43625      datacenter FSN-DC4
    -3          146.43625          host node-4
     2    hdd    14.61089              osd.2            up   1.0      1.0
     3    hdd    14.61089              osd.3            up   1.0      1.0



This shows that the tree is essentially unchanged; it just "gained a 
level".


In `ceph status` I now get:

    pgs: 1167541260/1595506041 objects misplaced (73.177%)

If I remove the `datacenter` level again, then the misplacement 
disappears.


On a minimal testing cluster, this misplacement issue did not appear.

Why does Ceph think that these objects are misplaced when I add the 
datacenter level?

Is there a more correct way to do this?


Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device

2023-07-16 Thread Mark Nelson

On 7/10/23 11:19 AM, Matthew Booth wrote:


On Thu, 6 Jul 2023 at 12:54, Mark Nelson  wrote:


On 7/6/23 06:02, Matthew Booth wrote:

On Wed, 5 Jul 2023 at 15:18, Mark Nelson  wrote:

I'm sort of amazed that it gave you symbols without the debuginfo
packages installed.  I'll need to figure out a way to prevent that.
Having said that, your new traces look more accurate to me.  The thing
that sticks out to me is the (slight?) amount of contention on the PWL
m_lock in dispatch_deferred_writes, update_root_scheduled_ops,
append_ops, append_sync_point(), etc.

I don't know if the contention around the m_lock is enough to cause an
increase in 99% tail latency from 1.4ms to 5.2ms, but it's the first
thing that jumps out at me.  There appears to be a large number of
threads (each tp_pwl thread, the io_context_pool threads, the qemu
thread, and the bstore_aio thread) that all appear to have potential to
contend on that lock.  You could try dropping the number of tp_pwl
threads from 4 to 1 and see if that changes anything.

Will do. Any idea how to do that? I don't see an obvious rbd config option.

Thanks for looking into this,
Matt

you thanked me too soon...it appears to be hard-coded in, so you'll have
to do a custom build. :D

https://github.com/ceph/ceph/blob/main/src/librbd/cache/pwl/AbstractWriteLog.cc#L55-L56

Just to update: I have managed to test this today and it made no difference :(



Sorry for the late reply, just saw I had written this email but never 
actually sent it.


So... Nuts.  I was hoping for at least a little gain if you dropped it to 1.



In general, though, unless it's something egregious, are we really
looking for something CPU-bound? Writes are 2 orders of magnitude
slower than the underlying local disk. This has to be caused by
something wildly inefficient.



In this case I would expect it to be entirely latency bound.  It didn't 
look like PWL was working particularly hard, but to the extent that it 
was doing anything, it looked like it was spending a surprising amount 
of time dealing with that lock.  I still suspect that if your goal is to 
reduce the 99% tail latency, you'll need to figure out what's causing the 
little micro-stalls.





I have had a thought: the guest filesystem has 512 byte blocks, but
the pwl filesystem has 4k blocks (on a 4k disk). Given that the test
is of small writes, is there any chance that we're multiplying the
number of physical writes in some pathological manner?
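
A way to sanity-check that might be something like the following (just a 
sketch; the device path and mount point are placeholders for the actual 
PWL cache device and filesystem):

    # logical and physical sector sizes of the cache device
    blockdev --getss --getpbsz /dev/nvme0n1
    # filesystem block size of the PWL cache filesystem (if it is ext4)
    dumpe2fs -h /dev/nvme0n1p1 | grep 'Block size'
    # 512-byte O_DIRECT sync writes straight onto that filesystem
    # (only valid if the device exposes 512-byte logical sectors)
    fio --name=pwl-512b --directory=/mnt/pwl --rw=randwrite --bs=512 \
        --direct=1 --sync=1 --iodepth=1 --size=256M --runtime=30 --time_based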

Matt


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing

2023-07-16 Thread Christian Wuerdig
Based on my understanding of CRUSH, it basically works down the hierarchy
and randomly (but deterministically for a given CRUSH map) picks buckets
on each level for the object, based on the specific selection rule, and
then does this recursively until it ends up at the leaf nodes.
Given that you introduced a whole hierarchy level just below the top,
objects will now be distributed differently, since the pseudo-random,
hash-based selection strategy may now, for example, put an object that
used to be in node-4 under FSN-DC16 instead.
So basically, when you fiddle with the hierarchy you can generally expect
lots of data movement everywhere downstream of your change.
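
If you want to quantify the expected movement before committing to a
change, something along these lines should work (a sketch; recent
versions of crushtool, which ships with Ceph, have a --compare option):

    ceph osd getcrushmap -o crush.orig
    crushtool -d crush.orig -o crush.txt     # decompile, then edit crush.txt
    crushtool -c crush.txt -o crush.new      # recompile the edited map
    crushtool -i crush.new --compare crush.orig   # report how many mappings change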

On Sun, 16 Jul 2023 at 06:03, Niklas Hambüchen  wrote:

> Hi Ceph users,
>
> I have a Ceph 16.2.7 cluster that so far has been replicated over the
> `host` failure domain.
> All `hosts` have been chosen to be in different `datacenter`s, so that was
> sufficient.
>
> Now I wish to add more hosts, including some in already-used data centers,
> so I'm planning to use CRUSH's `datacenter` failure domain instead.
>
> My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph
> decides that it should now rebalance the entire cluster.
> This seems unnecessary, and wrong.
>
> Before, `ceph osd tree` (some OSDs omitted for legibility):
>
>
>  ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
> PRI-AFF
>   -1 440.73514  root default
>   -3 146.43625  host node-4
>2hdd   14.61089  osd.2up   1.0
> 1.0
>3hdd   14.61089  osd.3up   1.0
> 1.0
>   -7 146.43625  host node-5
>   14hdd   14.61089  osd.14   up   1.0
> 1.0
>   15hdd   14.61089  osd.15   up   1.0
> 1.0
>  -10 146.43625  host node-6
>   26hdd   14.61089  osd.26   up   1.0
> 1.0
>   27hdd   14.61089  osd.27   up   1.0
> 1.0
>
>
> After assigning of `datacenter` crush buckets:
>
>
>  ID   CLASS  WEIGHT TYPE NAMESTATUS  REWEIGHT
> PRI-AFF
>   -1 440.73514  root default
>  -18 146.43625  datacenter FSN-DC16
>   -7 146.43625  host node-5
>   14hdd   14.61089  osd.14   up   1.0
> 1.0
>   15hdd   14.61089  osd.15   up   1.0
> 1.0
>  -17 146.43625  datacenter FSN-DC18
>  -10 146.43625  host node-6
>   26hdd   14.61089  osd.26   up   1.0
> 1.0
>   27hdd   14.61089  osd.27   up   1.0
> 1.0
>  -16 146.43625  datacenter FSN-DC4
>   -3 146.43625  host node-4
>2hdd   14.61089  osd.2up   1.0
> 1.0
>3hdd   14.61089  osd.3up   1.0
> 1.0
>
>
> This shows that the tree is essentially unchanged, it just "gained a
> level".
>
> In `ceph status` I now get:
>
>  pgs: 1167541260/1595506041 objects misplaced (73.177%)
>
> If I remove the `datacenter` level again, then the misplacement disappears.
>
> On a minimal testing cluster, this misplacement issue did not appear.
>
> Why does Ceph think that these objects are misplaced when I add the
> datacenter level?
> Is there a more correct way to do this?
>
>
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-16 Thread Luis Domingues
Hi,

Thanks for your hints. I tried to play a little bit with the configs, and now I 
want to put the 0.7 value as the default.

So I configured ceph:

  mgr   advanced   mgr/cephadm/autotune_memory_target_ratio   0.70   *
  osd   advanced   osd_memory_target_autotune                 true

And I ended up having these configs:

  osd   host:st10-cbosd-001   basic   osd_memory_target   7219293672
  osd   host:st10-cbosd-002   basic   osd_memory_target   7219293672
  osd   host:st10-cbosd-004   basic   osd_memory_target   7219293672
  osd   host:st10-cbosd-005   basic   osd_memory_target   7219293451
  osd   host:st10-cbosd-006   basic   osd_memory_target   7219293451
  osd   host:st11-cbosd-007   basic   osd_memory_target   7216821484
  osd   host:st11-cbosd-008   basic   osd_memory_target   7216825454

And running ceph orch ps gave me:

osd.0    st11-cbosd-007.plabs.ch  running (2d)   10m ago  10d  25.8G  6882M  16.2.13  327f301eff51  29a075f2f925
osd.1    st10-cbosd-001.plabs.ch  running (19m)   8m ago  10d  2115M  6884M  16.2.13  327f301eff51  df5067bde5ce
osd.10   st10-cbosd-005.plabs.ch  running (2d)   10m ago  10d  5524M  6884M  16.2.13  327f301eff51  f7bc0641ee46
osd.100  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  5234M  6882M  16.2.13  327f301eff51  74efa243b953
osd.101  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  4741M  6882M  16.2.13  327f301eff51  209671007c65
osd.102  st11-cbosd-008.plabs.ch  running (2d)   10m ago  10d  5174M  6882M  16.2.13  327f301eff51  63691d557732

So far so good.

But when I took a look at the memory usage of my OSDs, I was below that 
value by quite a bit. Looking at the OSDs themselves, I have:

"bluestore-pricache": {
"target_bytes": 4294967296,
"mapped_bytes": 1343455232,
"unmapped_bytes": 16973824,
"heap_bytes": 1360429056,
"cache_bytes": 2845415832
},

And if I get the running config:
"osd_memory_target": "4294967296",
"osd_memory_target_autotune": "true",
"osd_memory_target_cgroup_limit_ratio": "0.80",

This is not the value I observe in the config: I have 4294967296 instead of 
something around 7219293672. Did I miss something?

Luis Domingues
Proton AG


--- Original Message ---
On Tuesday, July 11th, 2023 at 18:10, Mark Nelson  wrote:


> On 7/11/23 09:44, Luis Domingues wrote:
> 
> > "bluestore-pricache": {
> > "target_bytes": 6713193267,
> > "mapped_bytes": 6718742528,
> > "unmapped_bytes": 467025920,
> > "heap_bytes": 7185768448,
> > "cache_bytes": 4161537138
> > },
> 
> 
> Hi Luis,
> 
> 
> Looks like the mapped bytes for this OSD process is very close to (just
> a little over) the target bytes that has been set when you did the perf
> dump. There is some unmapped memory that can be reclaimed by the kernel,
> but we can't force the kernel to reclaim it. It could be that the
> kernel is being a little lazy if there isn't memory pressure.
> 
> The way the memory autotuning works in Ceph is that periodically the
> prioritycache system will look at the mapped memory usage of the
> process, then grow/shrink the aggregate size of the in-memory caches to
> try and stay near the target. It's reactive in nature, meaning that it
> can't completely control for spikes. It also can't shrink the caches
> below a small minimum size, so if there is a memory leak it will help to
> an e

[ceph-users] Re: OSD memory usage after cephadm adoption

2023-07-16 Thread Sridhar Seshasayee
Hello Luis,

Please see my response below:

> But when I took a look on the memory usage of my OSDs, I was below of that
> value, by quite a bite. Looking at the OSDs themselves, I have:
>
> "bluestore-pricache": {
> "target_bytes": 4294967296,
> "mapped_bytes": 1343455232,
> "unmapped_bytes": 16973824,
> "heap_bytes": 1360429056,
> "cache_bytes": 2845415832
> },
>
> And if I get the running config:
> "osd_memory_target": "4294967296",
> "osd_memory_target_autotune": "true",
> "osd_memory_target_cgroup_limit_ratio": "0.80",
>
> Which is not the value I observe from the config. I have 4294967296
> instead of something around 7219293672. Did I miss something?
>
>
This is very likely due to https://tracker.ceph.com/issues/48750. The fix
was recently merged into
the main branch and should be backported soon all the way to pacific.

Until then, the workaround would be to set the osd_memory_target on each
OSD individually to
the desired value.
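
For example (a sketch; the target value here is just the one from your
config dump, so adjust it per host if your hosts differ):

    for id in $(ceph osd ls); do
        ceph config set osd.$id osd_memory_target 7219293672
    done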

-Sridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io