[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread Marc Roos
 

I would not call a Ceph page a random tuning tip, at least I hope they 
are not. NVMe-only with 100 Gbit is not really a standard setup. I assume 
with such a setup you have the luxury of not noticing many optimizations. 

What I mostly read is that changing to MTU 9000 will allow you to better 
saturate a 10 Gbit adapter, and I expect this to show on a low-end, busy 
cluster. Don't you have any test results from such a setup?




-Original Message-

Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not 
working after setting MTU 9000

Don't optimize stuff without benchmarking *before and after*, don't 
apply random tuning tips from the Internet without benchmarking them.

My experience with Jumbo frames: 3% more performance, on an NVMe-only 
setup with a 100 Gbit/s network.
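A concrete before/after run can be as simple as repeating the exact same 
benchmark around the change; a minimal sketch (pool name, duration and 
thread count are only examples, adjust to your environment):

    # baseline on the current MTU / settings
    rados bench -p rbd 60 write -t 16
    # ...apply the MTU change (or whatever tuning you want to judge), then repeat
    rados bench -p rbd 60 write -t 16
    # compare bandwidth and average/max latency between the two runs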

Paul


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, May 26, 2020 at 7:02 PM Marc Roos  
wrote:




Look what I have found!!! :)
https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/ 



-Original Message-
From: Anthony D'Atri [mailto:anthony.da...@gmail.com] 
Sent: maandag 25 mei 2020 22:12
To: Marc Roos
Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; doustar
Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not 

working after setting MTU 9000

Quick and easy depends on your network infrastructure.  Sometimes it is 
difficult or impossible to retrofit a live cluster without disruption.


> On May 25, 2020, at 1:03 AM, Marc Roos wrote:
> 
> 
> I am interested. I am always setting MTU to 9000. To be honest I 
> cannot imagine there is no optimization, since you have fewer 
> interrupt requests and you can move x times as much data per packet. 
> Every time something is written about optimizing, the first thing 
> mentioned is changing to MTU 9000, because it is a quick and easy win.
> 
> 
> 
> 
> -Original Message-
> From: Dave Hall [mailto:kdh...@binghamton.edu]
> Sent: maandag 25 mei 2020 5:11
> To: Martin Verges; Suresh Rama
> Cc: Amudhan P; Khodayar Doustar; ceph-users
> Subject: [ceph-users] Re: [External Email] Re: Ceph Nautius not 
> working after setting MTU 9000
> 
> All,
> 
> Regarding Martin's observations about Jumbo Frames
> 
> I have recently been gathering some notes from various internet 
> sources regarding Linux network performance, and Linux performance in 
> general, to be applied to a Ceph cluster I manage but also to the rest 
> of the Linux server farm I'm responsible for.
> 
> In short, enabling Jumbo Frames without also tuning a number of other 
> kernel and NIC attributes will not provide the performance increases 
> we'd like to see.  I have not yet had a chance to go through the rest 
> of the testing I'd like to do, but I can confirm (via iperf3) that 
> only enabling Jumbo Frames didn't make a significant difference.
> 
> Some of the other attributes I'm referring to are incoming and 
> outgoing buffer sizes at the NIC, IP, and TCP levels, interrupt 
> coalescing, NIC offload functions that should or shouldn't be turned 
> on, packet queuing disciplines (tc), the best choice of TCP slow-start 
> algorithms, and other TCP features and attributes.
> 
> The most off-beat item I saw was something about adding IPTABLES rules 
> to bypass CONNTRACK table lookups.
> 
> In order to do anything meaningful to assess the effect of all of 
> these settings I'd like to figure out how to set them all via Ansible 
> - so more to learn before I can give opinions.
> 
> -->  If anybody has added this type of configuration to Ceph Ansible, 
> I'd be glad for some pointers.
> 
> I have started to compile a document containing my notes.  It's rough, 
> but I'd be glad to share if anybody is interested.
> 
> -Dave
> 
> Dave Hall
> Binghamton University
> 
>> On 5/24/2020 12:29 PM, Martin Verges wrote:
>> 
>> Just save yourself the trouble. You won't have any real benefit from 
>> MTU 9000. It has some smallish benefit, but it is not worth the 
>> effort, problems, and loss of reliability for most environments.
>> Try it yourself and do some benchmarks, especially with your regular 
>> workload on the cluster
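
On the CONNTRACK rules Dave mentions above: the usual trick is raw-table 
NOTRACK rules for the Ceph ports. A sketch only, assuming the default msgr 
port 6789 and the standard 6800-7300 OSD range, and that conntrack has 
actually been identified as a bottleneck first:

    # exempt Ceph traffic from connection tracking (run on the Ceph nodes)
    iptables -t raw -A PREROUTING -p tcp -m multiport --dports 6789,6800:7300 -j NOTRACK
    iptables -t raw -A OUTPUT     -p tcp -m multiport --sports 6789,6800:7300 -j NOTRACK
    # check whether the conntrack table was anywhere near its limit beforehand
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max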

[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread Chris Palmer
To elaborate on some aspects that have been mentioned already and add 
some others:


 * Test using iperf3.
 * Don't try to use jumbos on networks where you don't have complete
   control over every host. This usually includes the main ceph
   network. It's just too much grief. You can consider using it for
   limited-access networks (e.g. ceph cluster network, hypervisor
   migration network, etc) where you know every switch & host is tuned
   correctly. (This works even when those nets share a vlan trunk with
   non-jumbo vlans - just set the max value on the trunk itself, and
   individual values on each vlan.)
 * If you are pinging, make sure the packets don't fragment, otherwise you
   will get misleading results. Note that -s sets the ICMP payload size, so
   to verify a 9000-byte MTU use e.g. ping -M do -s 8972 x.x.x.x (9000 minus
   28 bytes of IP and ICMP headers).
 * Do not assume that 9000 is the best value. It depends on your NICs,
   your switch, kernel/device parameters, etc. Try different values
   (using iperf3). As an example the results below are using a small
    cheap MikroTik 10G switch and HPE 10G NICs. It highlights how in
   this configuration 9000 is worse than 1500, but that 5139 is optimal
   yet 5140 is worst. The same pattern (obviously with different
   values) was apparent when multiple tests were run concurrently.
   Always test your own network in a controlled manner. And of course
   if you introduce anything different later on, test again. With
   enterprise-grade kit this might not be so common, but always test if
   you fiddle.

MTU  Gbps  (actual data transfer values using iperf3)  - one particular 
configuration only


9600 8.91 (max value)
9000 8.91
8000 8.91
7000 8.91
6000 8.91
5500 8.17
5200 7.71
5150 7.64
5140 7.62
5139 9.81 (optimal)
5138 9.81
5137 9.81
5135 9.81
5130 9.81
5120 9.81
5100 9.81
5000 9.81
4000 9.76
3000 9.68
2000 9.28
1500 9.37 (default)
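
A sweep like the one above is easy to script; a rough sketch, assuming the 
interface is called eth0 and an iperf3 server is already listening on the 
far end (both are assumptions), and that the switch and the far end already 
accept jumbo frames:

    #!/bin/bash
    SERVER=10.0.0.13    # far-end iperf3 server (example address)
    IF=eth0             # interface under test (assumption)
    for mtu in 1500 2000 3000 4000 5000 6000 7000 8000 9000; do
        ip link set dev "$IF" mtu "$mtu"
        # confirm the path really carries this MTU (ICMP payload = MTU - 28)
        ping -c 3 -M do -s $((mtu - 28)) "$SERVER" >/dev/null || \
            echo "MTU $mtu: path MTU check failed"
        # print "<mtu> <rate> <unit>" taken from the iperf3 receiver summary line
        iperf3 -c "$SERVER" -t 10 | awk -v m="$mtu" '/receiver/ {print m, $7, $8}'
    done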

Whether any of this will make a tangible difference for ceph is moot. I 
just spend a little time getting the network stack correct as above, 
then leave it. That way I know I am probably getting some benefit, and 
not doing any harm. If you blindly change things you may well do harm 
that can manifest itself in all sorts of ways outside of Ceph. Getting 
some test results for this using Ceph will be easy; getting MEANINGFUL 
results that way will be hard.


Chris

On 27/05/2020 09:25, Marc Roos wrote:
  


I would not call a ceph page, a random tuning tip. At least I hope they
are not. NVMe-only with 100Gbit is not really a standard setup. I assume
with such setup you have the luxury to not notice many optimizations.

What I mostly read is that changing to mtu 9000 will allow you to better
saturate the 10Gbit adapter, and I expect this to show on a low end busy
cluster. Don't you have any test results of such a setup?




-Original Message-

Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not
working after setting MTU 9000

Don't optimize stuff without benchmarking *before and after*, don't
apply random tuning tipps from the Internet without benchmarking them.

My experience with Jumbo frames: 3% performance. On a NVMe-only setup
with 100 Gbit/s network.

Paul


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, May 26, 2020 at 7:02 PM Marc Roos 
wrote:




Look what I have found!!! :)
https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/



-Original Message-
From: Anthony D'Atri [mailto:anthony.da...@gmail.com]
Sent: maandag 25 mei 2020 22:12
To: Marc Roos
Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; doustar
Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not

working after setting MTU 9000

Quick and easy depends on your network infrastructure.  Sometimes
it is
difficult or impossible to retrofit a live cluster without
disruption.


> On May 25, 2020, at 1:03 AM, Marc Roos 

wrote:
>
> 
> I am interested. I am always setting mtu to 9000. To be honest I
> cannot imagine there is no optimization since you have less
interrupt
> requests, and you are able x times as much data. Every time there

> something written about optimizing the first thing mention is
changing

> to the mtu 9000. Because it is quick and easy win.
>
>
>
>
> -Original Message-
> From: Dave Hall [mailto:kdh...@binghamton.edu]
> Sent: maandag 25 mei 2020 5:11
> To: Martin Verges; Suresh Rama
> Cc: Amudhan P; Khodayar Doustar; ceph-users
> Subject: [ceph-users] Re: [External Email] Re: Ceph Nautius not
> working after setting MTU 9000
>
> All,
>
> Regarding Martin's observations about Jumbo Frames
>
> I have recently been gathering 

[ceph-users] Re: Cannot repair inconsistent PG

2020-05-27 Thread Daniel Aberger - Profihost AG
Hi,

(un)fortunately I can't test it because I managed to repair the pg.

snaptrim and snaptrim_wait have been a part of this particular pg's
status. As I was trying to look deeper into the case I had a watch on
ceph health detail and noticed that snaptrim/snaptrim_wait was suddenly
not a part of the status anymore.

So I gave it another try with ceph pg repair 18.19a and suddenly the
pg's status changed to active+clean+inconsistent+repair. It repaired
successfully.

Is snaptrim somehow blocking repair instructions? I would have thought
that repair instructions would be queued up until they can be performed,
but it does not seem to work as I expected.

Anyway I'll keep your script in mind and give it a shot if it happens
again. Thank you :)

Daniel

Am 25.05.20 um 17:40 schrieb Dan van der Ster:
> Hi,
> 
> Does this help?
> 
> https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh
> 
> Cheers, Dan
> 
> On Mon, May 25, 2020 at 5:18 PM Daniel Aberger - Profihost AG
>  wrote:
>>
>> Hello,
>>
>> we are currently experiencing problems with ceph pg repair not working
>> on Ceph Nautilus 14.2.8.
>>
>> ceph health detail is showing us an inconsistent pg:
>>
>> [ax- ~]# ceph health detail
>> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 18.19a is active+clean+inconsistent+snaptrim_wait, acting
>> [21,15,39,18,0,9]
>>
>> when we try to repair it, nothing happens.
>>
>> [ax- ~]# ceph pg repair 18.19a
>> instructing pg 18.19as0 on osd.21 to repair
>>
>> There are no new entries in OSD 21's log file.
>>
>> We have no trouble repairing pgs in our other clusters so I assume it
>> might have to be something related to this cluster using Erasure
>> Codings. But this is just a wild guess.
>>
>> I found a similar problem in this mailing list -
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-April/026304.html
>>
>> Unfortunately the solution of waiting more than a week until it fixes
>> itself isn't quite satisfying.
>>
>> Is there anyone who has had similar issues and knows how to repair these
>> inconsistent pgs or what is causing the delay?
>>
>>
>> --
>> Mit freundlichen Grüßen
>>   Daniel Aberger
>> Ihr Profihost Team
>>
>> ---
>> Profihost AG
>> Expo Plaza 1
>> 30539 Hannover
>> Deutschland
>>
>> Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
>> URL: http://www.profihost.com | E-Mail: i...@profihost.com
>>
>> Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
>> Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
>> Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
>> Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread Marc Roos
 
Interesting table. I get this on a production cluster with 10 Gbit at a 
datacenter (which is obviously not doing that much). 


[@]# iperf3 -c 10.0.0.13 -P 1 -M 9000
Connecting to host 10.0.0.13, port 5201
[  4] local 10.0.0.14 port 52788 connected to 10.0.0.13 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.14 GBytes  9.77 Gbits/sec    0    690 KBytes
[  4]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.08 MBytes
[  4]   2.00-3.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   3.00-4.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   4.00-5.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.21 MBytes
[  4]   6.00-7.00   sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
[  4]   7.00-8.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.21 MBytes
[  4]   8.00-9.00   sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
[  4]   9.00-10.00  sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec                  receiver
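
Note that iperf3's -M option sets the TCP maximum segment size, not the 
interface MTU, so the run above still used whatever MTU the NIC is 
configured with. To see what is actually in effect (the interface name here 
is an assumption):

    ip link show dev eth0 | grep -o 'mtu [0-9]*'   # configured interface MTU
    tracepath 10.0.0.13                            # reports the path MTU hop by hop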


-Original Message-
Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not 
working after setting MTU 9000

To elaborate on some aspects that have been mentioned already and add 
some others::


*   Test using iperf3. 

*   Don't try to use jumbos on networks where you don't have complete 
control over every host. This usually includes the main ceph network. 
It's just too much grief. You can consider using it for limited-access 
networks (e.g. ceph cluster network, hypervisor migration network, etc) 
where you know every switch & host is tuned correctly. (This works even 
when those nets share a vlan trunk with non-jumbo vlans - just set the 
max value on the trunk itself, and individual values on each vlan.)

*   If you are pinging make sure it doesn't fragment otherwise you 
will get misleading results: e.g. ping -M do -s 9000 x.x.x.x
*   Do not assume that 9000 is the best value. It depends on your 
NICs, your switch, kernel/device parameters, etc. Try different values 
(using iperf3). As an example the results below are using a small cheap 
Mikrotek 10G switch and HPE 10G NICs. It highlights how in this 
configuration 9000 is worse than 1500, but that 5139 is optimal yet 5140 
is worst. The same pattern (obviously with different values) was 
apparent when multiple tests were run concurrently. Always test your own 
network in a controlled manner. And of course if you introduce anything 
different later on, test again. With enterprise-grade kit this might not 
be so common, but always test if you fiddle.


MTU  Gbps  (actual data transfer values using iperf3)  - one particular 
configuration only

9600 8.91 (max value)
9000 8.91
8000 8.91
7000 8.91
6000 8.91
5500 8.17
5200 7.71
5150 7.64
5140 7.62
5139 9.81 (optimal)
5138 9.81
5137 9.81
5135 9.81
5130 9.81
5120 9.81
5100 9.81
5000 9.81
4000 9.76
3000 9.68
2000 9.28
1500 9.37 (default)


Whether any of this will make a tangible difference for ceph is moot. I 
just spend a little time getting the network stack correct as above, 
then leave it. That way I know I am probably getting some benefit, and 
not doing any harm. If you blindly change things you may well do harm 
that can manifest itself in all sorts of ways outside of Ceph. Getting 
some test results for this using Ceph will be easy; getting MEANINGFUL 
results that way will be hard.


Chris


On 27/05/2020 09:25, Marc Roos wrote:


 

I would not call a ceph page, a random tuning tip. At least I hope 
they 
are not. NVMe-only with 100Gbit is not really a standard setup. I 
assume 
with such setup you have the luxury to not notice many 
optimizations. 

What I mostly read is that changing to mtu 9000 will allow you to 
better 
saturate the 10Gbit adapter, and I expect this to show on a low end 
busy 
cluster. Don't you have any test results of such a setup?




-Original Message-

Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not 

working after setting MTU 9000

Don't optimize stuff without benchmarking *before and after*, don't 

apply random tuning tipps from the Internet without benchmarking 
them.

My experience with Jumbo frames: 3% performance. On a NVMe-only 
setup 
with 100 Gbit/s network.

Paul


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at 
https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www

[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread EDH - Manuel Rios
Can anyone share their table with other MTU values?

Also interested in the switch CPU load.

KR,
Manuel

-Mensaje original-
De: Marc Roos  
Enviado el: miércoles, 27 de mayo de 2020 12:01
Para: chris.palmer ; paul.emmerich 

CC: amudhan83 ; anthony.datri ; 
ceph-users ; doustar ; kdhall 
; sstkadu 
Asunto: [ceph-users] Re: [External Email] Re: Ceph Nautius not working after 
setting MTU 9000

 
Interesting table. I have this on a production cluster 10gbit at a 
datacenter (obviously doing not that much). 


[@]# iperf3 -c 10.0.0.13 -P 1 -M 9000
Connecting to host 10.0.0.13, port 5201
[  4] local 10.0.0.14 port 52788 connected to 10.0.0.13 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.14 GBytes  9.77 Gbits/sec    0    690 KBytes
[  4]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.08 MBytes
[  4]   2.00-3.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   3.00-4.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   4.00-5.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.08 MBytes
[  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.21 MBytes
[  4]   6.00-7.00   sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
[  4]   7.00-8.00   sec  1.15 GBytes  9.88 Gbits/sec    0   1.21 MBytes
[  4]   8.00-9.00   sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
[  4]   9.00-10.00  sec  1.15 GBytes  9.89 Gbits/sec    0   1.21 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  11.5 GBytes  9.87 Gbits/sec                  receiver


-Original Message-
Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not 
working after setting MTU 9000

To elaborate on some aspects that have been mentioned already and add 
some others::


*   Test using iperf3. 

*   Don't try to use jumbos on networks where you don't have complete 
control over every host. This usually includes the main ceph network. 
It's just too much grief. You can consider using it for limited-access 
networks (e.g. ceph cluster network, hypervisor migration network, etc) 
where you know every switch & host is tuned correctly. (This works even 
when those nets share a vlan trunk with non-jumbo vlans - just set the 
max value on the trunk itself, and individual values on each vlan.)

*   If you are pinging make sure it doesn't fragment otherwise you 
will get misleading results: e.g. ping -M do -s 9000 x.x.x.x
*   Do not assume that 9000 is the best value. It depends on your 
NICs, your switch, kernel/device parameters, etc. Try different values 
(using iperf3). As an example the results below are using a small cheap 
Mikrotek 10G switch and HPE 10G NICs. It highlights how in this 
configuration 9000 is worse than 1500, but that 5139 is optimal yet 5140 
is worst. The same pattern (obviously with different values) was 
apparent when multiple tests were run concurrently. Always test your own 
network in a controlled manner. And of course if you introduce anything 
different later on, test again. With enterprise-grade kit this might not 
be so common, but always test if you fiddle.


MTU  Gbps  (actual data transfer values using iperf3)  - one particular 
configuration only

9600 8.91 (max value)
9000 8.91
8000 8.91
7000 8.91
6000 8.91
5500 8.17
5200 7.71
5150 7.64
5140 7.62
5139 9.81 (optimal)
5138 9.81
5137 9.81
5135 9.81
5130 9.81
5120 9.81
5100 9.81
5000 9.81
4000 9.76
3000 9.68
2000 9.28
1500 9.37 (default)


Whether any of this will make a tangible difference for ceph is moot. I 
just spend a little time getting the network stack correct as above, 
then leave it. That way I know I am probably getting some benefit, and 
not doing any harm. If you blindly change things you may well do harm 
that can manifest itself in all sorts of ways outside of Ceph. Getting 
some test results for this using Ceph will be easy; getting MEANINGFUL 
results that way will be hard.


Chris


On 27/05/2020 09:25, Marc Roos wrote:


 

I would not call a ceph page, a random tuning tip. At least I hope 
they 
are not. NVMe-only with 100Gbit is not really a standard setup. I 
assume 
with such setup you have the luxury to not notice many 
optimizations. 

What I mostly read is that changing to mtu 9000 will allow you to 
better 
saturate the 10Gbit adapter, and I expect this to show on a low end 
busy 
cluster. Don't you have any test results of such a setup?




-Original Message-

Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not 

working after setting MTU 9000

Don't optimize stuff without benchmarking *before and after*, don't 

apply random tuning tipps from the Interne

[ceph-users] Re: osds dropping out of the cluster w/ "OSD::osd_op_tp thread … had timed out"

2020-05-27 Thread thoralf schulze
hi there -

On 5/19/20 3:11 PM, thoralf schulze wrote:

> […] and report back … 

I tried to reproduce the issue with OSDs each using 37 GB of SSD storage
for DB and WAL. Everything went fine, so yes, spillovers are to be avoided.

thank you very much & with kind regards,
thoralf.



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] High latency spikes under jewel

2020-05-27 Thread Bence Szabo
Hi,
We experienced random and relatively high latency spikes (around 0.5-10 sec)
in our Ceph cluster, which consists of 6 OSD nodes; each OSD node has 6 OSDs.
Each OSD is built from one spinning disk and two NVMe devices.
We use a bcache device for the OSD back end (the HDD combined with an NVMe
partition as caching device) and one NVMe partition for the journal.
This synthetic command can be used to check IO and latency:
rados bench -p rbd 10 write -b 4000 -t 64
With these parameters we often got about 1.5 sec or higher maximum
latency.
We cannot decide whether our cluster is misconfigured or whether this is
just natural Ceph behavior.
Any help or suggestion would be appreciated.
Regards,
Bence

-- 
--Szabo Bence
--
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 15.2.2 bluestore issue

2020-05-27 Thread Paul Emmerich
Hi,

since this bug may lead to data loss when several OSDs crash at the same
time (e.g., after a power outage): can we pull the release from the mirrors
and docker hub?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, May 20, 2020 at 7:18 PM Josh Durgin  wrote:

> Hi folks, at this time we recommend pausing OSD upgrades to 15.2.2.
>
> There have been a couple reports of OSDs crashing due to rocksdb
> corruption after upgrading to 15.2.2 [1] [2]. It's safe to upgrade
> monitors and mgr, but OSDs and everything else should wait.
>
> We're investigating and will get a fix out as soon as we can. You
> can follow progress on this tracker:
>
>https://tracker.ceph.com/issues/45613
>
> Josh
>
> [1]
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CX5PRFGL6UBFMOJC6CLUMLPMT4B2CXVQ/
> [2]
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CWN7BNPGSRBKZHUF2D7MDXCOAE3U2ERU/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High latency spikes under jewel

2020-05-27 Thread Paul Emmerich
Common problem for FileStore and really no point in debugging this: upgrade
everything to a recent version and migrate to BlueStore.
99% of random latency spikes are just fixed by doing that.
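
To see how many OSDs are still on FileStore before planning that migration, 
a quick check along these lines works (sketch only):

    ceph osd metadata | grep '"osd_objectstore"' | sort | uniq -c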


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, May 27, 2020 at 3:26 PM Bence Szabo  wrote:

> Hi,
> We experienced random and relative high latency spikes (around 0.5-10 sec)
> in our ceph cluster which consists 6 osd nodes, all osd nodes have 6 osd-s.
> One osd built with one spinning disk and two nvme device.
> We use a bcache device for osd back end (mixed with hdd and an nvme
> partition as caching device) and one nvme partition for journal.
> This synthetic command can be use for check io and latency:
> rados bench -p rbd 10 write -b 4000 -t 64
> With this parameters we often got about 1.5 sec or higher for maximum
> latency.
> We cannot decide if our cluster is misconfigured or just this is a natural
> ceph behavior.
> Any help, suggestion would be appreciated.
> Regards,
> Bence
>
> --
> --Szabo Bence
> --
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Luminous, OSDs down: "osd init failed" and "failed to load OSD map for epoch ... got 0 bytes"

2020-05-27 Thread Fulvio Galeazzi

Hallo Dan, all.

My attempt with ceph-bluestore-tool did not lead to a working OSD,
so I decided to re-create all OSDs, as there were quite a few of them and my 
cluster was rather unbalanced.
Too bad I could not get any insight into what caused the issue on the 
OSDs for object storage; however, I will update to Nautilus in a month 
or so, so I decided to consider it "history".


  Thanks again Dan for your help!

Fulvio

Il 5/22/2020 10:43 PM, Fulvio Galeazzi ha scritto:

Hallo Dan, thanks for your patience!

Il 5/22/2020 1:57 PM, Dan van der Ster ha scritto:

The procedure to overwrite a corrupted osdmap on a given osd is
described at 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036592.html 


I wouldn't do that type of low level manipulation just yet -- better
to understand the root cause of he corruptions first before
potentially making things worse.


Ok.


There is also this issue which seems related:
https://tracker.ceph.com/issues/24423
(It has a fix in mimic and nautilus).

Could you share some more logs e.g. with the full backtrace from the
time they first crashed, and now failing to start. And maybe
/var/log/messages shows crc mismatches?


Please find at https://pastebin.com/UjfmPT37 the first occurrence of 
problem on OSD 72, and at https://pastebin.com/paumyAZ1
what happens when I start ceph-osd@72 after leaving it stopped for one 
minute (the same open_db failure is the only thing I find in 
/var/log/messages).


On a different OSD (63), same kind as 72, I see in /var/log/messages:

May 21 13:20:40 r1srv07 ceph-osd: 2020-05-21 13:20:40.087529 
7fa90a000700 -1 bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad 
crc32c/0x1000 checksum at blob offset 0x4f000, got 0x69b740ce, expected 
0x1d9fdf3, device location [0x3105041f000~1000], logical extent 
0x3cf000~1000, object 
#1:0d1f4ad3:::rbd_data.71b487741226bb.3d1a:head#
May 21 13:20:40 r1srv07 ceph-osd: 2020-05-21 13:20:40.088194 
7fa90a000700 -1 bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad 
crc32c/0x1000 checksum at blob offset 0x4f000, got 0x69b740ce, expected 
0x1d9fdf3, device location [0x3105041f000~1000], logical extent 
0x3cf000~1000, object 
#1:0d1f4ad3:::rbd_data.71b487741226bb.3d1a:head#
May 21 13:20:40 r1srv07 ceph-osd: 2020-05-21 13:20:40.088856 
7fa90a000700 -1 bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad 
crc32c/0x1000 checksum at blob offset 0x4f000, got 0x69b740ce, expected 
0x1d9fdf3, device location [0x3105041f000~1000], logical extent 
0x3cf000~1000, object 
#1:0d1f4ad3:::rbd_data.71b487741226bb.3d1a:head#
May 21 13:20:40 r1srv07 ceph-osd: 2020-05-21 13:20:40.089659 
7fa90a000700 -1 bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad 
crc32c/0x1000 checksum at blob offset 0x4f000, got 0x69b740ce, expected 
0x1d9fdf3, device location [0x3105041f000~1000], logical extent 
0x3cf000~1000, object 
#1:0d1f4ad3:::rbd_data.71b487741226bb.3d1a:head#
May 21 13:20:40 r1srv07 ceph-osd: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/os/bluestore/BlueStore.cc: 
In function 'void BlueStore::_do_write_small(BlueStore::TransContext*, 
BlueStore::CollectionRef&, BlueStore::OnodeRef, uint64_t, uint64_t, 
ceph::buffer::list::iterator&, BlueStore::WriteContext*)' thread 
7fa90a000700 time 2020-05-21 13:20:40.089716
May 21 13:20:40 r1srv07 ceph-osd: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/os/bluestore/BlueStore.cc: 
10176: FAILED assert(r >= 0 && r <= (int)head_read)


For this same OSD 63 I tried ceph-bluestore-tool:

[root@r1srv07.pa1 ceph]# time ceph-bluestore-tool --path 
/var/lib/ceph/osd/cephpa1-63 --deep 1 --command repair ; date
2020-05-22 16:12:19.643443 7ff93afaeec0 -1 
bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad crc32c/0x1000 
checksum at blob offset 0x0, got 0x1ff97a1a, expected 0xb0ba2652, device 
location [0x1~1000], logical extent 0x0~1000, object 
#-1:7b3f43c4:::osd_superblock:0#
2020-05-22 16:12:19.644037 7ff93afaeec0 -1 
bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad crc32c/0x1000 
checksum at blob offset 0x0, got 0x1ff97a1a, expected 0xb0ba2652, device 
location [0x1~1000], logical extent 0x0~1000, object 
#-1:7b3f43c4:::osd_superblock:0#
2020-05-22 16:12:19.644542 7ff93afaeec0 -1 
bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad crc32c/0x1000 
checksum at blob offset 0x0, got 0x1ff97a1a, expected 0xb0ba2652, device 
location [0x1~1000], logical extent 0x0~1000, object 
#-1:7b3f43c4:::osd_superblock:0#
2020-05-22 16:12:19.645019 7ff93afaeec0 -1 
bluestore(/var/lib/ceph/osd/cephpa1-63) _verify_csum bad crc32c/0x1000 
checksum at blob offset 0x0, got 0x1ff97a1a, expected 0xb0ba2652, device 
loca

[ceph-users] Re: Cannot repair inconsistent PG

2020-05-27 Thread Dan van der Ster
Hi,

I'm not sure if the repair waits for snaptrim; but it does need a
scrub reservation on all the related OSDs, hence our script. And I've
also observed that the repair req isn't queued up -- if the OSDs are
busy with other scrubs, the repair req is forgotten.
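
For reference, one way to help the repair win its scrub reservation is to 
temporarily allow an extra scrub slot on the acting OSDs; a sketch using the 
PG and primary from this thread (whether this is needed depends on how busy 
the OSDs are):

    ceph tell osd.21 injectargs '--osd_max_scrubs=2'   # repeat for the other acting OSDs
    ceph pg repair 18.19a
    ceph pg 18.19a query | grep -w state               # wait for "repair" to show up
    ceph tell osd.21 injectargs '--osd_max_scrubs=1'   # restore the default afterwards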

-- Dan

On Wed, May 27, 2020 at 11:28 AM Daniel Aberger - Profihost AG
 wrote:
>
> Hi,
>
> (un)fortunately I can't test it because I managed to repair the pg.
>
> snaptrim and snaptrim_wait have been a part of this particular pg's
> status. As I was trying to look deeper into the case I had a watch on
> ceph health detail and noticed that snaptrim/snaptrim_wait was suddenly
> not a part of the status anymore.
>
> So I gave it another try with ceph pg repair 18.19a and suddenly the
> pg's status changed to active+clean+inconsistent+repair. It repaired
> successfully.
>
> Is snaptrim somehow blocking repair instructions? I would have thought
> that repair instructions will be queued up until they can be performed
> but it does not seem to work as I expected it to.
>
> Anyway I'll keep your script in mind and give it a shot if it happens
> again. Thank you :)
>
> Daniel
>
> Am 25.05.20 um 17:40 schrieb Dan van der Ster:
> > Hi,
> >
> > Does this help?
> >
> > https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh
> >
> > Cheers, Dan
> >
> > On Mon, May 25, 2020 at 5:18 PM Daniel Aberger - Profihost AG
> >  wrote:
> >>
> >> Hello,
> >>
> >> we are currently experiencing problems with ceph pg repair not working
> >> on Ceph Nautilus 14.2.8.
> >>
> >> ceph health detail is showing us an inconsistent pg:
> >>
> >> [ax- ~]# ceph health detail
> >> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> >> OSD_SCRUB_ERRORS 1 scrub errors
> >> PG_DAMAGED Possible data damage: 1 pg inconsistent
> >> pg 18.19a is active+clean+inconsistent+snaptrim_wait, acting
> >> [21,15,39,18,0,9]
> >>
> >> when we try to repair it, nothing happens.
> >>
> >> [ax- ~]# ceph pg repair 18.19a
> >> instructing pg 18.19as0 on osd.21 to repair
> >>
> >> There are no new entries in OSD 21's log file.
> >>
> >> We have no trouble repairing pgs in our other clusters so I assume it
> >> might have to be something related to this cluster using Erasure
> >> Codings. But this is just a wild guess.
> >>
> >> I found a similar problem in this mailing list -
> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-April/026304.html
> >>
> >> Unfortunately the solution of waiting more than a week until it fixes
> >> itself isn't quite satisfying.
> >>
> >> Is there anyone who has had similar issues and knows how to repair these
> >> inconsistent pgs or what is causing the delay?
> >>
> >>
> >> --
> >> Mit freundlichen Grüßen
> >>   Daniel Aberger
> >> Ihr Profihost Team
> >>
> >> ---
> >> Profihost AG
> >> Expo Plaza 1
> >> 30539 Hannover
> >> Deutschland
> >>
> >> Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
> >> URL: http://www.profihost.com | E-Mail: i...@profihost.com
> >>
> >> Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
> >> Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
> >> Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
> >> Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cannot repair inconsistent PG

2020-05-27 Thread Alex Gorbachev
On Wed, May 27, 2020 at 5:28 AM Daniel Aberger - Profihost AG <
d.aber...@profihost.ag> wrote:

> Hi,
>
> (un)fortunately I can't test it because I managed to repair the pg.
>
> snaptrim and snaptrim_wait have been a part of this particular pg's
> status. As I was trying to look deeper into the case I had a watch on
> ceph health detail and noticed that snaptrim/snaptrim_wait was suddenly
> not a part of the status anymore.
>
> So I gave it another try with ceph pg repair 18.19a and suddenly the
> pg's status changed to active+clean+inconsistent+repair. It repaired
> successfully.
>
> Is snaptrim somehow blocking repair instructions? I would have thought
> that repair instructions will be queued up until they can be performed
> but it does not seem to work as I expected it to.
>
> Anyway I'll keep your script in mind and give it a shot if it happens
> again. Thank you :)
>
> Daniel
>
> Am 25.05.20 um 17:40 schrieb Dan van der Ster:
> > Hi,
> >
> > Does this help?
> >
> >
> https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh
> >
> > Cheers, Dan
> >
> > On Mon, May 25, 2020 at 5:18 PM Daniel Aberger - Profihost AG
> >  wrote:
> >>
> >> Hello,
> >>
> >> we are currently experiencing problems with ceph pg repair not working
> >> on Ceph Nautilus 14.2.8.
> >>
> >> ceph health detail is showing us an inconsistent pg:
> >>
> >> [ax- ~]# ceph health detail
> >> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> >> OSD_SCRUB_ERRORS 1 scrub errors
> >> PG_DAMAGED Possible data damage: 1 pg inconsistent
> >> pg 18.19a is active+clean+inconsistent+snaptrim_wait, acting
> >> [21,15,39,18,0,9]
> >>
> >> when we try to repair it, nothing happens.
> >>
> >> [ax- ~]# ceph pg repair 18.19a
> >> instructing pg 18.19as0 on osd.21 to repair
> >>
> >> There are no new entries in OSD 21's log file.
> >>
> >> We have no trouble repairing pgs in our other clusters so I assume it
> >> might have to be something related to this cluster using Erasure
> >> Codings. But this is just a wild guess.
> >>
> >> I found a similar problem in this mailing list -
> >>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-April/026304.html
> >>
> >> Unfortunately the solution of waiting more than a week until it fixes
> >> itself isn't quite satisfying.
> >>
> >> Is there anyone who has had similar issues and knows how to repair these
> >> inconsistent pgs or what is causing the delay?
> >>
>

We expect the repair to sometimes take all night, but it does happen
eventually, and we do not see any client issues while waiting for the
repair.

--
Alex Gorbachev
Intelligent Systems Services Inc.


> >>
> >> --
> >> Mit freundlichen Grüßen
> >>   Daniel Aberger
> >> Ihr Profihost Team
> >>
> >> ---
> >> Profihost AG
> >> Expo Plaza 1
> >> 30539 Hannover
> >> Deutschland
> >>
> >> Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
> >> URL: http://www.profihost.com | E-Mail: i...@profihost.com
> >>
> >> Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
> >> Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
> >> Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
> >> Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm Hangs During OSD Apply

2020-05-27 Thread m
Hi, we are trying to migrate a second Ceph cluster to cephadm. All the hosts 
successfully migrated from "legacy" except one of the OSD hosts (cephadm kept 
duplicating OSD ids, e.g. two "osd.5", still not sure why). To make things 
easier, we re-provisioned the node (reinstalled from netinstall, applied the 
same SaltStack traits as the other nodes, wiped the disks) and tried to use 
cephadm to set up the OSDs.

So, orch correctly starts the provisioning processes (a docker container 
running ceph-volume is created). But the provisioning never completes (docker 
exec):

# ps axu
root  1  0.1  0.2  99272 22488 ?Ss   15:26   0:01 
/usr/libexec/platform-python -s /usr/sbin/ceph-volume lvm batch --no-auto 
/dev/sdb /dev/sdc --dmcrypt --yes --no-systemd
root807  0.9  0.5 154560 44120 ?S

[ceph-users] Nautilus to Octopus Upgrade mds without downtime

2020-05-27 Thread Andreas Schiefer

Hello,

if I understand correctly:
if we upgrade a running Nautilus cluster to Octopus, we will have 
downtime during the MDS update.


Is this correct?


Mit freundlichen Grüßen / Kind regards
Andreas Schiefer
Leiter Systemadministration / Head of systemadministration


---
HOME OF LOYALTY
CRM- & Customer Loyalty Solution

by UW Service
Gesellschaft für Direktwerbung und Marketingberatung mbH
Alter Deutzer Postweg 221
51107 Koeln (Rath/Heumar)
Deutschland

Telefon : +49 221 98696 0
Telefax : +49 221 98696 5222 


i...@uw-service.de
www.hooloy.de

Amtsgericht Koeln HRB 24 768
UST-ID: DE 164 191 706
Geschäftsführer: Ralf Heim
---
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm Hangs During OSD Apply

2020-05-27 Thread m
I noticed the LUKS volumes were open, even though luksOpen hung. I killed 
cryptsetup (once per disk) and ceph-volume continued and eventually created the 
OSDs for the host (yes, this node will be slated for another reinstall once 
cephadm has stabilized).

Is there a way to remove an OSD service spec with the current tooling? The 
drives are immediately locked as soon as the node is added to orch.
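
In the Octopus tooling the OSD spec shows up as a service, so removing it 
should be possible with the orchestrator CLI; whether these exact 
subcommands are available in the 15.2.x build in use is an assumption, and 
the spec name below is hypothetical:

    ceph orch ls                 # note the name of the osd service/spec, e.g. osd.my_spec
    ceph orch rm osd.my_spec     # remove the spec so it stops claiming new drives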
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fwd: [IO-500] IO500 ISC20 Call for Submission

2020-05-27 Thread John Bent
FYI.  Hope to see some awesome CephFS submissions for our virtual IO500 BoF!

Thanks,

John

-- Forwarded message -
From: committee--- via IO-500 
Date: Fri, May 22, 2020 at 1:53 PM
Subject: [IO-500] IO500 ISC20 Call for Submission
To: 


*Deadline*: 08 June 2020 AoE

The IO500  is now accepting and encouraging submissions
for the upcoming 6th IO500 list. Once again, we are also accepting
submissions to the 10 Node Challenge to encourage the submission of small
scale results. The new ranked lists will be announced via live-stream at a
virtual session. We hope to see many new results.

The benchmark suite is designed to be easy to run and the community has
multiple active support channels to help with any questions. Please note
that submissions of all sizes are welcome; the site has customizable
sorting so it is possible to submit on a small system and still get a very
good per-client score for example. Additionally, the list is about much
more than just the raw rank; all submissions help the community by
collecting and publishing a wider corpus of data. More details below.

Following the success of the Top500 in collecting and analyzing historical
trends in supercomputer technology and evolution, the IO500
 was created in 2017, published its first list at SC17,
and has grown exponentially since then. The need for such an initiative has
long been known within High-Performance Computing; however, defining
appropriate benchmarks had long been challenging. Despite this challenge,
the community, after long and spirited discussion, finally reached
consensus on a suite of benchmarks and a metric for resolving the scores
into a single ranking.

The multi-fold goals of the benchmark suite are as follows:

   1. Maximizing simplicity in running the benchmark suite
   2. Encouraging optimization and documentation of tuning parameters for
   performance
   3. Allowing submitters to highlight their “hero run” performance numbers
   4. Forcing submitters to simultaneously report performance for
   challenging IO patterns.

Specifically, the benchmark suite includes a hero-run of both IOR and mdtest
configured however possible to maximize performance and establish an
upper-bound for performance. It also includes an IOR and mdtest run with
highly constrained parameters forcing a difficult usage pattern in an
attempt to determine a lower-bound. Finally, it includes a namespace search
as this has been determined to be a highly sought-after feature in HPC
storage systems that has historically not been well-measured. Submitters
are encouraged to share their tuning insights for publication.

The goals of the community are also multi-fold:

   1. Gather historical data for the sake of analysis and to aid
   predictions of storage futures
   2. Collect tuning data to share valuable performance optimizations
   across the community
   3. Encourage vendors and designers to optimize for workloads beyond
   “hero runs”
   4. Establish bounded expectations for users, procurers, and
   administrators

*10 Node I/O Challenge*

The 10 Node Challenge is conducted using the regular IO500 benchmark,
however, with the rule that exactly *10 client nodes* must be used to run
the benchmark. You may use any shared storage with, e.g., any number of
servers. When submitting for the IO500 list, you can opt-in for
“Participate in the 10 compute node challenge only”, then we will not
include the results into the ranked list. Other 10-node node submissions
will be included in the full list and in the ranked list. We will announce
the result in a separate derived list and in the full list but not on the
ranked IO500 list at https://io500.org/.

This information and rules for ISC20 submissions are available here:
https://www.vi4io.org/io500/rules/submission


Thanks,


The IO500 Committee
___
IO-500 mailing list
io-...@vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus to Octopus Upgrade mds without downtime

2020-05-27 Thread Konstantin Shalygin

On 5/27/20 8:43 PM, Andreas Schiefer wrote:

if I understand correctly:
if we upgrade from an running nautilus cluster to octopus we have a 
downtime on an update of MDS.


Is this correct? 


This is always the case when upgrading the MDS to a new major or minor 
version: it hangs for the restart, but clients will reconnect very fast.
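
For a cluster running multiple active MDS daemons the documented sequence 
boils down to roughly this (a sketch, assuming a filesystem named cephfs and 
two active ranks; substitute your own names and counts):

    ceph fs set cephfs max_mds 1    # reduce to a single active MDS
    # wait until only rank 0 is active, upgrade and restart that MDS,
    # then upgrade and restart the standby MDS daemons
    ceph fs set cephfs max_mds 2    # restore the original number of active ranks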




k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Reducing RAM usage on production MDS

2020-05-27 Thread Dylan McCulloch
Hi all,

The single active MDS on one of our Ceph clusters is close to running out of 
RAM.

MDS total system RAM = 528GB
MDS current free system RAM = 4GB
mds_cache_memory_limit = 451GB
current mds cache usage = 426GB

Presumably we need to reduce our mds_cache_memory_limit and/or 
mds_max_caps_per_client, but would like some guidance on whether it’s possible 
to do that safely on a live production cluster when the MDS is already pretty 
close to running out of RAM.

Cluster is Luminous - 12.2.12
Running single active MDS with two standby.
890 clients
Mix of kernel client (4.19.86) and ceph-fuse.
Clients are 12.2.12 (398) and 12.2.13 (3)

The kernel clients have stayed under “mds_max_caps_per_client”: “1048576". But 
the ceph-fuse clients appear to hold very large numbers according to the 
ceph-fuse asok.
e.g.
“num_caps”: 1007144398,
“num_caps”: 1150184586,
“num_caps”: 1502231153,
“num_caps”: 1714655840,
“num_caps”: 2022826512,

Dropping caches on the clients appears to reduce their cap usage but does not 
free up RAM on the MDS.
What is the safest method to free cache and reduce RAM usage on the MDS in this 
situation (without having to evict or remount clients)?
I’m concerned that reducing mds_cache_memory_limit even in very small 
increments may trigger a large recall of caps and overwhelm the MDS.
We also considered setting a reduced mds_cache_memory_limit on both the standby 
MDS. Would a subsequent failover to an MDS with a lower cache limit be safe?
Some more details below and I’d be more than happy to provide additional logs.
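
Concretely, the kind of small step we are considering would look like this; 
the target value is only an example (about 5 GB below the current limit) and 
the MDS name is taken from the fs status output below:

    ceph tell mds.mds1-ceph2-qh2 injectargs '--mds_cache_memory_limit=445971566080'
    # then, on the MDS host, watch how the cache reacts before stepping further
    ceph daemon mds.$(hostname -s) cache status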

Thanks,
Dylan


# free -b
               total          used         free      shared  buff/cache    available
Mem:    540954992640  535268749312   4924698624   438284288   761544704   3893182464
Swap:              0             0            0

# ceph daemon mds.$(hostname -s) config get mds_cache_memory_limit
{
"mds_cache_memory_limit": "450971566080"
}

# ceph daemon mds.$(hostname -s) cache status
{
"pool": {
"items": 10593257843,
"bytes": 425176150288
}
}

# ceph daemon mds.$(hostname -s) dump_mempools | grep -A2 "mds_co\|anon"
"buffer_anon": {
"items": 3935,
"bytes": 4537932
--
"mds_co": {
"items": 10595391186,
"bytes": 425255456209

# ceph daemon mds.$(hostname -s) perf dump | jq '.mds_mem.rss'
520100552

# ceph tell mds.$(hostname) heap stats
tcmalloc heap stats:
MALLOC:   496040753720 (473061.3 MiB) Bytes in use by application
MALLOC: +  11085479936 (10571.9 MiB) Bytes in page heap freelist
MALLOC: +  22568895888 (21523.4 MiB) Bytes in central cache freelist
MALLOC: +31744 (0.0 MiB) Bytes in transfer cache freelist
MALLOC: + 34186296 (   32.6 MiB) Bytes in thread cache freelists
MALLOC: +   2802057216 ( 2672.2 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: = 532531404800 (507861.5 MiB) Actual memory used (physical + swap)
MALLOC: +   1315700736 ( 1254.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: = 533847105536 (509116.3 MiB) Virtual address space used
MALLOC:
MALLOC:   44496459  Spans in use
MALLOC: 22  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
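
The last two lines above suggest one immediate, low-risk step: asking 
tcmalloc to return the ~10 GB page-heap freelist to the OS (same addressing 
as the heap stats command above):

    ceph tell mds.$(hostname) heap release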


# ceph fs status
hpc_projects - 890 clients

+--+++---+---+---+
| Rank | State  |  MDS   |Activity   |  dns  |  inos |
+--+++---+---+---+
|  0   | active | mds1-ceph2-qh2 | Reqs:  304 /s |  167M |  167M |
+--+++---+---+---+
++--+---+---+
|Pool|   type   |  used | avail |
++--+---+---+
|   hpcfs_metadata   | metadata | 17.4G | 1893G |
| hpcfs_data |   data   | 1014T |  379T |
|   test_nvmemeta|   data   |0  | 1893G |
| hpcfs_data_sandisk |   data   |  312T |  184T |
++--+---+---+

++
|  Standby MDS   |
++
| mds3-ceph2-qh2 |
| mds2-ceph2-qh2 |
++
MDS version: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
luminous (stable)

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io