[ceph-users] Nautilus power outage - 2/3 mons and mgrs dead and no cephfs

2019-10-11 Thread Alex L
Hi list,
We had a power outage that killed the whole cluster. CephFS will not start at all,
but RBD works just fine.
I did have 4 unfound objects that I eventually had to roll back or delete, which
I don't really understand, since I should have had copies of those objects on the
other drives?
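
For reference, this is roughly how I handled them, following the standard
unfound-objects procedure (a sketch from memory; <pgid> stands for each affected
PG as reported by ceph health detail, and I used revert where an older version
existed, delete otherwise):

# ceph health detail
# ceph pg <pgid> list_unfound
# ceph pg <pgid> mark_unfound_lost revert
# ceph pg <pgid> mark_unfound_lost delete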

2 of 3 mons and mgrs are damaged, but without any errors.

I have a lot of data stored on CephFS, so getting that running again is my first
priority.
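
Happy to post more diagnostics if that helps, for example output from the
following (a sketch, assuming the default filesystem name; replace cephfs with
the actual fs name):

# ceph fs status
# ceph mds stat
# ceph health detail
# cephfs-journal-tool --rank=cephfs:0 journal inspect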

Thanks!
Alex

Info about the home cluster:
I run 23 OSDs on 3 hosts. 6 of these are an SSD cache layer for the spinning
rust; they also hold the CephFS metadata pool, which in retrospect might have to
be put back on the spinning rust.

# ceph -v
ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable)

# head ceph-mgr.pve21.log.7
2019-10-04 00:00:00.397 7fee56df3700 -1 received  signal: Hangup from pkill -1 
-x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw  (PID: 193052) UID: 0
2019-10-04 00:00:00.573 7fee44af1700  0 ms_deliver_dispatch: unhandled message 
0x55855f6b7500 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0 
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:00.573 7fee545ee700  1 mgr finish mon failed to return 
metadata for mds.pve21: (2) No such file or directory
2019-10-04 00:00:01.553 7fee43aef700  0 log_channel(cluster) log [DBG] : pgmap 
v2680: 1088 pgs: 1 active+clean+inconsistent, 4 
active+recovery_unfound+undersized+degraded+remapped, 1083 active+clean; 4.2 
TiB data, 13 TiB used, 15 TiB / 28 TiB avail; 5.7 KiB/s rd, 38 KiB/s wr, 4 
op/s; 12/3843345 objects degraded (0.000%); 4/1281115 objects unfound (0.000%)
2019-10-04 00:00:01.573 7fee44af1700  0 ms_deliver_dispatch: unhandled message 
0x55855e486380 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0 
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:01.573 7fee545ee700  1 mgr finish mon failed to return 
metadata for mds.pve21: (2) No such file or directory
2019-10-04 00:00:02.573 7fee44af1700  0 ms_deliver_dispatch: unhandled message 
0x55855e4b5500 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0 
v2:192.168.1.21:6800/3783320901
2019-10-04 00:00:02.573 7fee545ee700  1 mgr finish mon failed to return 
metadata for mds.pve21: (2) No such file or directory
2019-10-04 00:00:03.553 7fee43aef700  0 log_channel(cluster) log [DBG] : pgmap 
v2681: 1088 pgs: 1 active+clean+inconsistent, 4 
active+recovery_unfound+undersized+degraded+remapped, 1083 active+clean; 4.2 
TiB data, 13 TiB used, 15 TiB / 28 TiB avail; 4.7 KiB/s rd, 33 KiB/s wr, 2 
op/s; 12/3843345 objects degraded (0.000%); 4/1281115 objects unfound (0.000%)
2019-10-04 00:00:03.573 7fee44af1700  0 ms_deliver_dispatch: unhandled message 
0x55855e3b0380 mgrreport(mds.pve21 +110-0 packed 1366) v7 from mds.0 
v2:192.168.1.21:6800/3783320901

# head ceph-mon.pve21.log.7
2019-10-04 00:00:00.389 7f7c25b52700 -1 received  signal: Hangup from killall 
-q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  (PID: 193051) UID: 0
2019-10-04 00:00:00.397 7f7c25b52700 -1 received  signal: Hangup from pkill -1 
-x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw  (PID: 193052) UID: 0
2019-10-04 00:00:00.573 7f7c1f345700  0 mon.pve21@0(leader) e20 handle_command 
mon_command({"prefix": "mds metadata", "who": "pve21"} v 0) v1
2019-10-04 00:00:00.573 7f7c1f345700  0 log_channel(audit) log [DBG] : 
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21' cmd=[{"prefix": 
"mds metadata", "who": "pve21"}]: dispatch
2019-10-04 00:00:01.573 7f7c1f345700  0 mon.pve21@0(leader) e20 handle_command 
mon_command({"prefix": "mds metadata", "who": "pve21"} v 0) v1
2019-10-04 00:00:01.573 7f7c1f345700  0 log_channel(audit) log [DBG] : 
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21' cmd=[{"prefix": 
"mds metadata", "who": "pve21"}]: dispatch
2019-10-04 00:00:02.573 7f7c1f345700  0 mon.pve21@0(leader) e20 handle_command 
mon_command({"prefix": "mds metadata", "who": "pve21"} v 0) v1
2019-10-04 00:00:02.573 7f7c1f345700  0 log_channel(audit) log [DBG] : 
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21' cmd=[{"prefix": 
"mds metadata", "who": "pve21"}]: dispatch
2019-10-04 00:00:03.573 7f7c1f345700  0 mon.pve21@0(leader) e20 handle_command 
mon_command({"prefix": "mds metadata", "who": "pve21"} v 0) v1
2019-10-04 00:00:03.573 7f7c1f345700  0 log_channel(audit) log [DBG] : 
from='mgr.137464844 192.168.1.21:0/2201' entity='mgr.pve21' cmd=[{"prefix": 
"mds metadata", "who": "pve21"}]: dispatch


# head ceph-mds.pve21.log.7
2019-10-04 00:00:00.389 7f1b2f1b5700 -1 received  signal: Hangup from killall 
-q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  (PID: 193051) UID: 0
2019-10-04 00:00:00.397 7f1b2f1b5700 -1 received  signal: Hangup from  (PID: 
193052) UID: 0
2019-10-04 00:00:04.881 7f1b319ba700  0 --1- 
[v2:192.168.1.21:6800/3783320901,v1:192.168.1.21:6801/3783320901] >> 
v1:192.168.1.23:0/2770609702 conn(0x556f839bb200 0x556f838d4000 :6801 s=OPENED 
pgs=5 cs=3 l=0).fault server, going to 

[ceph-users] help

2019-10-11 Thread Jörg Kastning

On 11.10.2019 at 09:21, ceph-users-requ...@ceph.io wrote:

Send ceph-users mailing list submissions to
ceph-users@ceph.io

To subscribe or unsubscribe via email, send a message with subject or
body 'help' to
ceph-users-requ...@ceph.io

You can reach the person managing the list at
ceph-users-ow...@ceph.io

When replying, please edit your Subject line so it is more specific
than "Re: Contents of ceph-users digest..."

Today's Topics:

1. Re: MDS rejects clients causing hanging mountpoint on linux kernel client
   (Manuel Riel)
2. Re: HeartbeatMap FAILED assert(0 == "hit suicide timeout")
   (Janne Johansson)
3. Re: Nautilus: PGs stuck remapped+backfilling (Eugen Block)
4. Re: HeartbeatMap FAILED assert(0 == "hit suicide timeout") (潘东元)
5. Nautilus power outage - 2/3 mons and mgrs dead and no cephfs
   (Alex L)


--

Date: Thu, 10 Oct 2019 11:02:28 +0800
From: Manuel Riel 
Subject: [ceph-users] Re: MDS rejects clients causing hanging
mountpoint on linux kernel client
To: uker...@gmail.com
Cc: ceph-users@ceph.io
Message-ID: <3098dc39-aed4-44f0-b9cb-44b346828...@snapdragon.cc>

I noticed a similar issue tonight. Still looking into the details, but here
are the client logs I have:

Oct  9 19:27:59 mon5-cx kernel: libceph: mds0 ***:6800 socket closed (con state OPEN)
Oct  9 19:28:01 mon5-cx kernel: libceph: mds0 ***:6800 connection reset
Oct  9 19:28:01 mon5-cx kernel: libceph: reset on mds0
Oct  9 19:28:01 mon5-cx kernel: ceph: mds0 closed our session
Oct  9 19:28:01 mon5-cx kernel: ceph: mds0 reconnect start
Oct  9 19:28:01 mon5-cx kernel: ceph: mds0 reconnect denied
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 9109011c9980 1099517142146
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 91096cc788d0 1099517142307
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 9107da741f10 1099517142312
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 9109d5c40e60 1099517141612
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 9108c9337da0 1099517142313
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 9109d5c70340 1099517141565
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 910955acf810 1099517141792
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 91095ff56cf0 1099517142006
Oct  9 19:28:01 mon5-cx kernel: ceph:  dropping dirty+flushing Fw state for 91096cc7f280 1099517142309
Oct  9 19:28:01 mon5-cx kernel: libceph: mds0 ***:6800 socket closed (con state NEGOTIATING)
Oct  9 19:28:02 mon5-cx kernel: ceph: mds0 rejected session
Oct  9 19:28:02 mon5-cx monit: Lookup for '/srv/repos' filesystem failed -- not found in /proc/self/mounts
Oct  9 19:28:02 mon5-cx monit: Filesystem '/srv/repos' not mounted
Oct  9 19:28:02 mon5-cx monit: 'repos' unable to read filesystem '/srv/repos' state
...
Oct  9 19:28:09 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffe) null i_snap_realm
Oct  9 19:28:24 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffe) null i_snap_realm
Oct  9 19:28:39 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffe) null i_snap_realm
...
Oct  9 21:27:09 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffe) null i_snap_realm
Oct  9 21:27:24 mon5-cx kernel: ceph: get_quota_realm: ino (1.fffe) null i_snap_realm
Oct  9 21:27:27 mon5-cx monit: Lookup for '/srv/repos' filesystem failed -- not found in /proc/self/mounts
Oct  9 21:27:27 mon5-cx monit: Filesystem '/srv/repos' not mounted
Oct  9 21:27:27 mon5-cx monit: 'repos' unable to read filesystem '/srv/repos' state
Oct  9 21:27:27 mon5-cx monit: 'repos' trying to restart



[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Frank Schilder
Your metadata PGs *are* backfilling. It is the "61 keys/s" figure in the
recovery I/O line of the ceph status output. If this is too slow, increase
osd_max_backfills and osd_recovery_max_active.

Or just have some coffee ...
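
For example, on Nautilus something along these lines (just a sketch; pick values
your SSDs can handle and set them back afterwards):

# ceph config set osd osd_max_backfills 4
# ceph config set osd osd_recovery_max_active 8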

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 10 October 2019 14:54:37
To: ceph-users@ceph.io
Subject: [ceph-users] Nautilus: PGs stuck remapped+backfilling

Hi all,

I have a strange issue with backfilling and I'm not sure what the cause is.
It's a Nautilus cluster (upgraded) with an SSD cache tier for OpenStack and
the CephFS metadata residing on the same SSDs; there were three SSDs in total.
Today I added two new NVMe SSDs (osd.15, osd.16) so that I can shut off one
old server that has only one SSD OSD left (osd.20).
Setting the crush weight of osd.20 to 0 (and adjusting the weight of
the remaining SSDs for an even distribution) leaves 3 PGs in
active+remapped+backfilling state. I don't understand why the
remaining PGs aren't backfilling; the crush rule is quite simple (all
SSD pools are replicated with size 3). The backfilling PGs are all
from the cephfs_metadata pool. Although there are 4 SSDs for 3
replicas, the backfilling should still finish, right?
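
For completeness, the reweighting was done roughly like this (weights as in the
osd tree below):

ceph01:~ # ceph osd crush reweight osd.20 0
ceph01:~ # ceph osd crush reweight osd.15 0.45409
ceph01:~ # ceph osd crush reweight osd.16 0.45409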

Can anyone share their thoughts why 3 PGs can't be recovered? If more
information about the cluster is required please let me know.

Regards,
Eugen


ceph01:~ # ceph osd pool ls detail | grep meta
pool 36 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 16 pgp_num 16 last_change 283362 flags
hashpspool,nodelete,nodeep-scrub stripe_width 0 application cephfs


ceph01:~ # ceph pg dump | grep remapp
dumped all
36.b  28306  00 28910   0
8388608   101408323 219497 3078 3078
active+remapped+backfilling 2019-10-10 13:36:27.427527
284595'98565869  284595:254216941 [15,16,9] 15
 [20,9,10] 20  284427'98489406 2019-10-10
00:16:02.682911  284089'98003598 2019-10-06 16:03:27.558267
  0
36.d  28087  00 25327   0
26375382   106722204 231020 3041 3041
active+remapped+backfilling 2019-10-10 13:36:27.404739
284595'97933905  284595:252878816 [16,15,9] 16
 [20,9,10] 20  284427'97887652 2019-10-10
04:13:29.371905  284259'97502135 2019-10-07 20:06:43.304593
  0
36.4  28060  00 28406   0
8389242   104059103 225188 3061 3061
active+remapped+backfilling 2019-10-10 13:36:27.440390
284595'105299618  284595:312976619 [16,9,15] 16
  [20,9,10] 20 284427'105218591 2019-10-10
00:18:07.924006 284089'104696098 2019-10-06 16:20:17.123149
  0


rule ssd_ruleset {
 id 1
 type replicated
 min_size 1
 max_size 10
 step take default class ssd
 step chooseleaf firstn 0 type host
 step emit
}

This is the relevant part of the osd tree:

ceph01:~ #  ceph osd tree
ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
  -1   34.21628 root default
-31   11.25406 host ceph01
  25   hdd  3.5 osd.25up  1.0 1.0
  26   hdd  3.5 osd.26up  1.0 1.0
  27   hdd  3.5 osd.27up  1.0 1.0
  15   ssd  0.45409 osd.15up  1.0 1.0
-34   11.25406 host ceph02
   0   hdd  3.5 osd.0 up  1.0 1.0
  28   hdd  3.5 osd.28up  1.0 1.0
  29   hdd  3.5 osd.29up  1.0 1.0
  16   ssd  0.45409 osd.16up  1.0 1.0
-37   10.7 host ceph03
  31   hdd  3.5 osd.31up  1.0 1.0
  32   hdd  3.5 osd.32up  1.0 1.0
  33   hdd  3.5 osd.33up  1.0 1.0
-240.45409 host san01-ssd
  10   ssd  0.45409 osd.10up  1.0 1.0
-230.45409 host san02-ssd
   9   ssd  0.45409 osd.9 up  1.0 1.0
-22  0 host san03-ssd
  20   ssd0 osd.20up  1.0 1.0


Don't be confused because of the '-ssd' suffix, we're using crush
location hooks.
This is the current PG distribution on the SSDs:

ceph01:~ # ceph osd df | grep -E "^15 |^16 |^ 9|^10 |^20 "
15   ssd 0.45409  1.0 465 GiB  34 GiB  32 GiB 1.2 GiB  857 MiB 431
GiB  7.29 0.22  27 up
16   ssd 0.45409  1.0 465 GiB  37 GiB  34 GiB 1.5 GiB  964 MiB 428
GiB  7.87 0.23  31 up
10   ssd 0.45409  1.0 745 GiB  27 GiB  25 GiB 1.7 GiB  950 MiB 718
GiB  3.65 0.11  29 up
  9   ssd 0.45409  1.0 745 GiB  34 GiB  32 GiB 1.3 GiB  902 MiB
711 GiB  4.60 0.14  30 up
20   ssd   0  1.0 894 GiB 8.2 GiB 4.3 GiB 1.5 GiB  2.4 GiB 886
GiB

[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Eugen Block
You meta data PGs *are* backfilling. It is the "61 keys/s" statement  
in the ceph status output in the recovery I/O line. If this is too  
slow, increase osd_max_backfills and osd_recovery_max_active.


Or just have some coffee ...



I already had increased osd_max_backfills and osd_recovery_max_active
in order to speed things up, and most of the PGs were remapped pretty
quickly (a couple of minutes), but these last 3 PGs took almost two hours
to complete, which was unexpected.



Zitat von Frank Schilder :

You meta data PGs *are* backfilling. It is the "61 keys/s" statement  
in the ceph status output in the recovery I/O line. If this is too  
slow, increase osd_max_backfills and osd_recovery_max_active.


Or just have some coffee ...

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 10 October 2019 14:54:37
To: ceph-users@ceph.io
Subject: [ceph-users] Nautilus: PGs stuck remapped+backfilling

Hi all,

I have a strange issue with backfilling and I'm not sure what the cause is.
It's a Nautilus cluster (upgraded) that has an SSD cache tier for
OpenStack and CephFS metadata residing on the same SSDs, there were
three SSDs in total.
Today I added two new SSDs (NVMe) (osd.15, osd.16) to be able to
shutoff one old server that has only one SSD-OSD left (osd.20).
Setting the crush weight of osd.20 to 0 (and adjusting the weight of
the remaining SSDs for an even distribution) leaves 3 PGs in
active+remapped+backfilling state. I don't understand why the
remaining PGs aren't backfilling, the crush rule is quite simple (all
ssd pools are replicated with size 3). The backfilling PGs are all
from the cephfs_metadata pool. Although there are 4 SSDs for 3
replicas the backfilling still should finish, right?

Can anyone share their thoughts why 3 PGs can't be recovered? If more
information about the cluster is required please let me know.

Regards,
Eugen


ceph01:~ # ceph osd pool ls detail | grep meta
pool 36 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 16 pgp_num 16 last_change 283362 flags
hashpspool,nodelete,nodeep-scrub stripe_width 0 application cephfs


ceph01:~ # ceph pg dump | grep remapp
dumped all
36.b  28306  00 28910   0
8388608   101408323 219497 3078 3078
active+remapped+backfilling 2019-10-10 13:36:27.427527
284595'98565869  284595:254216941 [15,16,9] 15
 [20,9,10] 20  284427'98489406 2019-10-10
00:16:02.682911  284089'98003598 2019-10-06 16:03:27.558267
  0
36.d  28087  00 25327   0
26375382   106722204 231020 3041 3041
active+remapped+backfilling 2019-10-10 13:36:27.404739
284595'97933905  284595:252878816 [16,15,9] 16
 [20,9,10] 20  284427'97887652 2019-10-10
04:13:29.371905  284259'97502135 2019-10-07 20:06:43.304593
  0
36.4  28060  00 28406   0
8389242   104059103 225188 3061 3061
active+remapped+backfilling 2019-10-10 13:36:27.440390
284595'105299618  284595:312976619 [16,9,15] 16
  [20,9,10] 20 284427'105218591 2019-10-10
00:18:07.924006 284089'104696098 2019-10-06 16:20:17.123149
  0


rule ssd_ruleset {
 id 1
 type replicated
 min_size 1
 max_size 10
 step take default class ssd
 step chooseleaf firstn 0 type host
 step emit
}

This is the relevant part of the osd tree:

ceph01:~ #  ceph osd tree
ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
  -1   34.21628 root default
-31   11.25406 host ceph01
  25   hdd  3.5 osd.25up  1.0 1.0
  26   hdd  3.5 osd.26up  1.0 1.0
  27   hdd  3.5 osd.27up  1.0 1.0
  15   ssd  0.45409 osd.15up  1.0 1.0
-34   11.25406 host ceph02
   0   hdd  3.5 osd.0 up  1.0 1.0
  28   hdd  3.5 osd.28up  1.0 1.0
  29   hdd  3.5 osd.29up  1.0 1.0
  16   ssd  0.45409 osd.16up  1.0 1.0
-37   10.7 host ceph03
  31   hdd  3.5 osd.31up  1.0 1.0
  32   hdd  3.5 osd.32up  1.0 1.0
  33   hdd  3.5 osd.33up  1.0 1.0
-240.45409 host san01-ssd
  10   ssd  0.45409 osd.10up  1.0 1.0
-230.45409 host san02-ssd
   9   ssd  0.45409 osd.9 up  1.0 1.0
-22  0 host san03-ssd
  20   ssd0 osd.20up  1.0 1.0


Don't be confused because of the '-ssd' suffix, we're using crush
location hooks.
This is the current PG distribution on the SSDs:

c

[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Frank Schilder
I did a lot of data movement lately, and my observation is that backfill is
very fast (high bandwidth and many thousand keys/s) as long as it is
many-to-many between OSDs. The number of OSDs participating slowly decreases
over time until there is only 1 disk left that is written to. This becomes
really slow, because the recovery options are tuned to keep all-to-all traffic
under control.

In such a case, you might want to temporarily increase these numbers to
something really high (not 10 or 20, but 1000 or 2000; increase in steps) until
the single-disk write phase is over, and then set them back again. With SSDs
this should be OK.
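
As a sketch (injectargs changes are runtime-only, so they are easy to revert;
the second line restores the Nautilus defaults):

# ceph tell osd.* injectargs '--osd_max_backfills 1000 --osd_recovery_max_active 1000'
# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3'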

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 11 October 2019 10:24
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Nautilus: PGs stuck remapped+backfilling

> You meta data PGs *are* backfilling. It is the "61 keys/s" statement
> in the ceph status output in the recovery I/O line. If this is too
> slow, increase osd_max_backfills and osd_recovery_max_active.
>
> Or just have some coffee ...


I already had increased osd_max_backfills and osd_recovery_max_active
in order to speed things up, and most of the PGs were remapped pretty
quick (couple of minutes), but these last 3 PGs took almost two hours
to complete, which was unexpected.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Eugen Block
Yeah, we also noticed decreasing recovery speed when it comes to the last
PGs, but we never came up with a theory. I think your explanation makes
sense. Next time I'll try with much higher values, thanks for sharing
that.


Regards,
Eugen


Zitat von Frank Schilder :

I did a lot of data movement lately and my observation is, that  
backfill is very fast (high bandwidth and many thousand keys/s) as  
long as this is many-to-many OSDs. The number of OSD participating  
slowly decreases over time until there is only 1 disk left that is  
written to. This becomes really slow, because the recovery options  
are for keeping all-to-all under control.


In such a case, you might want to temporarily increase these numbers  
to something really high (not 10 or 20, but 1000 or 2000; increase  
in steps) until the single-disk write is over and then set it back  
again. With SSD this should be OK.


Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: 11 October 2019 10:24
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Nautilus: PGs stuck remapped+backfilling


You meta data PGs *are* backfilling. It is the "61 keys/s" statement
in the ceph status output in the recovery I/O line. If this is too
slow, increase osd_max_backfills and osd_recovery_max_active.

Or just have some coffee ...



I already had increased osd_max_backfills and osd_recovery_max_active
in order to speed things up, and most of the PGs were remapped pretty
quick (couple of minutes), but these last 3 PGs took almost two hours
to complete, which was unexpected.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Anthony D'Atri
Parallelism.  The backfill/recovery tunables control how many recovery ops a 
given OSD will perform.  If you’re adding a new OSD, naturally it is the 
bottleneck.  For other forms of data movement, early on one has multiple OSDs 
reading and writing independently.  Toward the end, increasingly fewer OSDs 
still have work to do, so there’s a long tail as they complete their queues.

> On Oct 11, 2019, at 4:42 AM, Eugen Block  wrote:
> 
> Yeah we also noticed decreasing recovery speed if it comes to the last PGs, 
> but we never put up a theory. I think your explanation makes sense. Next time 
> I'll try with much higher values, thanks for sharing that.
> 
> Regards,
> Eugen
> 
> 
> Zitat von Frank Schilder :
> 
>> I did a lot of data movement lately and my observation is, that backfill is 
>> very fast (high bandwidth and many thousand keys/s) as long as this is 
>> many-to-many OSDs. The number of OSD participating slowly decreases over 
>> time until there is only 1 disk left that is written to. This becomes really 
>> slow, because the recovery options are for keeping all-to-all under control.
>> 
>> In such a case, you might want to temporarily increase these numbers to 
>> something really high (not 10 or 20, but 1000 or 2000; increase in steps) 
>> until the single-disk write is over and then set it back again. With SSD 
>> this should be OK.
>> 
>> Best regards,
>> 
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: Eugen Block 
>> Sent: 11 October 2019 10:24
>> To: Frank Schilder
>> Cc: ceph-users@ceph.io
>> Subject: Re: [ceph-users] Nautilus: PGs stuck remapped+backfilling
>> 
>>> You meta data PGs *are* backfilling. It is the "61 keys/s" statement
>>> in the ceph status output in the recovery I/O line. If this is too
>>> slow, increase osd_max_backfills and osd_recovery_max_active.
>>> 
>>> Or just have some coffee ...
>> 
>> 
>> I already had increased osd_max_backfills and osd_recovery_max_active
>> in order to speed things up, and most of the PGs were remapped pretty
>> quick (couple of minutes), but these last 3 PGs took almost two hours
>> to complete, which was unexpected.
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus: PGs stuck remapped+backfilling

2019-10-11 Thread Anthony D'Atri
Very large omaps can take quite a while.

> 
>> You meta data PGs *are* backfilling. It is the "61 keys/s" statement in the 
>> ceph status output in the recovery I/O line. If this is too slow, increase 
>> osd_max_backfills and osd_recovery_max_active.
>> 
>> Or just have some coffee ...
> 
> 
> I already had increased osd_max_backfills and osd_recovery_max_active in 
> order to speed things up, and most of the PGs were remapped pretty quick 
> (couple of minutes), but these last 3 PGs took almost two hours to complete, 
> which was unexpected.
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RadosGW max worker threads

2019-10-11 Thread Benjamin . Zieglmeier
Hello all,

Looking for guidance on the recommended highest setting (or input on 
experiences from users who have a high setting) for rgw_thread_pool_size. We 
are running multiple Luminous 12.2.11 clusters with usually 3-4 RGW daemons in 
front of them. We set our rgw_thread_pool_size at 512 out of the gate, and run 
civetweb. We had occasional service outages in one of our clusters this week 
and determined the rgws were running out of available threads to handle 
requests. We doubled our thread pool size to 1024 on each rgw and everything 
has been ok so far.

What, if any, would be the high-end limit to set for rgw_thread_pool_size? I've
been unable to find anything in the documentation or on the user list that
mentions anything higher than the default of 100 threads.

Thanks,
Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW max worker threads

2019-10-11 Thread Paul Emmerich
you probably want to increase the number of civetweb threads; that's a
parameter for civetweb in the rgw_frontends configuration (IIRC it's
threads=xyz).

Also, consider upgrading and using Beast; it's so much better for RGW
setups that get lots of requests.
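
e.g. something like this in ceph.conf (just a sketch; the section name is an
example, and the civetweb key may be num_threads rather than threads, check the
docs for your release):

[client.rgw.rgw1]
rgw_frontends = "civetweb port=7480 num_threads=1024"

or with Beast on newer releases:

[client.rgw.rgw1]
rgw_frontends = "beast port=7480"
rgw_thread_pool_size = 1024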

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Oct 11, 2019 at 10:02 PM Benjamin.Zieglmeier
 wrote:
>
> Hello all,
>
>
>
> Looking for guidance on the recommended highest setting (or input on 
> experiences from users who have a high setting) for rgw_thread_pool_size. We 
> are running multiple Luminous 12.2.11 clusters with usually 3-4 RGW daemons 
> in front of them. We set our rgw_thread_pool_size at 512 out of the gate, and 
> run civetweb. We had occasional service outages in one of our clusters this 
> week and determined the rgws were running out of available threads to handle 
> requests. We doubled our thread pool size to 1024 on each rgw and everything 
> has been ok so far.
>
>
>
> What, if any, would be the high-end limit to set for rgw_thread_pool_size? 
> I’ve been unable to find anything in the documentation or the user list that 
> depicts anything higher than the default 100 threads.
>
>
>
> Thanks,
>
> Ben
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW max worker threads

2019-10-11 Thread Paul Emmerich
Which defaults to rgw_thread_pool_size, so yeah, you can adjust that option.

To answer your actual question: we've run civetweb with 1024 threads
with no problems related to the number of threads.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Oct 11, 2019 at 10:50 PM Paul Emmerich  wrote:
>
> you probably want to increase the number of civetweb threads, that's a
> parameter for civetweb in the rgw_frontends configuration (IIRC it's
> threads=xyz)
>
> Also, consider upgrading and use Beast, it's so much better for rgw
> setups that get lots of requests.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Fri, Oct 11, 2019 at 10:02 PM Benjamin.Zieglmeier
>  wrote:
> >
> > Hello all,
> >
> >
> >
> > Looking for guidance on the recommended highest setting (or input on 
> > experiences from users who have a high setting) for rgw_thread_pool_size. 
> > We are running multiple Luminous 12.2.11 clusters with usually 3-4 RGW 
> > daemons in front of them. We set our rgw_thread_pool_size at 512 out of the 
> > gate, and run civetweb. We had occasional service outages in one of our 
> > clusters this week and determined the rgws were running out of available 
> > threads to handle requests. We doubled our thread pool size to 1024 on each 
> > rgw and everything has been ok so far.
> >
> >
> >
> > What, if any, would be the high-end limit to set for rgw_thread_pool_size? 
> > I’ve been unable to find anything in the documentation or the user list 
> > that depicts anything higher than the default 100 threads.
> >
> >
> >
> > Thanks,
> >
> > Ben
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW max worker threads

2019-10-11 Thread Anthony D'Atri
We're running with 2000, FWIW.

> On Oct 11, 2019, at 2:02 PM, Paul Emmerich  wrote:
> 
> Which defaults to rgw_thread_pool_size, so yeah, you can adjust that option.
> 
> To answer your actual question: we've run civetweb with 1024 threads
> with no problems related to the number of threads.
> 
> 
> Paul
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> 
> On Fri, Oct 11, 2019 at 10:50 PM Paul Emmerich  wrote:
>> 
>> you probably want to increase the number of civetweb threads, that's a
>> parameter for civetweb in the rgw_frontends configuration (IIRC it's
>> threads=xyz)
>> 
>> Also, consider upgrading and use Beast, it's so much better for rgw
>> setups that get lots of requests.
>> 
>> Paul
>> 
>> --
>> Paul Emmerich
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>> 
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>> 
>> On Fri, Oct 11, 2019 at 10:02 PM Benjamin.Zieglmeier
>>  wrote:
>>> 
>>> Hello all,
>>> 
>>> 
>>> 
>>> Looking for guidance on the recommended highest setting (or input on 
>>> experiences from users who have a high setting) for rgw_thread_pool_size. 
>>> We are running multiple Luminous 12.2.11 clusters with usually 3-4 RGW 
>>> daemons in front of them. We set our rgw_thread_pool_size at 512 out of the 
>>> gate, and run civetweb. We had occasional service outages in one of our 
>>> clusters this week and determined the rgws were running out of available 
>>> threads to handle requests. We doubled our thread pool size to 1024 on each 
>>> rgw and everything has been ok so far.
>>> 
>>> 
>>> 
>>> What, if any, would be the high-end limit to set for rgw_thread_pool_size? 
>>> I’ve been unable to find anything in the documentation or the user list 
>>> that depicts anything higher than the default 100 threads.
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Ben
>>> 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RadosGW max worker threads

2019-10-11 Thread JC Lopez
Hi All,

I'm currently running some tests and have run with up to 2048 without any problems.

As per the code, here is what it says:

#ifndef MAX_WORKER_THREADS
#define MAX_WORKER_THREADS (1024 * 64)
#endif

This value was introduced via 
https://github.com/ceph/civetweb/commit/8a07012185851b8e8be180391866b5995c10ee93

Regards
JC

> On Oct 11, 2019, at 15:09, Anthony D'Atri  wrote:
> 
> We’ve running with 2000 fwiw.
> 
>> On Oct 11, 2019, at 2:02 PM, Paul Emmerich  wrote:
>> 
>> Which defaults to rgw_thread_pool_size, so yeah, you can adjust that option.
>> 
>> To answer your actual question: we've run civetweb with 1024 threads
>> with no problems related to the number of threads.
>> 
>> 
>> Paul
>> 
>> -- 
>> Paul Emmerich
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>> 
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>> 
>> On Fri, Oct 11, 2019 at 10:50 PM Paul Emmerich  
>> wrote:
>>> 
>>> you probably want to increase the number of civetweb threads, that's a
>>> parameter for civetweb in the rgw_frontends configuration (IIRC it's
>>> threads=xyz)
>>> 
>>> Also, consider upgrading and use Beast, it's so much better for rgw
>>> setups that get lots of requests.
>>> 
>>> Paul
>>> 
>>> --
>>> Paul Emmerich
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>> 
>>> croit GmbH
>>> Freseniusstr. 31h
>>> 81247 München
>>> www.croit.io
>>> Tel: +49 89 1896585 90
>>> 
>>> On Fri, Oct 11, 2019 at 10:02 PM Benjamin.Zieglmeier
>>>  wrote:
 
 Hello all,
 
 
 
 Looking for guidance on the recommended highest setting (or input on 
 experiences from users who have a high setting) for rgw_thread_pool_size. 
 We are running multiple Luminous 12.2.11 clusters with usually 3-4 RGW 
 daemons in front of them. We set our rgw_thread_pool_size at 512 out of 
 the gate, and run civetweb. We had occasional service outages in one of 
 our clusters this week and determined the rgws were running out of 
 available threads to handle requests. We doubled our thread pool size to 
 1024 on each rgw and everything has been ok so far.
 
 
 
 What, if any, would be the high-end limit to set for rgw_thread_pool_size? 
 I’ve been unable to find anything in the documentation or the user list 
 that depicts anything higher than the default 100 threads.
 
 
 
 Thanks,
 
 Ben
 
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io