[ceph-users] one osd down / rgw daemon won't start

2020-11-20 Thread Bernhard Krieger

Hello,

today I came across some strange behaviour.
After stopping an OSD, I am not able to restart or stop/start a radosgw
daemon.

The boot process gets stuck until I have started the OSD again.


Specs:
3 ceph nodes
2 radosgw
nautilus 14.2.13
CentOS7


Steps:
* stopping the radosgw daemon on the rgw host
* stopping one OSD on a ceph node
* starting the radosgw daemon on the rgw host
* the rgw daemon gets stuck in the boot process

2020-11-20 09:58:35.412 7f8a0b7c4900  0 framework: civetweb
2020-11-20 09:58:35.412 7f8a0b7c4900  0 framework conf key: port, val: 
10.220.196.31:80
2020-11-20 09:58:35.412 7f8a0b7c4900  0 framework conf key: num_threads, 
val: 100
2020-11-20 09:58:35.412 7f8a0b7c4900  0 deferred set uid:gid to 167:167 
(ceph:ceph)
2020-11-20 09:58:35.412 7f8a0b7c4900  0 ceph version 14.2.13 
(1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable), process 
radosgw, pid 8145

--> stuck here / no further log entries

* As soon as I start the OSD, the boot sequence continues and
everything works.


2020-11-20 09:58:35.412 7f8a0b7c4900  0 framework: civetweb
2020-11-20 09:58:35.412 7f8a0b7c4900  0 framework conf key: port, val: 
10.220.196.31:80
2020-11-20 09:58:35.412 7f8a0b7c4900  0 framework conf key: num_threads, 
val: 100
2020-11-20 09:58:35.412 7f8a0b7c4900  0 deferred set uid:gid to 167:167 
(ceph:ceph)
2020-11-20 09:58:35.412 7f8a0b7c4900  0 ceph version 14.2.13 
(1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable), process 
radosgw, pid 8145

2020-11-20 10:00:23.895 7f8a0b7c4900  0 starting handler: civetweb
2020-11-20 10:00:23.913 7f8a0b7c4900  1 mgrc service_daemon_register 
rgw.rgw1 metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph 
version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus 
(stable),ceph_version_short=14.2.13,cpu=Intel Xeon E3-12xx v2 (Ivy 
Bridge, IBRS),distro=centos,distro_description=CentOS Linux 7 
(Core),distro_version=7,frontend_config#0=civetweb port=10.220.196.31:80 
num_threads=100,frontend_type#0=civetweb,hostname=rgw1,kernel_description=#1 
SMP Tue Oct 20 16:53:08 UTC 
2020,kernel_version=3.10.0-1160.2.2.el7.x86_64,mem_swap_kb=1048572,mem_total_kb=1881836,num_handles=1,os=Linux,pid=8145,zone_id=ecf200a8-2c4a-4c96-96d8-4fcff5b2c8c3,zone_name=default,zonegroup_id=01d11ed1-6157-4c26-addf-ecba49820e20,zonegroup_name=default}
2020-11-20 10:00:25.551 7f89d3aac700  1 civetweb: 0x557f6a6f2000: 
10.220.199.4 - - [20/Nov/2020:10:00:25 +0100] "GET / HTTP/1.0" 200 416 - -
2020-11-20 10:00:25.582 7f89d3aac700  1 civetweb: 0x557f6a6f2000: 
10.220.199.3 - - [20/Nov/2020:10:00:25 +0100] "GET / HTTP/1.0" 200 416 - -
2020-11-20 10:00:27.555 7f89d3aac700  1 civetweb: 0x557f6a6f2000: 
10.220.199.4 - - [20/Nov/2020:10:00:27 +0100] "GET / HTTP/1.0" 200 416 - -
2020-11-20 10:00:27.586 7f89d3aac700  1 civetweb: 0x557f6a6f2000: 
10.220.199.3 - - [20/Nov/2020:10:00:27 +0100] "GET / HTTP/1.0" 200 416 - -




regards
Bernhard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: one osd down / rgw daemon won't start

2020-11-20 Thread Janne Johansson
On Fri, 20 Nov 2020 at 10:17, Bernhard Krieger wrote:

> Hello,
>
> today I came across some strange behaviour.
> After stopping an OSD, I am not able to restart or stop/start a radosgw
> daemon.
> The boot process gets stuck until I have started the OSD again.
>
>
> Specs:
> 3 ceph nodes
>

What is the ceph status while one OSD is down? If this makes all PGs
inactive or worse, then any client that wants to write anything on the
cluster will block until they are writable again.
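
For example, while the OSD is down something like this will show whether PGs
went inactive and whether a pool's min_size explains the blocking (a rough
sketch):

ceph -s                       # overall health and degraded/inactive PG counts
ceph health detail            # names the affected PGs and pools
ceph pg dump_stuck inactive   # lists PGs that are currently not active
ceph osd pool ls detail       # check size/min_size per pool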


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS error: currently failed to rdlock, waiting. clients crashing and evicted

2020-11-20 Thread norman

Thomas,

This is controlled by the MDS option mds_cap_revoke_eviction_timeout
(300s by default). If a client has crashed or been hung for a long time,
the cluster will evict it.

This prevents other clients from hanging while waiting for locks. If you
know the client will recover later, you can set it to zero.
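
For example (a rough sketch; this uses the centralized config store, but you
can equally set the option under [mds] in ceph.conf and restart the MDS):

# never evict clients for unanswered cap revokes (0 disables this behaviour)
ceph config set mds mds_cap_revoke_eviction_timeout 0

# or just raise the timeout, e.g. to 15 minutes
ceph config set mds mds_cap_revoke_eviction_timeout 900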

Hope this helps.

Yours, Norman

On 18/11/2020 6:49 AM, Thomas Hukkelberg wrote:

Hi all!

Hopefully some of you can shed some light on this. We have big problems with
samba crashing when macOS smb clients access certain/random folders/files over
vfs_ceph.

When browsing the cephfs folder in question directly on a ceph node where
cephfs is mounted, we experience issues like slow directory listings. We
suspect that macOS fetching xattr metadata creates a lot of traffic, but it
should not lock up the cluster like this. In the logs we see both rdlocks and
wrlocks, but mostly rdlocks.

End clients experience spurious disconnects when the issue occurs, roughly up
to a handful of times a day. Is this a config issue? Have we hit a bug? It's
certainly not a feature :/

Any pointers on how to troubleshoot or rectify this problem are most welcome.

ceph version 14.2.11
samba version 4.12.10-SerNet-Ubuntu-10.focal
Supermicro X11, Intel Silver 4110, 9 ceph nodes, 2x40GbE network, 150 OSD
spinners, NVMe db/journal

--

2020-11-17 22:09:07.525706 [WRN] evicting unresponsive client bo-samba-03 
(3887652779), after 301.746 seconds
2020-11-17 22:09:07.525580 [INF] Evicting (and blacklisting) client session 
3877970532 (10.40.30.133:0/3971626932)
2020-11-17 22:09:07.525536 [WRN] evicting unresponsive client bo-samba-03 
(3877970532), after 302.034 seconds
2020-11-17 22:07:23.915412 [INF] Cluster is now healthy
2020-11-17 22:07:23.915381 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 
MDSs report slow requests)
2020-11-17 22:07:23.915330 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE 
(was: 1 clients failing to respond to capability release)
2020-11-17 22:07:23.064492 [INF] MDS health message cleared (mds.?): 1 slow 
requests are blocked > 30 secs
2020-11-17 22:07:23.064457 [INF] MDS health message cleared (mds.?): Client 
bo-samba-03 failing to respond to capability release
2020-11-17 22:07:17.524023 [WRN] client.3887663354 isn't responding to 
mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, 
sent 63.325997 seconds ago
2020-11-17 22:07:17.523987 [INF] Evicting (and blacklisting) client session 
3887663354 (10.40.30.133:0/3230547239)
2020-11-17 22:07:17.523967 [WRN] evicting unresponsive client bo-samba-03 
(3887663354), after 64.5412 seconds
2020-11-17 22:07:17.523610 [WRN] slow request 63.325528 seconds old, received 
at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup 
#0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=39, 
caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:07:17.523596 [WRN] 1 slow requests, 1 included below; oldest blocked 
for > 63.325529 secs
2020-11-17 22:07:19.255177 [WRN] Health check failed: 1 clients failing to 
respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:07:12.523453 [WRN] 1 slow requests, 0 included below; oldest blocked 
for > 58.325433 secs
2020-11-17 22:07:07.523382 [WRN] 1 slow requests, 0 included below; oldest blocked 
for > 53.325362 secs
2020-11-17 22:07:02.523360 [WRN] 1 slow requests, 0 included below; oldest blocked 
for > 48.325307 secs
2020-11-17 22:06:57.523218 [WRN] 1 slow requests, 0 included below; oldest blocked 
for > 43.325199 secs
2020-11-17 22:06:52.523203 [WRN] 1 slow requests, 0 included below; oldest blocked 
for > 38.325158 secs
2020-11-17 22:06:47.523105 [WRN] slow request 33.325065 seconds old, received 
at 2020-11-17 22:06:14.197986: client_request(client.3878823430:4 lookup 
#0x100011f9a68/mappe uten navn 2020-11-17 22:06:14.197908 caller_uid=39, 
caller_gid=110513{}) currently failed to rdlock, waiting
2020-11-17 22:06:47.523100 [WRN] 1 slow requests, 1 included below; oldest blocked 
for > 33.325065 secs
2020-11-17 22:06:51.431745 [WRN] Health check failed: 1 MDSs report slow 
requests (MDS_SLOW_REQUEST)
2020-11-17 22:06:20.045030 [INF] Cluster is now healthy
2020-11-17 22:06:20.045008 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 
MDSs report slow requests)
2020-11-17 22:06:20.044960 [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE 
(was: 1 clients failing to respond to capability release)
2020-11-17 22:06:19.062307 [INF] MDS health message cleared (mds.?): 1 slow 
requests are blocked > 30 secs
2020-11-17 22:06:19.062253 [INF] MDS health message cleared (mds.?): Client 
bo-samba-03 failing to respond to capability release
2020-11-17 22:06:15.936150 [WRN] Health check failed: 1 clients failing to 
respond to capability release (MDS_CLIENT_LATE_RELEASE)
2020-11-17 22:06:12.522624 [WRN] client.3869410498 isn't responding to 
mclientcaps(revoke), ino 0x10001202b55 pending pAsLsXsFs issued pAsLsXsFsx, 
sent 64.045677 seconds ago


--thomas

--
Thomas H

[ceph-users] The serious side-effect of rbd cache setting

2020-11-20 Thread norman

Hi All,

We're testing the rbd cache settings for OpenStack (Ceph 14.2.5, BlueStore,
3-replica) and found an odd problem:


1. Setting librbd cache

[client]
rbd cache = true
rbd cache size = 16777216
rbd cache max dirty = 12582912
rbd cache target dirty = 8388608
rbd cache max dirty age = 1
rbd cache writethrough until flush = true

2. Running rbd bench

rbd -c /etc/ceph/ceph.conf \
    -k /etc/ceph/keyring2 \
    -n client.rbd-openstack-002 bench \
    --io-size 4K \
    --io-threads 1 \
    --io-pattern seq \
    --io-type read \
    --io-total 100G \
    openstack-volumes/image-you-can-drop-me
3. Start another test

rbd -c /etc/ceph/ceph.conf \
    -k /etc/ceph/keyring2 \
    -n client.rbd-openstack-002 bench \
    --io-size 4K \
    --io-threads 1 \
    --io-pattern rand \
    --io-type write \
    --io-total 100G \
    openstack-volumes/image-you-can-drop-me2

After running for a few minutes, I found that the read test almost hung for a while:

   69    152069   2375.21  9728858.72
   70    153627   2104.63  8620569.93
   71    155748   1956.04  8011953.10
   72    157665   1945.84  7970177.24
   73    159661   1947.64  7977549.44
   74    161522   1890.45  7743277.01
   75    163583   1991.04  8155301.58
   76    165791   2008.44  8226566.26
   77    168433   2153.43  8820438.66
   78    170269   2121.43  8689377.16
   79    172511   2197.62  9001467.33
   80    174845   2252.22  9225091.00
   81    177089   2259.42  9254579.83
   82    179675   2248.22  9208708.30
   83    182053   2356.61  9652679.11
   84    185087   2515.00  10301433.50
   99    185345    550.16  2253434.96
  101    185346    407.76  1670187.73
  102    185348    282.44  1156878.38
  103    185350    162.34  664931.53
  104    185353 12.86  52681.27
  105    185357  1.93   7916.89
  106    185361  2.74  11235.38
  107    185367  3.27  13379.95
  108    185375  5.08  20794.43
  109    185384  6.93  28365.91
  110    185403  9.19  37650.06
  111    185438 17.47  71544.17
  128    185467  4.94  20243.53
  129    185468  4.45  18210.82
  131    185469  3.89  15928.44
  132    185493  4.09  16764.16
  133    185529  4.16  17037.21
  134    185578 18.64  76329.67
  135    185631 27.78  113768.65

Why did this happen? This is unacceptable read performance.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] using fio tool in ceph development cluster (vstart.sh)

2020-11-20 Thread Bobby
Hi,

I am using the Ceph development cluster through the vstart.sh script. I would
like to measure/benchmark read and write performance (benchmark ceph at a
low level). For that, I want to use the fio tool.

Can we use fio on the development cluster? AFAIK, we can. I have seen
the fio option in the CMakeLists.txt of the Ceph source code.
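
For example, something along these lines should work against a vstart cluster
using fio's rbd ioengine (a rough sketch; the pool/image names are placeholders
and it assumes fio was built with rbd support):

# run from the build/ directory so librados finds the vstart config
export CEPH_CONF=$PWD/ceph.conf
./bin/ceph osd pool create fiobench 32
./bin/rbd pool init fiobench
./bin/rbd create fiobench/fio-test --size 10G

cat > rbd.fio <<'EOF'
[global]
ioengine=rbd
clientname=admin
pool=fiobench
rbdname=fio-test
bs=4k
runtime=60
time_based=1

[randread]
rw=randread
iodepth=32
EOF

fio rbd.fio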

Thanks in advance.

BR
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The serious side-effect of rbd cache setting

2020-11-20 Thread norman

If rbd cache = false and I run the same two tests, the read IOPS are
stable (this is a new cluster without load):

  109    274471   2319.41  9500308.72
  110    276846   2380.81  9751782.65
  111    278969   2431.40  9959023.39
  112    280924   2287.21  9368428.23
  113    282886   2227.82  9125145.62
  114    286130   2331.61  9550275.83
  115    289693   2569.19  10523406.25
  116    293161   2838.17  11625140.61
  117    296484   3111.75  12745715.04
  118    300068   3436.12  14074349.33
  119    302424   3258.53  13346958.90
  120    304442   2949.56  12081397.86
  121    306988   2765.18  11326156.91
  122    309867   2676.38  10962461.69
  123    312475   2481.20  10162987.53
  124    314957   2506.40  10266198.33
  125    317124   2536.19  10388249.19
  126    320239   2649.98  10854336.06
  127    323243   2674.98  10956727.73
  128    326688   2842.37  11642342.34
  129    328855   2779.37  11384315.33
  130    331414   2857.77  11705415.59
  131    333811   2714.18  7277.84
  132    336164   2583.99  10584022.02
  133    338664   2395.01  9809941.00
  134    341417   2512.20  10289953.14
  135    344409   2598.79  10644637.88
  136    347112   2659.98  10895292.68
  137    349486   2664.18  10912494.47
  138    351921   2651.18  10859250.80
  139    354592   2634.79  10792081.86
  140    357559   2629.79  10771603.52

On 20/11/2020 8:50 PM, Frank Schilder wrote:

Do you have test results for the same test without caching?

I have seen periodic stalls in any RBD IOP/s benchmark on ceph. The benchmarks 
create IO requests much faster than OSDs can handle them. At some point all 
queues run full and you start seeing slow ops on OSDs.

I would also prefer if IO activity was more steady and not so bursty, but for 
some reason IO client throttling is pushed to the clients instead of the 
internal OPS queueing system (ceph is collaborative, meaning a rogue 
un-collaborative client can screw it up for everyone).

If you know what your IO stack can handle without stalls, you can use libvirt 
QOS settings to limit clients with reasonable peak-load and steady-load 
settings.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: norman 
Sent: 20 November 2020 13:40:18
To: ceph-users
Subject: [ceph-users] The serious side-effect of rbd cache setting

Hi All,

We're testing the rbd cache setting for openstack(Ceph 14.2.5 Bluestore
3-replica), and an odd problem found:

1. Setting librbd cache

[client]

rbd cache = true

rbd cache size = 16777216

rbd cache max dirty = 12582912

rbd cache target dirty = 8388608

rbd cache max dirty age = 1

rbd cache writethrough until flush = true

2. Running rbd bench

rbd -c /etc/ceph/ceph.conf \
  -k /etc/ceph/keyring2 \
  -n client.rbd-openstack-002 bench \
  --io-size 4K \
  --io-threads 1 \
  --io-pattern seq \
  --io-type read \
  --io-total 100G \
  openstack-volumes/image-you-can-drop-me
3. Start another test

rbd -c /etc/ceph/ceph.conf \

  -k /etc/ceph/keyring2 \

  -n client.rbd-openstack-002 bench \

  --io-size 4K \

  --io-threads 1 \

  --io-pattern rand \

  --io-type write \

  --io-total 100G \

  openstack-volumes/image-you-can-drop-me2

Running for minutes, I found the read test almost hung for a while:

 69152069   2375.21  9728858.72
 70153627   2104.63  8620569.93
 71155748   1956.04  8011953.10
 72157665   1945.84  7970177.24
 73159661   1947.64  7977549.44
 74161522   1890.45  7743277.01
 75163583   1991.04  8155301.58
 76165791   2008.44  8226566.26
 77168433   2153.43  8820438.66
 78170269   2121.43  8689377.16
 79172511   2197.62  9001467.33
 80174845   2252.22  9225091.00
 81177089   2259.42  9254579.83
 82179675   2248.22  9208708.30
 83182053   2356.61  9652679.11
 84185087   2515.00  10301433.50
 99185345550.16  2253434.96
101185346407.76  1670187.73
102185348282.44  1156878.38
103185350162.34  664931.53
104185353 12.86  52681.27
105185357  1.93   7916.89
106185361  2.74  11235.38
107185367  3.27  13379.95
108185375  5.08  20794.43
109185384  6.93  28365.91
110185403  9.19  37650.06
111185438 17.47  71544.17
128185467  4.94  20243.53
129185468  4.45  18210.82
131185469  3.89  15928.44
132185493  4.09  16764.16
133185529  4.16  17037.21
134185578 18.64  76329.67
135185631 27.78  113768.65

Why this happened? It's a unacceptable performance for read.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___

[ceph-users] November Ceph Science User Group Virtual Meeting

2020-11-20 Thread Kevin Hrpcek

Hey all,

We will be having a Ceph science/research/big cluster call on Wednesday 
November 25th. If anyone wants to discuss something specific they can 
add it to the pad linked below. If you have questions or comments you 
can contact me.


This is an informal open call of community members mostly from 
hpc/htc/research environments where we discuss whatever is on our minds 
regarding ceph. Updates, outages, features, maintenance, etc...there is 
no set presenter but I do attempt to keep the conversation lively.


https://pad.ceph.com/p/Ceph_Science_User_Group_20201125 



We try to keep it to an hour or less.

Ceph calendar event details:

November 25th, 2020
15:00 UTC
4pm Central European
9am Central US

Description: Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: 
https://bluejeans.com/908675367?src=calendarLink

To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
    1.) Dial: 199.48.152.152 or bjn.vc
    2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
    1.) Dial one of the following numbers: 408-915-6466 (US)
    See all numbers: https://www.redhat.com/en/conference-numbers
    2.) Enter Meeting ID: 908675367
    3.) Press #
Want to test your video connection? https://bluejeans.com/111


Kevin

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The serious side-effect of rbd cache setting

2020-11-20 Thread Frank Schilder
Do you have test results for the same test without caching?

I have seen periodic stalls in any RBD IOP/s benchmark on ceph. The benchmarks 
create IO requests much faster than OSDs can handle them. At some point all 
queues run full and you start seeing slow ops on OSDs.
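
You can watch this happen while the benchmark runs (a rough sketch; the OSD id
is a placeholder):

ceph health detail                     # shows slow ops / slow requests once queues back up
ceph daemon osd.0 dump_ops_in_flight   # on the OSD host: ops currently queued on that OSD
ceph daemon osd.0 dump_historic_ops    # recently completed ops with their latencies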

I would also prefer if IO activity was more steady and not so bursty, but for 
some reason IO client throttling is pushed to the clients instead of the 
internal OPS queueing system (ceph is collaborative, meaning a rogue 
un-collaborative client can screw it up for everyone).

If you know what your IO stack can handle without stalls, you can use libvirt 
QOS settings to limit clients with reasonable peak-load and steady-load 
settings.
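
For example, with libvirt this can be applied per disk at runtime (a rough
sketch; the domain name, device and limits are placeholders to be tuned to what
your cluster sustains without slow ops):

# cap one virtual disk to 2000 IOPS and 100 MB/s, live and persisted in the XML
virsh blkdeviotune myvm vda \
    --total-iops-sec 2000 \
    --total-bytes-sec 104857600 \
    --live --config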

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: norman 
Sent: 20 November 2020 13:40:18
To: ceph-users
Subject: [ceph-users] The serious side-effect of rbd cache setting

Hi All,

We're testing the rbd cache setting for openstack(Ceph 14.2.5 Bluestore
3-replica), and an odd problem found:

1. Setting librbd cache

[client]

rbd cache = true

rbd cache size = 16777216

rbd cache max dirty = 12582912

rbd cache target dirty = 8388608

rbd cache max dirty age = 1

rbd cache writethrough until flush = true

2. Running rbd bench

rbd -c /etc/ceph/ceph.conf \
 -k /etc/ceph/keyring2 \
 -n client.rbd-openstack-002 bench \
 --io-size 4K \
 --io-threads 1 \
 --io-pattern seq \
 --io-type read \
 --io-total 100G \
 openstack-volumes/image-you-can-drop-me
3. Start another test

rbd -c /etc/ceph/ceph.conf \

 -k /etc/ceph/keyring2 \

 -n client.rbd-openstack-002 bench \

 --io-size 4K \

 --io-threads 1 \

 --io-pattern rand \

 --io-type write \

 --io-total 100G \

 openstack-volumes/image-you-can-drop-me2

Running for minutes, I found the read test almost hung for a while:

69152069   2375.21  9728858.72
70153627   2104.63  8620569.93
71155748   1956.04  8011953.10
72157665   1945.84  7970177.24
73159661   1947.64  7977549.44
74161522   1890.45  7743277.01
75163583   1991.04  8155301.58
76165791   2008.44  8226566.26
77168433   2153.43  8820438.66
78170269   2121.43  8689377.16
79172511   2197.62  9001467.33
80174845   2252.22  9225091.00
81177089   2259.42  9254579.83
82179675   2248.22  9208708.30
83182053   2356.61  9652679.11
84185087   2515.00  10301433.50
99185345550.16  2253434.96
   101185346407.76  1670187.73
   102185348282.44  1156878.38
   103185350162.34  664931.53
   104185353 12.86  52681.27
   105185357  1.93   7916.89
   106185361  2.74  11235.38
   107185367  3.27  13379.95
   108185375  5.08  20794.43
   109185384  6.93  28365.91
   110185403  9.19  37650.06
   111185438 17.47  71544.17
   128185467  4.94  20243.53
   129185468  4.45  18210.82
   131185469  3.89  15928.44
   132185493  4.09  16764.16
   133185529  4.16  17037.21
   134185578 18.64  76329.67
   135185631 27.78  113768.65

Why this happened? It's a unacceptable performance for read.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to reshard bucket

2020-11-20 Thread Timothy Geier
On Thu, 2020-11-19 at 13:38 -0500, Eric Ivancich wrote:
> Hey Timothy,
> 
> Did you ever resolve this issue, and if so, how?

Unfortunately, I was never able to resolve it; the bucket(s) in question had to 
be recreated and then removed.
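
For anyone who hits the same thing, a few read-only commands that can help
confirm whether a stale entry is still sitting in the reshard log (a rough
sketch; the bucket name is a placeholder):

radosgw-admin reshard list                   # pending entries in the reshard log
radosgw-admin reshard status --bucket=foo    # per-bucket resharding state
radosgw-admin reshard stale-instances list   # leftover bucket instances (Nautilus and later)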

> 
> > Thank you..I looked through both logs and noticed this in the
> > cancel one:
> > 
> > osd_op(unknown.0.0:4164 41.2
> > 41:55b0279d:reshard::reshard.09:head [call
> > rgw.reshard_remove] snapc 0=[] ondisk+write+known_if_redirected
> > e24984) v8 --
> > 0x7fe9b3625710 con 0
> > osd_op_reply(4164 reshard.09 [call] v24984'105796943
> > uv105796922 ondisk = -2
> > ((2) No such file or directory)) v8  162+0+0 (203651653 0 0)
> > 0x7fe9880044a0 con
> > 0x7fe9b3625b70
> > ERROR: failed to remove entry from reshard log,
> > oid=reshard.09 tenant= bucket=foo
> > 
> > Is there anything else that I should look for?  It looks like the
> > cancel process thinks
> > that reshard.09 is present (and probably blocking my
> > attempts at resharding) but
> > it's not actually there and thus can't be removed.
> 
> Eric
> --
> J. Eric Ivancich
> he / him / his
> Red Hat Storage
> Ann Arbor, Michigan, USA
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The serious side-effect of rbd cache setting

2020-11-20 Thread Frank Schilder
Hmm, so maybe your hardware is good enough that the cache is actually not
helping? This is not unheard of. I don't really see any improvement from
caching to begin with. On the other hand, a synthetic benchmark is not really a
test that utilises the good sides of a cache (in particular, write merges will
probably not occur). It would probably make more sense to run real VMs with
real workloads for a while and monitor latencies etc. over a longer period of
time.

Other than that, I only see that the cache size is quite small. You do 100G of
random operations on a 16M cache; the default is 32M. I would not expect
anything interesting from this ratio. A cache only makes sense if you have a
lot of cache hits.

In addition, the max_dirty and target_dirty values are really high
percentage-wise. This could defer a lot of operations for too long and result
in a cache flush blocking IO.

A larger cache size, smaller dirty targets and a benchmark that simulates a
realistic workload might be worth investigating.
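
A rough sketch of what that could look like via the centralized config store
(the sizes are purely illustrative, not tuned recommendations, and a local
ceph.conf [client] section would override them):

ceph config set client rbd_cache_size 67108864         # 64 MiB instead of 16 MiB
ceph config set client rbd_cache_max_dirty 16777216    # keep the dirty window well below the cache size
ceph config set client rbd_cache_target_dirty 8388608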

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: norman 
Sent: 20 November 2020 13:58:27
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] The serious side-effect of rbd cache setting

If the rbd cache = false,  and run the same two tests, the read iops is
stable(this is a new cluster without stress):

   109274471   2319.41  9500308.72
   110276846   2380.81  9751782.65
   111278969   2431.40  9959023.39
   112280924   2287.21  9368428.23
   113282886   2227.82  9125145.62
   114286130   2331.61  9550275.83
   115289693   2569.19  10523406.25
   116293161   2838.17  11625140.61
   117296484   3111.75  12745715.04
   118300068   3436.12  14074349.33
   119302424   3258.53  13346958.90
   120304442   2949.56  12081397.86
   121306988   2765.18  11326156.91
   122309867   2676.38  10962461.69
   123312475   2481.20  10162987.53
   124314957   2506.40  10266198.33
   125317124   2536.19  10388249.19
   126320239   2649.98  10854336.06
   127323243   2674.98  10956727.73
   128326688   2842.37  11642342.34
   129328855   2779.37  11384315.33
   130331414   2857.77  11705415.59
   131333811   2714.18  7277.84
   132336164   2583.99  10584022.02
   133338664   2395.01  9809941.00
   134341417   2512.20  10289953.14
   135344409   2598.79  10644637.88
   136347112   2659.98  10895292.68
   137349486   2664.18  10912494.47
   138351921   2651.18  10859250.80
   139354592   2634.79  10792081.86
   140357559   2629.79  10771603.52

On 20/11/2020 8:50 PM, Frank Schilder wrote:
> Do you have test results for the same test without caching?
>
> I have seen periodic stalls in any RBD IOP/s benchmark on ceph. The 
> benchmarks create IO requests much faster than OSDs can handle them. At some 
> point all queues run full and you start seeing slow ops on OSDs.
>
> I would also prefer if IO activity was more steady and not so bursty, but for 
> some reason IO client throttling is pushed to the clients instead of the 
> internal OPS queueing system (ceph is collaborative, meaning a rogue 
> un-collaborative client can screw it up for everyone).
>
> If you know what your IO stack can handle without stalls, you can use libvirt 
> QOS settings to limit clients with reasonable peak-load and steady-load 
> settings.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: norman 
> Sent: 20 November 2020 13:40:18
> To: ceph-users
> Subject: [ceph-users] The serious side-effect of rbd cache setting
>
> Hi All,
>
> We're testing the rbd cache setting for openstack(Ceph 14.2.5 Bluestore
> 3-replica), and an odd problem found:
>
> 1. Setting librbd cache
>
> [client]
>
> rbd cache = true
>
> rbd cache size = 16777216
>
> rbd cache max dirty = 12582912
>
> rbd cache target dirty = 8388608
>
> rbd cache max dirty age = 1
>
> rbd cache writethrough until flush = true
>
> 2. Running rbd bench
>
> rbd -c /etc/ceph/ceph.conf \
>   -k /etc/ceph/keyring2 \
>   -n client.rbd-openstack-002 bench \
>   --io-size 4K \
>   --io-threads 1 \
>   --io-pattern seq \
>   --io-type read \
>   --io-total 100G \
>   openstack-volumes/image-you-can-drop-me
> 3. Start another test
>
> rbd -c /etc/ceph/ceph.conf \
>
>   -k /etc/ceph/keyring2 \
>
>   -n client.rbd-openstack-002 bench \
>
>   --io-size 4K \
>
>   --io-threads 1 \
>
>   --io-pattern rand \
>
>   --io-type write \
>
>   --io-total 100G \
>
>   openstack-volumes/image-you-can-drop-me2
>
> Running for minutes, I found the read test almost hung for a while:
>
>  69152069   2375.21  9728858.72
>  70153627   2104.63  8620569.93
>  71155748   1956.04  8011953.10

[ceph-users] Multisite design details

2020-11-20 Thread Girish Aher
Hello All,

I am looking to understand some of the internal details of how multisite is
architected. On the Ceph users list, I see mentions of metadata logs, bucket
index shard logs, etc., but I could not find documentation anywhere on how
multisite works using these.

Could someone please point me in the right direction here? Apart from the
code, is there any resource that could help me with understanding the
multisite internals?


--Girish
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Documentation of older Ceph version not accessible anymore on docs.ceph.com

2020-11-20 Thread Dan Mick

On 11/14/2020 10:56 AM, Martin Palma wrote:

Hello,

maybe I missed the announcement but why is the documentation of the
older ceph version not accessible anymore on docs.ceph.com


The UI has changed because we're hosting them on readthedocs.com now. See
the dropdown in the lower right corner.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Documentation of older Ceph version not accessible anymore on docs.ceph.com

2020-11-20 Thread Anthony D'Atri
Same problem:

“Versions
latest  octopus nautilus”

This week I had to look up Jewel, Luminous, and Mimic docs and had to do so at
GitHub.

> 
>> Hello,
>> maybe I missed the announcement but why is the documentation of the
>> older ceph version not accessible anymore on docs.ceph.com
> 
> It's changed UI because we're hosting them on readthedocs.com now.  See the 
> dropdown in the lower right corner.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io