Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-04 Thread Marc Roos
 >
 >Try something like the following on each OSD that holds a copy of
 >rbd_data.1f114174b0dc51.0974 and see what output you get.
 >Note that you can drop the bluestore flag if they are not bluestore
 >osds and you will need the osd stopped at the time (set noout). Also
 >note, snapids are displayed in hexadecimal in the output (but then '4'
 >is '4' so not a big issue here).
 >
 >$ ceph-objectstore-tool --type bluestore --data-path
 >/var/lib/ceph/osd/ceph-XX/ --pgid 17.36 --op list
 >rbd_data.1f114174b0dc51.0974

I got these results

osd.7
Error getting attr on : 17.36_head,#-19:6c00:::scrub_17.36:head#, (61) No data available
["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","snapid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","snapid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]

osd.12
["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","sna
pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","sna
pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]

osd.29
["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","sna
pid":63,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]
["17.36",{"oid":"rbd_data.1f114174b0dc51.0974","key":"","sna
pid":-2,"hash":1357874486,"max":0,"pool":17,"namespace":"","max":0}]


 >
 >The likely issue here is the primary believes snapshot 4 is gone but
 >there is still data and/or metadata on one of the replicas which is
 >confusing the issue. If that is the case you can use the
 >ceph-objectstore-tool to delete the relevant snapshot(s).
 >
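If it does come to deleting a stale clone, here is a minimal sketch of that step (my assumption of the workflow, not something I have run against this PG: osd.XX is the replica holding the stale data, the OSD is stopped with noout set, and <clone-json> is the exact JSON line for that clone as printed by --op list above):

$ ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-XX/ \
    --pgid 17.36 '<clone-json>' remove

Afterwards start the OSD again and run "ceph pg deep-scrub 17.36" to re-check the PG.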
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mon sudden crash loop - pinned map

2019-10-04 Thread Philippe D'Anjou
Hi, our mon is acting up all of a sudden and dying in a crash loop with the 
following:

2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has v0 
lc 4549352
    -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
closest pinned map ver 252615 not available! error: (2) No such file or 
directory
    -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
/build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
7f6e5d461700 time 2019-10-04 14:00:24.347580
/build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)

 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x7f6e68eb064e]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char 
const*, ...)+0) [0x7f6e68eb0829]
 3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
 4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
 5: 
(OSDMonitor::encode_trim_extra(std::shared_ptr, 
unsigned long)+0x8c) [0x717c3c]
 6: (PaxosService::maybe_trim()+0x473) [0x707443]
 7: (Monitor::tick()+0xa9) [0x5ecf39]
 8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
 9: (Context::complete(int)+0x9) [0x6070d9]
 10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
 11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
 12: (()+0x76ba) [0x7f6e67cab6ba]
 13: (clone()+0x6d) [0x7f6e674d441d]

 0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) **
 in thread 7f6e5d461700 thread_name:safe_timer

 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable)
 1: (()+0x11390) [0x7f6e67cb5390]
 2: (gsignal()+0x38) [0x7f6e67402428]
 3: (abort()+0x16a) [0x7f6e6740402a]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a3) [0x7f6e68eb069f]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char 
const*, ...)+0) [0x7f6e68eb0829]
 6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
 7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
 8: 
(OSDMonitor::encode_trim_extra(std::shared_ptr, 
unsigned long)+0x8c) [0x717c3c]
 9: (PaxosService::maybe_trim()+0x473) [0x707443]
 10: (Monitor::tick()+0xa9) [0x5ecf39]
 11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
 12: (Context::complete(int)+0x9) [0x6070d9]
 13: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
 14: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
 15: (()+0x76ba) [0x7f6e67cab6ba]
 16: (clone()+0x6d) [0x7f6e674d441d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


This was running fine for 2 months now; it's a crashed cluster that is in 
recovery.
Any suggestions?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to set osd_crush_initial_weight 0 without restart any service

2019-10-04 Thread Paul Mezzanini
That would accomplish what you are looking for, yes.

Keep in mind that norebalance won't stop NEW data from landing there. 
It will only keep old data from migrating in. This shouldn't pose too much of 
an issue for most use cases.
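For what it's worth, a rough sketch of the sequence being discussed (osd.N is just a placeholder for whatever ID the new OSD ends up with):

$ ceph osd set norebalance
$ # add the OSD(s) with ceph-ansible and wait for them to appear in "ceph osd tree"
$ ceph osd crush reweight osd.N 0
$ ceph osd unset norebalance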

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.



From: Satish Patel 
Sent: Tuesday, October 1, 2019 2:45 PM
To: Paul Mezzanini
Cc: ceph-users
Subject: Re: [ceph-users] how to set osd_crush_initial_weight 0 without restart 
any service

You are saying: run "ceph osd set norebalance" before running the
ceph-ansible playbook to add the OSD,

then once the OSD is visible in "ceph osd tree" I should reweight it to 0
and then do "ceph osd unset norebalance"?

On Tue, Oct 1, 2019 at 2:41 PM Paul Mezzanini  wrote:
>
> You could also:
> ceph osd set norebalance
>
>
> --
> Paul Mezzanini
> Sr Systems Administrator / Engineer, Research Computing
> Information & Technology Services
> Finance & Administration
> Rochester Institute of Technology
> o:(585) 475-3245 | pfm...@rit.edu
>
> CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
> intended only for the person(s) or entity to which it is addressed and may
> contain confidential and/or privileged material. Any review, retransmission,
> dissemination or other use of, or taking of any action in reliance upon this
> information by persons or entities other than the intended recipient is
> prohibited. If you received this in error, please contact the sender and
> destroy any copies of this information.
> 
>
> 
> From: ceph-users  on behalf of Satish 
> Patel 
> Sent: Tuesday, October 1, 2019 2:34 PM
> To: ceph-users
> Subject: [ceph-users] how to set osd_crush_initial_weight 0 without restart 
> any service
>
> Folks,
>
> Method: 1
>
> In my lab I am playing with ceph and trying to understand how to add a
> new OSD without triggering rebalancing.
>
> I want to add this option on the fly so I don't need to restart any
> services or anything.
>
> $ ceph tell mon.* injectargs '--osd_crush_initial_weight 0'
>
> $ ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep
> osd_crush_initial_weight
> "osd_crush_initial_weight": "0.00",
>
> All looks good, but when I add an OSD with ceph-ansible it looks
> like it doesn't honor that option and adds the OSD with the default
> weight (in my case I have a 1.9TB SSD so the weight is 1.7)
>
> Can someone confirm injectargs work with osd_crush_initial_weight ?
>
>
> Method: 2
>
> Now I have added that option to the ceph-ansible playbook like the following
>
> ceph_conf_overrides:
>   osd:
>     osd_crush_initial_weight: 0
>
> and I ran the playbook and it did the magic and added the OSD with weight
> zero (0), but I noticed it restarted all OSD daemons on that node. I am
> worried: is it safe to restart OSD daemons on a production ceph cluster?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw: multisite support

2019-10-04 Thread Joachim Kraftmayer

Maybe this will help you:

https://docs.ceph.com/docs/master/radosgw/multisite/#migrating-a-single-site-system-to-multi-site
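In rough outline it boils down to something like the following (a sketch only; realm/zonegroup/zone names and the endpoint are placeholders, and the linked document is the authoritative sequence, including the system user whose keys the secondary cluster needs):

radosgw-admin realm create --rgw-realm=myrealm --default
radosgw-admin zonegroup rename --rgw-zonegroup default --zonegroup-new-name=us
radosgw-admin zone rename --rgw-zone default --zone-new-name=us-east --rgw-zonegroup=us
radosgw-admin zonegroup modify --rgw-realm=myrealm --rgw-zonegroup=us --endpoints http://rgw1:80 --master --default
radosgw-admin zone modify --rgw-realm=myrealm --rgw-zonegroup=us --rgw-zone=us-east --endpoints http://rgw1:80 --master --default
radosgw-admin period update --commit

Then on the second cluster: radosgw-admin realm pull, radosgw-admin zone create for the secondary zone, another period commit, and restart the gateways.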

___

Clyso GmbH


Am 03.10.2019 um 13:32 schrieb M Ranga Swami Reddy:

Thank you. Do we have a quick document to do this migration?

Thanks
Swami

On Thu, Oct 3, 2019 at 4:38 PM Paul Emmerich wrote:


On Thu, Oct 3, 2019 at 12:03 PM M Ranga Swami Reddy wrote:
>
> Below url says: "Switching from a standalone deployment to a
multi-site replicated deployment is not supported."
>

https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-rgw-multisite.html

this is wrong, might be a weird openstack-specific restriction.

Migrating single-site to multi-site is trivial, you just add the
second site.


Paul

>
> Please advise.
>
>
> On Thu, Oct 3, 2019 at 3:28 PM M Ranga Swami Reddy wrote:
>>
>> Hi,
>> I am using 2 ceph clusters in different DCs (500 km apart) with
ceph version 12.2.11.
>> Now, I want to setup rgw multisite using the above 2 ceph clusters.
>>
>> Is it possible? If yes, please share a good document for doing this.
>>
>> Thanks
>> Swami
>
> ___
> Dev mailing list -- d...@ceph.io 
> To unsubscribe send an email to dev-le...@ceph.io



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ssd requirements for wal/db

2019-10-04 Thread Kenneth Waegeman

Hi all,

We are thinking about putting the wal/db of our hdds on ssds. If we 
put the wal&db of 4 HDDs on 1 SSD, as recommended, what type of SSD would 
suffice?

We were thinking of using SATA Read Intensive 6Gbps 1DWPD SSDs.

Does someone have experience with this configuration? Would we need 
SAS ssds instead of SATA? And Mixed Use 3 DWPD instead of Read Intensive?
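For reference, a minimal sketch of how such a layout is usually created (assuming /dev/sdb is one of the HDDs and /dev/sdf1 is one of four partitions on the shared SSD; with only --block.db given, the WAL ends up on the DB device as well):

$ ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdf1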



Thank you very much!


Kenneth


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread Petr Bena

Hello,

If this is too long for you, TL;DR; section on the bottom

I created a CEPH cluster made of 3 SuperMicro servers, each with 2 OSDs 
(WD RED spinning drives), and I would like to optimize the performance of 
RBD, which I believe is blocked by some wrong CEPH configuration, 
because from my observation all resources (CPU, RAM, network, disks) are 
basically unused / idling even when I put load on the RBD.


Each drive should do about 50MB/s read / write. When I run the RADOS 
benchmark, I see values that are somewhat acceptable; the interesting part 
is that during the RADOS benchmark I can see all disks read / write at their 
limits, with heavy network utilization and even some CPU utilization. On the 
other hand, when I put any load on the RBD device, performance is 
terrible: reading is very slow (20MB/s), writing as well (5 - 20MB/s), and 
running dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s. The weirdest part is 
that resources are almost unused - no CPU usage, no network 
traffic, minimal disk activity.


It looks to me as if CEPH isn't even trying to perform much as long 
as the access is via RBD. Did anyone ever see this kind of issue? Is 
there any way to track down why it is so slow? Here are some outputs:


[root@ceph1 cephadm]# ceph --version
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable)

[root@ceph1 cephadm]# ceph health
HEALTH_OK

I would expect the write speed to be at least 50MB/s, which is the speed when 
writing to the disks directly; rados bench achieves this (sometimes even 
more):


[root@ceph1 cephadm]# rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
4194304 for up to 10 seconds or 0 objects

Object prefix: benchmark_data_ceph1.lan.insw.cz_60873
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) avg lat(s)

    0   0 0 0 0 0 -   0
    1  16    22 6   23.9966    24 0.966194    0.565671
    2  16    37    21   41.9945    60 1.86665    0.720606
    3  16    54    38   50.6597    68 1.07856    0.797677
    4  16    70    54   53.9928    64 1.58914 0.86644
    5  16    83    67   53.5924    52 0.208535    0.884525
    6  16    97    81   53.9923    56 2.22661    0.932738
    7  16   111    95   54.2781    56 1.0294    0.964574
    8  16   133   117   58.4921    88 0.883543 1.03648
    9  16   143   127   56.4369    40 0.352169 1.00382
   10  16   154   138   55.1916    44 0.227044 1.04071

Read speed is even higher as it's probably reading from multiple devices 
at once:


[root@ceph1 cephadm]# rados bench -p testbench 100 seq
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s) avg lat(s)

    0   0 0 0 0 0 -   0
    1  16    96    80   319.934   320 0.811192    0.174081
    2  13   161   148   295.952   272 0.606672    0.181417


Running rbd bench shows writes at 50MB/s (which is OK) and reads at 
20MB/s (not so OK), but the REAL performance is much worse - when I 
actually access the block device and try to write or read anything it's 
sometimes extremely low, as in only 5MB/s or 20MB/s.


Why is that? What can I do to debug / trace / optimize this issue? I 
don't know if there is any point in upgrading the hardware if, according 
to monitoring, the current HW is basically not being utilized at all.



TL;DR;

I created a ceph cluster from 6 OSDs (dedicated 1G net, 6 4TB spinning 
drives). The rados performance benchmark shows acceptable performance, 
but RBD performance is absolutely terrible (very slow read and very slow 
write). When I put any kind of load on the cluster almost all resources are 
unused / idling, so this feels like a software configuration issue.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread Alexandre DERUMIER
Hi,

>>dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -

you are testing with a single thread/iodepth=1 sequentially here.
Then only 1 disk at a time, and you have network latency too.

rados bench is doing 16 concurrent writes.


Try to test with fio, for example with a bigger iodepth, small block/big block, 
seq/rand.
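For example, something along these lines (a sketch only; the pool and image names are placeholders, and the rbd ioengine needs fio built with librbd support):

fio --name=seq4m --ioengine=rbd --clientname=admin --pool=testbench --rbdname=testimg \
    --rw=write --bs=4M --iodepth=16 --direct=1 --runtime=60 --time_based

fio --name=rand4k --ioengine=rbd --clientname=admin --pool=testbench --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=32 --direct=1 --runtime=60 --time_based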



- Mail original -
De: "Petr Bena" 
À: "ceph-users" 
Envoyé: Vendredi 4 Octobre 2019 17:06:48
Objet: [ceph-users] Optimizing terrible RBD performance

Hello, 

If this is too long for you, TL;DR; section on the bottom 

I created a CEPH cluster made of 3 SuperMicro servers, each with 2 OSD 
(WD RED spinning drives) and I would like to optimize the performance of 
RBD, which I believe is blocked by some wrong CEPH configuration, 
because from my observation all resources (CPU, RAM, network, disks) are 
basically unused / idling even when I put load on the RBD. 

Each drive should be 50MB/s read / write and when I run RADOS benchmark, 
I see values that are somewhat acceptable, interesting part is that when 
I run RADOS benchmark, I can see all disks read / write to their limits, 
I can see heavy network utilization and even some CPU utilization - on 
other hand, when I put any load on the RBD device, performance is 
terrible, reading is very slow (20MB/s) writing as well (5 - 20MB/s), 
running dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s - and the most 
weird part - resources are almost unused - no CPU usage, no network 
traffic, minimal disk activity. 

It looks to me like if CEPH wasn't even trying to perform much as long 
as the access is via RBD, did anyone ever saw this kind of issue? Is 
there any way to track down why it is so slow? Here are some outputs: 

[root@ceph1 cephadm]# ceph --version 
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable) 
[root@ceph1 cephadm]# ceph health 
HEALTH_OK 

I would expect write speed to be at least the 50MB/s which is speed when 
writing to disks directly, rados bench does this speed (sometimes even 
more): 

[root@ceph1 cephadm]# rados bench -p testbench 10 write --no-cleanup 
hints = 1 
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
4194304 for up to 10 seconds or 0 objects 
Object prefix: benchmark_data_ceph1.lan.insw.cz_60873 
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg 
lat(s) 
0 0 0 0 0 0 - 0 
1 16 22 6 23.9966 24 0.966194 0.565671 
2 16 37 21 41.9945 60 1.86665 0.720606 
3 16 54 38 50.6597 68 1.07856 0.797677 
4 16 70 54 53.9928 64 1.58914 0.86644 
5 16 83 67 53.5924 52 0.208535 0.884525 
6 16 97 81 53.9923 56 2.22661 0.932738 
7 16 111 95 54.2781 56 1.0294 0.964574 
8 16 133 117 58.4921 88 0.883543 1.03648 
9 16 143 127 56.4369 40 0.352169 1.00382 
10 16 154 138 55.1916 44 0.227044 1.04071 

Read speed is even higher as it's probably reading from multiple devices 
at once: 

[root@ceph1 cephadm]# rados bench -p testbench 100 seq 
hints = 1 
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg 
lat(s) 
0 0 0 0 0 0 - 0 
1 16 96 80 319.934 320 0.811192 0.174081 
2 13 161 148 295.952 272 0.606672 0.181417 


Running rbd bench show writes at 50MB/s (which is OK) and reads at 
20MB/s (not so OK), but the REAL performance is much worse - when I 
actually access the block device and try to write or read anything it's 
sometimes extremely low as in 5MB/s or 20MB/s only. 

Why is that? What can I do to debug / trace / optimize this issue? I 
don't know if there is any point in upgrading the hardware if according 
to monitoring current HW is basically not being utilized at all. 


TL;DR; 

I created a ceph cluster from 6 OSD (dedicated 1G net, 6 4TB spinning 
drives), the rados performance benchmark shows acceptable performance, 
but RBD peformance is absolutely terrible (very slow read and very slow 
write). When I put any kind of load on cluster almost all resources are 
unused / idling, so this makes me feel like software configuration issue. 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread Petr Bena

Hello,

I tried to use FIO on the RBD device I just created and writing is really 
terrible (around 1.5MB/s).


[root@ceph3 tmp]# fio test.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
(T) 4096B-4096B, ioengine=rbd, iodepth=32

fio-3.7
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1628KiB/s][r=0,w=407 IOPS][eta 
00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=115425: Fri Oct  4 
17:25:24 2019

  write: IOPS=384, BW=1538KiB/s (1574kB/s)(39.1MiB/26016msec)
    slat (nsec): min=1452, max=591931, avg=14498.83, stdev=17295.97
    clat (usec): min=1795, max=793172, avg=83218.39, stdev=83485.65
 lat (usec): min=1810, max=793201, avg=83232.89, stdev=83485.19
    clat percentiles (msec):
 |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   12],
 | 30.00th=[   21], 40.00th=[   36], 50.00th=[   61], 60.00th=[   89],
 | 70.00th=[  116], 80.00th=[  146], 90.00th=[  190], 95.00th=[  218],
 | 99.00th=[  380], 99.50th=[  430], 99.90th=[  625], 99.95th=[  768],
 | 99.99th=[  793]
   bw (  KiB/s): min=  520, max= 4648, per=99.77%, avg=1533.40, 
stdev=754.35, samples=52

   iops    : min=  130, max= 1162, avg=383.33, stdev=188.61, samples=52
  lat (msec)   : 2=0.08%, 4=4.77%, 10=13.56%, 20=11.66%, 50=16.40%
  lat (msec)   : 100=17.66%, 250=32.53%, 500=3.05%, 750=0.21%, 1000=0.08%
  cpu  : usr=0.57%, sys=0.52%, ctx=3976, majf=0, minf=8489
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, 
>=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
>=64=0.0%

 issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=1538KiB/s (1574kB/s), 1538KiB/s-1538KiB/s 
(1574kB/s-1574kB/s), io=39.1MiB (40.0MB), run=26016-26016msec


Disk stats (read/write):
    dm-6: ios=0/2, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=20/368, aggrmerge=0/195, aggrticks=105/6248, aggrin_queue=6353, 
aggrutil=9.07%

  xvda: ios=20/368, merge=0/195, ticks=105/6248, in_queue=6353, util=9.07%


Incomparably worse than the RADOS bench results.

On 04/10/2019 17:15, Alexandre DERUMIER wrote:

Hi,


dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -

you are testing with a single thread/iodepth=1 sequentially here.
Then only 1 disk at time, and you have network latency too.

rados bench is doing 16 concurrent write.


Try to test with fio for example, with bigger iodepth,  small block/big block , 
seq/rand.



- Mail original -
De: "Petr Bena" 
À: "ceph-users" 
Envoyé: Vendredi 4 Octobre 2019 17:06:48
Objet: [ceph-users] Optimizing terrible RBD performance

Hello,

If this is too long for you, TL;DR; section on the bottom

I created a CEPH cluster made of 3 SuperMicro servers, each with 2 OSD
(WD RED spinning drives) and I would like to optimize the performance of
RBD, which I believe is blocked by some wrong CEPH configuration,
because from my observation all resources (CPU, RAM, network, disks) are
basically unused / idling even when I put load on the RBD.

Each drive should be 50MB/s read / write and when I run RADOS benchmark,
I see values that are somewhat acceptable, interesting part is that when
I run RADOS benchmark, I can see all disks read / write to their limits,
I can see heavy network utilization and even some CPU utilization - on
other hand, when I put any load on the RBD device, performance is
terrible, reading is very slow (20MB/s) writing as well (5 - 20MB/s),
running dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s - and the most
weird part - resources are almost unused - no CPU usage, no network
traffic, minimal disk activity.

It looks to me like if CEPH wasn't even trying to perform much as long
as the access is via RBD, did anyone ever saw this kind of issue? Is
there any way to track down why it is so slow? Here are some outputs:

[root@ceph1 cephadm]# ceph --version
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus
(stable)
[root@ceph1 cephadm]# ceph health
HEALTH_OK

I would expect write speed to be at least the 50MB/s which is speed when
writing to disks directly, rados bench does this speed (sometimes even
more):

[root@ceph1 cephadm]# rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size
4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph1.lan.insw.cz_60873
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg
lat(s)
0 0 0 0 0 0 - 0
1 16 22 6 23.9966 24 0.966194 0.565671
2 16 37 21 41.9945 60 1.86665 0.720606
3 16 54 38 50.6597 68 1.07856 0.797677
4 16 70 54 53.9928 64 1.58914 0.86644
5 16 83 67 53.5924 52 0.208535 0.884525
6 16 97 81 53.9923 56 2.22661 0.932738
7 16 111 95 54.2781 56 1.0294 0.964574
8 16 133 117 58.4921 88 0.883543 1.03648
9 16 143 127 56.4369 40 0.352169 

Re: [ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread JC Lopez
Hi,

your RBD bench and RADOS bench use by default 4MB IO request size while your 
FIO is configured for 4KB IO request size.

If you want to compare apples to apples (bandwidth) you need to change the FIO IO 
request size to 4194304 (4MB). Plus, you tested a sequential workload with RADOS 
bench but a random one with fio.

Make sure you align all parameters to obtain results you can compare

Other note: What block size did you specify with your dd command?

By default the block size is 512 bytes, so even smaller than the 4KB you 
used for FIO and miles away from the 4MB you used for RADOS bench. Be mindful 
that 5MB/s for your dd works out to roughly 1 IOPS at the 4MB RADOS object size.

JC

> On Oct 4, 2019, at 08:28, Petr Bena  wrote:
> 
> Hello,
> 
> I tried to use FIO on RBD device I just created and writing is really 
> terrible (around 1.5MB/s)
> 
> [root@ceph3 tmp]# fio test.fio
> rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
> 4096B-4096B, ioengine=rbd, iodepth=32
> fio-3.7
> Starting 1 process
> Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1628KiB/s][r=0,w=407 IOPS][eta 
> 00m:00s]
> rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=115425: Fri Oct  4 17:25:24 
> 2019
>   write: IOPS=384, BW=1538KiB/s (1574kB/s)(39.1MiB/26016msec)
> slat (nsec): min=1452, max=591931, avg=14498.83, stdev=17295.97
> clat (usec): min=1795, max=793172, avg=83218.39, stdev=83485.65
>  lat (usec): min=1810, max=793201, avg=83232.89, stdev=83485.19
> clat percentiles (msec):
>  |  1.00th=[3],  5.00th=[5], 10.00th=[7], 20.00th=[   12],
>  | 30.00th=[   21], 40.00th=[   36], 50.00th=[   61], 60.00th=[   89],
>  | 70.00th=[  116], 80.00th=[  146], 90.00th=[  190], 95.00th=[  218],
>  | 99.00th=[  380], 99.50th=[  430], 99.90th=[  625], 99.95th=[  768],
>  | 99.99th=[  793]
>bw (  KiB/s): min=  520, max= 4648, per=99.77%, avg=1533.40, stdev=754.35, 
> samples=52
>iops: min=  130, max= 1162, avg=383.33, stdev=188.61, samples=52
>   lat (msec)   : 2=0.08%, 4=4.77%, 10=13.56%, 20=11.66%, 50=16.40%
>   lat (msec)   : 100=17.66%, 250=32.53%, 500=3.05%, 750=0.21%, 1000=0.08%
>   cpu  : usr=0.57%, sys=0.52%, ctx=3976, majf=0, minf=8489
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
> >=64=0.0%
>  issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
>  latency   : target=0, window=0, percentile=100.00%, depth=32
> 
> Run status group 0 (all jobs):
>   WRITE: bw=1538KiB/s (1574kB/s), 1538KiB/s-1538KiB/s (1574kB/s-1574kB/s), 
> io=39.1MiB (40.0MB), run=26016-26016msec
> 
> Disk stats (read/write):
> dm-6: ios=0/2, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
> aggrios=20/368, aggrmerge=0/195, aggrticks=105/6248, aggrin_queue=6353, 
> aggrutil=9.07%
>   xvda: ios=20/368, merge=0/195, ticks=105/6248, in_queue=6353, util=9.07%
> 
> 
> Uncomparably worse to RADOS bench results
> 
> On 04/10/2019 17:15, Alexandre DERUMIER wrote:
>> Hi,
>> 
 dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -
>> you are testing with a single thread/iodepth=1 sequentially here.
>> Then only 1 disk at time, and you have network latency too.
>> 
>> rados bench is doing 16 concurrent write.
>> 
>> 
>> Try to test with fio for example, with bigger iodepth,  small block/big 
>> block , seq/rand.
>> 
>> 
>> 
>> - Mail original -
>> De: "Petr Bena" 
>> À: "ceph-users" 
>> Envoyé: Vendredi 4 Octobre 2019 17:06:48
>> Objet: [ceph-users] Optimizing terrible RBD performance
>> 
>> Hello,
>> 
>> If this is too long for you, TL;DR; section on the bottom
>> 
>> I created a CEPH cluster made of 3 SuperMicro servers, each with 2 OSD
>> (WD RED spinning drives) and I would like to optimize the performance of
>> RBD, which I believe is blocked by some wrong CEPH configuration,
>> because from my observation all resources (CPU, RAM, network, disks) are
>> basically unused / idling even when I put load on the RBD.
>> 
>> Each drive should be 50MB/s read / write and when I run RADOS benchmark,
>> I see values that are somewhat acceptable, interesting part is that when
>> I run RADOS benchmark, I can see all disks read / write to their limits,
>> I can see heavy network utilization and even some CPU utilization - on
>> other hand, when I put any load on the RBD device, performance is
>> terrible, reading is very slow (20MB/s) writing as well (5 - 20MB/s),
>> running dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s - and the most
>> weird part - resources are almost unused - no CPU usage, no network
>> traffic, minimal disk activity.
>> 
>> It looks to me like if CEPH wasn't even trying to perform much as long
>> as the access is via RBD, did anyone ever saw this kind of issue? Is
>> there any way to track down why it is so slow? Here are some outputs:
>> 
>> [root@ceph1 cephadm]# ceph --version
>> ceph ver

Re: [ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread Maged Mokhtar
The tests are measuring different things, and the fio test result of 1.5 
MB/s is not bad.


The rados write bench uses a 4M block size by default, does 16 threads, 
and is random in nature; you can change the block size and thread count.


The dd command uses a 512-byte block size and 1 thread by default and is 
sequential in nature. You can change the block size via bs to 4M and it 
will give higher results; it will also use buffered io unless you make it 
non buffered (oflag=direct).
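For instance (a sketch; note that it overwrites the start of the device, as your earlier dd already did):

dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct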


With fio you have full control over block size, threads, rand/seq, 
buffered, direct, sync, etc. The fio test you are running uses a queue 
depth of 32 and 4k random writes. To compare with rados, change the 
block size to 4M and make it sequential.


The 1.58 MB/s is not bad for this test. At 4k this is about 400 iops; if you 
are doing standard 3x replicas, your cluster is doing 1200 iops, and this 
is just for client data. It also has other overhead like metadata db 
lookups/updates, so it is actually doing more, but even 1200 random iops 
across 6 spinning disks gives 200 random iops per disk, which is acceptable.


/Maged


On 04/10/2019 17:28, Petr Bena wrote:

Hello,

I tried to use FIO on RBD device I just created and writing is really 
terrible (around 1.5MB/s)


[root@ceph3 tmp]# fio test.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 
4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32

fio-3.7
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1628KiB/s][r=0,w=407 
IOPS][eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=115425: Fri Oct  4 
17:25:24 2019

  write: IOPS=384, BW=1538KiB/s (1574kB/s)(39.1MiB/26016msec)
    slat (nsec): min=1452, max=591931, avg=14498.83, stdev=17295.97
    clat (usec): min=1795, max=793172, avg=83218.39, stdev=83485.65
 lat (usec): min=1810, max=793201, avg=83232.89, stdev=83485.19
    clat percentiles (msec):
 |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   
12],
 | 30.00th=[   21], 40.00th=[   36], 50.00th=[   61], 60.00th=[   
89],
 | 70.00th=[  116], 80.00th=[  146], 90.00th=[  190], 95.00th=[  
218],
 | 99.00th=[  380], 99.50th=[  430], 99.90th=[  625], 99.95th=[  
768],

 | 99.99th=[  793]
   bw (  KiB/s): min=  520, max= 4648, per=99.77%, avg=1533.40, 
stdev=754.35, samples=52
   iops    : min=  130, max= 1162, avg=383.33, stdev=188.61, 
samples=52

  lat (msec)   : 2=0.08%, 4=4.77%, 10=13.56%, 20=11.66%, 50=16.40%
  lat (msec)   : 100=17.66%, 250=32.53%, 500=3.05%, 750=0.21%, 1000=0.08%
  cpu  : usr=0.57%, sys=0.52%, ctx=3976, majf=0, minf=8489
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, 
>=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
>=64=0.0%

 issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=1538KiB/s (1574kB/s), 1538KiB/s-1538KiB/s 
(1574kB/s-1574kB/s), io=39.1MiB (40.0MB), run=26016-26016msec


Disk stats (read/write):
    dm-6: ios=0/2, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=20/368, aggrmerge=0/195, aggrticks=105/6248, 
aggrin_queue=6353, aggrutil=9.07%
  xvda: ios=20/368, merge=0/195, ticks=105/6248, in_queue=6353, 
util=9.07%



Uncomparably worse to RADOS bench results

On 04/10/2019 17:15, Alexandre DERUMIER wrote:

Hi,


dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -

you are testing with a single thread/iodepth=1 sequentially here.
Then only 1 disk at time, and you have network latency too.

rados bench is doing 16 concurrent write.


Try to test with fio for example, with bigger iodepth,  small 
block/big block , seq/rand.




- Mail original -
De: "Petr Bena" 
À: "ceph-users" 
Envoyé: Vendredi 4 Octobre 2019 17:06:48
Objet: [ceph-users] Optimizing terrible RBD performance

Hello,

If this is too long for you, TL;DR; section on the bottom

I created a CEPH cluster made of 3 SuperMicro servers, each with 2 OSD
(WD RED spinning drives) and I would like to optimize the performance of
RBD, which I believe is blocked by some wrong CEPH configuration,
because from my observation all resources (CPU, RAM, network, disks) are
basically unused / idling even when I put load on the RBD.

Each drive should be 50MB/s read / write and when I run RADOS benchmark,
I see values that are somewhat acceptable, interesting part is that when
I run RADOS benchmark, I can see all disks read / write to their limits,
I can see heavy network utilization and even some CPU utilization - on
other hand, when I put any load on the RBD device, performance is
terrible, reading is very slow (20MB/s) writing as well (5 - 20MB/s),
running dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s - and the most
weird part - resources are almost unused - no CPU usage, no network
traffic, minimal disk activity.

It looks to me like if CEPH wasn't even trying to p

Re: [ceph-users] ssd requirements for wal/db

2019-10-04 Thread Vitaliy Filippov
WAL/DB isn't "read intensive". It's more "write intensive" :) use server  
SSDs with capacitors to get adequate write performance.



Hi all,

We are thinking about putting our wal/db of hdds/ on ssds. If we would
put the wal&db of 4 HDDS on 1 SSD as recommended, what type of SSD would
suffice?
We were thinking of using SATA Read Intensive 6Gbps 1DWPD SSDs.

Does someone has some experience with this configuration? Would we need
SAS ssds instead of SATA? And Mixed Use 3WPD instead of Read intensive?


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ssd requirements for wal/db

2019-10-04 Thread Stijn De Weirdt
hi all,

maybe to clarify a bit, e.g.
https://indico.cern.ch/event/755842/contributions/3243386/attachments/1784159/2904041/2019-jcollet-openlab.pdf
clearly shows that the db+wal disks are not saturated,
but we are wondering what is really needed/acceptable wrt throughput and
latency (eg is 6gbps sata enough or is 12gbps sas needed); we are
thinking of combining 4 or 5 7.2k rpm disks with one ssd.

similar question with the read-intensive ssds: how much is actually written
to the db+wal compared to the data disk? is that 1-to-1?
do people see eg 1 DWPD on their db+wal devices? (i guess it depends;)
if so, what kind of daily volume does that average out to for your workloads?
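(one way to get a rough number, if i'm not mistaken, is the bluefs perf
counters on a running osd, e.g.

  ceph daemon osd.0 perf dump | grep -E '"bytes_written_(wal|sst)"'

which gives cumulative bytes written to the wal and to the rocksdb ssts
since the osd started)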

thanks for pointing out the capacitor issue, something to definitely
double check for the (cheaper) read intensive ssds.


stijn

On 10/4/19 7:29 PM, Vitaliy Filippov wrote:
> WAL/DB isn't "read intensive". It's more "write intensive" :) use server
> SSDs with capacitors to get adequate write performance.
> 
>> Hi all,
>>
>> We are thinking about putting our wal/db of hdds/ on ssds. If we would
>> put the wal&db of 4 HDDS on 1 SSD as recommended, what type of SSD would
>> suffice?
>> We were thinking of using SATA Read Intensive 6Gbps 1DWPD SSDs.
>>
>> Does someone has some experience with this configuration? Would we need
>> SAS ssds instead of SATA? And Mixed Use 3WPD instead of Read intensive?
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rgw: multisite support

2019-10-04 Thread DHilsbos
Swami;

For 12.2.11 (Luminous), the previously linked document would be:
https://docs.ceph.com/docs/luminous/radosgw/multisite/#migrating-a-single-site-system-to-multi-site

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Joachim Kraftmayer
Sent: Friday, October 04, 2019 7:50 AM
To: M Ranga Swami Reddy
Cc: ceph-users; d...@ceph.io
Subject: Re: [ceph-users] rgw: multisite support

Maybe this will help you:
https://docs.ceph.com/docs/master/radosgw/multisite/#migrating-a-single-site-system-to-multi-site

___

Clyso GmbH


Am 03.10.2019 um 13:32 schrieb M Ranga Swami Reddy:
Thank you. Do we have a quick document to do this migration? 

Thanks
Swami

On Thu, Oct 3, 2019 at 4:38 PM Paul Emmerich  wrote:
On Thu, Oct 3, 2019 at 12:03 PM M Ranga Swami Reddy
 wrote:
>
> Below url says: "Switching from a standalone deployment to a multi-site 
> replicated deployment is not supported.
> https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-rgw-multisite.html

this is wrong, might be a weird openstack-specific restriction.

Migrating single-site to multi-site is trivial, you just add the second site.


Paul

>
> Please advise.
>
>
> On Thu, Oct 3, 2019 at 3:28 PM M Ranga Swami Reddy  
> wrote:
>>
>> Hi,
>> Iam using the 2 ceph clusters in diff DCs (away by 500 KM) with ceph 12.2.11 
>> version.
>> Now, I want to setup rgw multisite using the above 2 ceph clusters.
>>
>> is it possible? if yes, please share good document to do the same.
>>
>> Thanks
>> Swami
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread Petr Bena

Thank you guys,

I changed FIO parameters and it looks far better now - reading about 
150MB/s, writing over 60MB/s


Now, the question is: what could I change in my setup to make it this 
fast in practice? The RBD is used as an LVM PV for a VG shared between Xen 
hypervisors; this is the PV:


  --- Physical volume ---
  PV Name   /dev/rbd0
  VG Name VG_XenStorage-275588a7-4895-9073-aa81-61a3d98dfba7
  PV Size   4.00 TiB / not usable 0
  Allocatable   yes
  PE Size   4.00 MiB
  Total PE  1048573
  Free PE   740048
  Allocated PE  308525
  PV UUID   IieC3P-2dw4-Zotx-ZG8v-TKV0-WBBP-5YQF4P

The physical extent size is 4MB, but I am not sure if that really means 
anything here, or whether the LVM subsystem on Linux can be tweaked in 
terms of how large the blocks being read / written are. Is there anything 
I can do to improve the performance, except for replacing the disks with 
SSDs? Does it mean that IOPS is my bottleneck now?


On 04/10/2019 18:53, Maged Mokhtar wrote:
The tests are measuring differing things, and fio test result of 1.5 
MB/s is not bad.


The rados write bench uses by default 4M block size and does 16 
threads and is random in nature, you can change the block size and 
thread count.


The dd command uses by default 512 block size and and 1 thread and is 
sequential in nature. You can change the block size via bs to 4M and 
it will give high results, it will also use buffered io unless you 
make it non buffered (oflag=direct).


with fio you have full control on block size, threads, rand/seq, 
buffered, direct, sync..etc. The fio test you are running uses 32 
queue depths / threads, 4k random write. To compare with rados, change 
the block size to 4M and make it sequential.


The 1.58 MB/s is not bad for the test. At 4k this is 400 iops, if you 
are doing standard 3x replias, your cluster is doing 1200 iops and 
this is just for client data, it does have other overhead like metada 
db lookups/updates so it is actually doing more, but even 1200 random 
iops for 6 spinning disk gives 200 random iops per disk which is 
acceptable.


/Maged


On 04/10/2019 17:28, Petr Bena wrote:

Hello,

I tried to use FIO on RBD device I just created and writing is really 
terrible (around 1.5MB/s)


[root@ceph3 tmp]# fio test.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 
4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32

fio-3.7
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1628KiB/s][r=0,w=407 
IOPS][eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=115425: Fri Oct 4 
17:25:24 2019

  write: IOPS=384, BW=1538KiB/s (1574kB/s)(39.1MiB/26016msec)
    slat (nsec): min=1452, max=591931, avg=14498.83, stdev=17295.97
    clat (usec): min=1795, max=793172, avg=83218.39, stdev=83485.65
 lat (usec): min=1810, max=793201, avg=83232.89, stdev=83485.19
    clat percentiles (msec):
 |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   
12],
 | 30.00th=[   21], 40.00th=[   36], 50.00th=[   61], 60.00th=[   
89],
 | 70.00th=[  116], 80.00th=[  146], 90.00th=[  190], 95.00th=[  
218],
 | 99.00th=[  380], 99.50th=[  430], 99.90th=[  625], 99.95th=[  
768],

 | 99.99th=[  793]
   bw (  KiB/s): min=  520, max= 4648, per=99.77%, avg=1533.40, 
stdev=754.35, samples=52
   iops    : min=  130, max= 1162, avg=383.33, stdev=188.61, 
samples=52

  lat (msec)   : 2=0.08%, 4=4.77%, 10=13.56%, 20=11.66%, 50=16.40%
  lat (msec)   : 100=17.66%, 250=32.53%, 500=3.05%, 750=0.21%, 
1000=0.08%

  cpu  : usr=0.57%, sys=0.52%, ctx=3976, majf=0, minf=8489
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%, 
>=64=0.0%
 submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, 
>=64=0.0%

 issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=1538KiB/s (1574kB/s), 1538KiB/s-1538KiB/s 
(1574kB/s-1574kB/s), io=39.1MiB (40.0MB), run=26016-26016msec


Disk stats (read/write):
    dm-6: ios=0/2, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, 
aggrios=20/368, aggrmerge=0/195, aggrticks=105/6248, 
aggrin_queue=6353, aggrutil=9.07%
  xvda: ios=20/368, merge=0/195, ticks=105/6248, in_queue=6353, 
util=9.07%



Uncomparably worse to RADOS bench results

On 04/10/2019 17:15, Alexandre DERUMIER wrote:

Hi,


dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -

you are testing with a single thread/iodepth=1 sequentially here.
Then only 1 disk at time, and you have network latency too.

rados bench is doing 16 concurrent write.


Try to test with fio for example, with bigger iodepth,  small 
block/big block , seq/rand.




- Mail original -
De: "Petr Bena" 
À: "ceph-users" 
Envoyé: Vendredi 4 Octobre 2019 17:06:48
Objet: [ceph-users] Optimizing terrible RBD performance

Hel

Re: [ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread Maged Mokhtar


The 4M throughput numbers you see now (150 MB/s read, 60 MB/s write) 
are probably limited by your 1G network, and can probably go higher if 
you upgrade it (10G or active bonds).


In real life, the applications and workloads determine the block size, io 
depth, whether it is sequential or random, and whether it uses cache 
buffering or asks to bypass the cache. Very few applications 
(such as backup) allow you to specify such settings.


So what you could do is understand your workload: is it backups, which 
use large sequential blocks, or virtualization or databases, which 
require high iops with small block sizes? Then use a tool like fio to 
see what your hardware is able to provide under different 
configurations. It is also a good idea to run a load collection tool 
like atop/sar/collectl during such tests to find your bottlenecks and 
help you decide what to change (adding osds or nodes, 
different disk types, network configuration).
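Concretely, running something like the following alongside the fio test already tells a lot (a sketch; adjust the intervals as needed):

sar -d -p 1     # per-disk utilization and latency
sar -n DEV 1    # per-interface network throughput
atop 1          # combined cpu/memory/disk/network view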


For example, if your workload is backups and you need to go above 150 MB/s, 
the first step is to bump up your 1G network; then, if you still need higher 
throughput, you can add more osds and then nodes, etc.


If your workload requires something like 50k random iops, you will not be 
able to achieve that with hdds.


/Maged

On 04/10/2019 21:00, Petr Bena wrote:

Thank you guys,

I changed FIO parameters and it looks far better now - reading about 
150MB/s, writing over 60MB/s


Now, the question is, what could I change in my setup to make it this 
fast - the RBD is used as LVM PV for a VG shared between Xen 
hypervisors, this is the PV:


  --- Physical volume ---
  PV Name   /dev/rbd0
  VG Name VG_XenStorage-275588a7-4895-9073-aa81-61a3d98dfba7
  PV Size   4.00 TiB / not usable 0
  Allocatable   yes
  PE Size   4.00 MiB
  Total PE  1048573
  Free PE   740048
  Allocated PE  308525
  PV UUID   IieC3P-2dw4-Zotx-ZG8v-TKV0-WBBP-5YQF4P

Physical extent size is 4MB, but not sure if that really means 
anything in this sense, I am not sure if LVM subsystem on Linux can be 
tweaked in how large block that is being read / written is? Is there 
anything I can do to improve the performance, except for replacing 
with SSD disks? Does it mean that IOPS is my bottleneck now?


On 04/10/2019 18:53, Maged Mokhtar wrote:
The tests are measuring differing things, and fio test result of 1.5 
MB/s is not bad.


The rados write bench uses by default 4M block size and does 16 
threads and is random in nature, you can change the block size and 
thread count.


The dd command uses by default 512 block size and and 1 thread and is 
sequential in nature. You can change the block size via bs to 4M and 
it will give high results, it will also use buffered io unless you 
make it non buffered (oflag=direct).


with fio you have full control on block size, threads, rand/seq, 
buffered, direct, sync..etc. The fio test you are running uses 32 
queue depths / threads, 4k random write. To compare with rados, 
change the block size to 4M and make it sequential.


The 1.58 MB/s is not bad for the test. At 4k this is 400 iops, if you 
are doing standard 3x replias, your cluster is doing 1200 iops and 
this is just for client data, it does have other overhead like metada 
db lookups/updates so it is actually doing more, but even 1200 random 
iops for 6 spinning disk gives 200 random iops per disk which is 
acceptable.


/Maged


On 04/10/2019 17:28, Petr Bena wrote:

Hello,

I tried to use FIO on RBD device I just created and writing is 
really terrible (around 1.5MB/s)


[root@ceph3 tmp]# fio test.fio
rbd_iodepth32: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 
4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=32

fio-3.7
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][r=0KiB/s,w=1628KiB/s][r=0,w=407 
IOPS][eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=115425: Fri Oct 4 
17:25:24 2019

  write: IOPS=384, BW=1538KiB/s (1574kB/s)(39.1MiB/26016msec)
    slat (nsec): min=1452, max=591931, avg=14498.83, stdev=17295.97
    clat (usec): min=1795, max=793172, avg=83218.39, stdev=83485.65
 lat (usec): min=1810, max=793201, avg=83232.89, stdev=83485.19
    clat percentiles (msec):
 |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 
20.00th=[   12],
 | 30.00th=[   21], 40.00th=[   36], 50.00th=[   61], 
60.00th=[   89],
 | 70.00th=[  116], 80.00th=[  146], 90.00th=[  190], 95.00th=[  
218],
 | 99.00th=[  380], 99.50th=[  430], 99.90th=[  625], 99.95th=[  
768],

 | 99.99th=[  793]
   bw (  KiB/s): min=  520, max= 4648, per=99.77%, avg=1533.40, 
stdev=754.35, samples=52
   iops    : min=  130, max= 1162, avg=383.33, stdev=188.61, 
samples=52

  lat (msec)   : 2=0.08%, 4=4.77%, 10=13.56%, 20=11.66%, 50=16.40%
  lat (msec)   : 100=17.66%, 250=32.53%, 500=3.05%, 750=0.21%, 
1000=0.08%

  cpu  : usr=0.57%, sys=0.52%, ctx=3976, majf=0, minf=8489
  IO depths    : 1=0.1%, 2=

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-04 Thread Gregory Farnum
Hmm, that assert means the monitor tried to grab an OSDMap it had on
disk but it didn't work. (In particular, a "pinned" full map which we
kept around after trimming the others to save on disk space.)

That *could* be a bug where we didn't have the pinned map and should
have (or incorrectly thought we should have), but this code was in
Mimic as well as Nautilus and I haven't seen similar reports. So it
could also mean that something bad happened to the monitor's disk or
Rocksdb store. Can you turn it off and rebuild from the remainder, or
do they all exhibit this bug?
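(If only this one mon is bad and a healthy quorum still exists, a rough sketch
of the usual reprovisioning, with the mon id taken from the log above; this is
the generic procedure, not something specific to this assert:

ID=km-fsn-1-dc4-m1-797678
systemctl stop ceph-mon@$ID
mv /var/lib/ceph/mon/ceph-$ID /var/lib/ceph/mon/ceph-$ID.bad
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i $ID --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$ID
systemctl start ceph-mon@$ID

so the fresh store syncs from the remaining monitors.)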


On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou
 wrote:
>
> Hi,
> our mon is acting up all of a sudden and dying in crash loop with the 
> following:
>
>
> 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
> -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 
> mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) 
> is_readable = 1 - now=2019-10-04 14:00:24.339620 lease_expire=0.00 has v0 
> lc 4549352
> -2> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> mon.km-fsn-1-dc4-m1-797678@0(leader).osd e257349 get_full_from_pinned_map 
> closest pinned map ver 252615 not available! error: (2) No such file or 
> directory
> -1> 2019-10-04 14:00:24.343 7f6e5d461700 -1 
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: In function 'int 
> OSDMonitor::get_full_from_pinned_map(version_t, ceph::bufferlist&)' thread 
> 7f6e5d461700 time 2019-10-04 14:00:24.347580
> /build/ceph-14.2.4/src/mon/OSDMonitor.cc: 3932: FAILED ceph_assert(err == 0)
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x152) [0x7f6e68eb064e]
>  2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> char const*, ...)+0) [0x7f6e68eb0829]
>  3: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  4: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  5: 
> (OSDMonitor::encode_trim_extra(std::shared_ptr, 
> unsigned long)+0x8c) [0x717c3c]
>  6: (PaxosService::maybe_trim()+0x473) [0x707443]
>  7: (Monitor::tick()+0xa9) [0x5ecf39]
>  8: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  9: (Context::complete(int)+0x9) [0x6070d9]
>  10: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  11: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  12: (()+0x76ba) [0x7f6e67cab6ba]
>  13: (clone()+0x6d) [0x7f6e674d441d]
>
>  0> 2019-10-04 14:00:24.347 7f6e5d461700 -1 *** Caught signal (Aborted) **
>  in thread 7f6e5d461700 thread_name:safe_timer
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (()+0x11390) [0x7f6e67cb5390]
>  2: (gsignal()+0x38) [0x7f6e67402428]
>  3: (abort()+0x16a) [0x7f6e6740402a]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x1a3) [0x7f6e68eb069f]
>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, 
> char const*, ...)+0) [0x7f6e68eb0829]
>  6: (OSDMonitor::get_full_from_pinned_map(unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x80b) [0x72802b]
>  7: (OSDMonitor::get_version_full(unsigned long, unsigned long, 
> ceph::buffer::v14_2_0::list&)+0x3d2) [0x728c82]
>  8: 
> (OSDMonitor::encode_trim_extra(std::shared_ptr, 
> unsigned long)+0x8c) [0x717c3c]
>  9: (PaxosService::maybe_trim()+0x473) [0x707443]
>  10: (Monitor::tick()+0xa9) [0x5ecf39]
>  11: (C_MonContext::finish(int)+0x39) [0x5c3f29]
>  12: (Context::complete(int)+0x9) [0x6070d9]
>  13: (SafeTimer::timer_thread()+0x190) [0x7f6e68f45580]
>  14: (SafeTimerThread::entry()+0xd) [0x7f6e68f46e4d]
>  15: (()+0x76ba) [0x7f6e67cab6ba]
>  16: (clone()+0x6d) [0x7f6e674d441d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
>
> This was running fine for 2months now, it's a crashed cluster that is in 
> recovery.
>
> Any suggestions?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-04 Thread Gregory Farnum
Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
-Greg
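A couple of ways to pull those numbers, as a sketch (osd.0 just as an example id):

ceph daemon osd.0 status          # includes oldest_map / newest_map
ceph daemon osd.0 dump_mempools   # breaks out osdmap / osdmap_mapping memory usage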

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:
>
> And, just as unexpectedly, things have returned to normal overnight
> https://icecube.wisc.edu/~vbrik/graph-1.png
>
> The change seems to have coincided with the beginning of Rados Gateway
> activity (before, it was essentially zero). I can see nothing in the
> logs that would explain what happened though.
>
> Vlad
>
>
>
> On 10/2/19 3:43 PM, Vladimir Brik wrote:
> > Hello
> >
> > I am running a Ceph 14.2.2 cluster and a few days ago, memory
> > consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> > after being stable for about 6 months.
> >
> > Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> > Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >
> > I am not sure what changed to cause this. Cluster usage has been very
> > light (typically <10 iops) during this period, and the number of objects
> > stayed about the same.
> >
> > The only unusual occurrence was the reboot of one of the nodes the day
> > before (a firmware update). For the reboot, I ran "ceph osd set noout",
> > but forgot to unset it until several days later. Unsetting noout did not
> > stop the increase in memory consumption.
> >
> > I don't see anything unusual in the logs.
> >
> > Our nodes have SSDs and HDDs. Resident set size of SSD ODSs is about
> > 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> > don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> > utilized, with 101-104 PGs.
> >
> > Does anybody know what might be the problem here and how to address or
> > debug it?
> >
> >
> > Thanks very much,
> >
> > Vlad
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com