[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Danni Setiawan
We had a similar issue last week. We had sluggish disks (10TB 
SAS in RAID 0 mode) in half of our nodes, which affected the performance of the cluster. 
These disks showed high CPU usage and very high latency. It turned out there is 
a *patrol read* process on the RAID card that runs automatically every 
week. When we stopped patrol read, everything was back to normal.
We are also running Ceph 14.2.11. We did not have this issue with previous 
Ceph versions and never changed the patrol read settings.
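
In case it helps anyone else: depending on the controller and tooling, the
patrol read state can usually be checked and disabled from the OS. A rough
sketch for LSI/Broadcom cards (the /c0 controller index is just an example,
adjust to your setup):

    # Check patrol read mode and schedule (storcli, controller 0 assumed)
    storcli64 /c0 show patrolread

    # Disable patrol read
    storcli64 /c0 set patrolread=off

    # Roughly equivalent with the older MegaCli tool
    MegaCli64 -AdpPR -Info -aALL
    MegaCli64 -AdpPR -Dsbl -aALL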


Thanks.

On 06/10/20 17.04, Kristof Coucke wrote:

Another strange thing is going on:

No client software is using the system any longer, so we would expect that
all IOs are related to the recovery (fixing of the degraded PG).
However, the disks that are showing high IO are not members of the PGs
that are being fixed.

So something is heavily using those disks, but I can't immediately find the
process. I've read that there can be old client processes that keep
connecting to an OSD to retrieve data for a specific PG even though that PG
is no longer available on that disk.
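
To double-check that the busy disks really have nothing to do with the
degraded PG, I'm listing the PGs mapped to such an OSD and what it is
currently working on, roughly like this (osd.42 is just a placeholder id):

    # Which PGs are mapped to this OSD?
    ceph pg ls-by-osd 42

    # What is the OSD actually doing right now?
    ceph daemon osd.42 dump_ops_in_flight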


On Tue, 6 Oct 2020 at 11:41, Kristof Coucke wrote:
Yes, some disks are spiking near 100%... The delay I see in iostat
(r_await) seems to be synchronised with the delays between the queued_for_pg
and reached_pg events.
The NVMe disks are not spiking, just the spinner disks.
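
For reference, this is roughly how I'm watching the per-device latency to
correlate it with the OSD events (the interval is just an example):

    # Extended per-device statistics every 2 seconds;
    # watch r_await/w_await and %util for the spinners
    iostat -x 2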

I know the RocksDB is only partially on the NVMe. The read-ahead is also
128 kB (at OS level) for the spinner disks. As we are dealing with smaller files,
this might also lead to decreased performance.

I'm still investigating, but I'm wondering if the system is also reading
from disk for finding the KV pairs.
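
A couple of things I'm looking at for that (device name and OSD id are just
examples):

    # Current OS-level read-ahead for a spinner, in kB
    cat /sys/block/sdb/queue/read_ahead_kb

    # BlueFS/BlueStore counters on one OSD; the bluefs read counters should
    # give a hint whether RocksDB is going to the slow device for reads
    ceph daemon osd.42 perf dump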



On Tue, 6 Oct 2020 at 11:23, Igor Fedotov wrote:


Hi Kristof,

are you seeing high (around 100%) utilization of the OSDs' disks (main or DB
ones) along with the slow ops?


Thanks,

Igor

On 10/6/2020 11:09 AM, Kristof Coucke wrote:

Hi all,

We have a Ceph cluster which has been expanded from 10 to 16 nodes.
Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
Most disks (except the NVMe's) are 16 TB.

The expansion to 16 nodes went OK, but we've configured the system to
prevent automatic rebalancing towards the new disks (the weight was set to 0) so we
could control the expansion.
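
Controlling the expansion basically means raising the CRUSH weight of each
new OSD in steps, something like the sketch below (OSD id and weights are
just examples):

    # New OSDs start at crush weight 0; raise them gradually
    ceph osd crush reweight osd.160 2.0
    # ...wait for recovery to settle, then continue towards the full weight,
    # e.g. around 14.55 for a 16 TB drive
    ceph osd crush reweight osd.160 14.55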

We started adding 6 disks last week (1 disk on each new node), which didn't
give a lot of issues.
When the Ceph status indicated the PG degradation was almost resolved, we
added 2 more disks on each node.

All seemed to go fine, till yesterday morning... IOs towards the system
were slowing down.

Diving into the nodes, we could see that the OSD daemons are consuming the
CPU power, resulting in average CPU loads going near 10 (!).

Neither the RGWs, the monitors, nor the other involved servers are having CPU issues
(except for the management server, which is fighting with Prometheus), so
the latency seems to be related to the OSD hosts.
All of the hosts are interconnected with 25 Gbit connections, and no bottlenecks
are reached on the network either.

Important piece of information: we are using erasure coding (6/3), and we
do have a lot of small files...
The current health detail indicates degraded data redundancy, with
1192911/103387889228 objects degraded (1 pg degraded, 1 pg undersized).

Diving into the historic ops of an OSD, we can see that the main latency is
found between the events "queued_for_pg" and "reached_pg" (averaging +/- 3
secs).
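
For reference, this is where those event timestamps come from (the OSD id is
just an example):

    # Recent ops with per-event timestamps, including
    # queued_for_pg and reached_pg
    ceph daemon osd.42 dump_historic_ops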

As the system load is quite high, I assume the systems are busy
recalculating the erasure code chunks for the new disks we've added (though
I'm not sure), but I was wondering how I can better fine-tune the system or
pinpoint the exact bottleneck.
Latency towards the disks doesn't seem to be an issue at first sight...
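
If it is indeed recovery/backfill pressure, my understanding is that it can
be throttled with the usual knobs, along these lines (the values are just
examples, not a recommendation):

    # Nautilus: set cluster-wide OSD options via the config database
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    ceph config set osd osd_recovery_sleep_hdd 0.1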

We are running Ceph 14.2.11

Who can give me some thoughts on how I can better pinpoint the bottleneck?

Thanks

Kristof
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS

2020-09-15 Thread Danni Setiawan

Hi all,

I'm trying to find the performance penalty for HDD OSDs when using the WAL/DB on 
a faster device (SSD/NVMe) vs the WAL/DB on the same device (HDD) for different 
workloads (RBD, RGW with the index bucket in an SSD pool, and CephFS with 
metadata in an SSD pool). I want to know if giving up a disk slot for a WAL/DB 
device is worth it vs adding more OSDs.


Unfortunately I cannot find benchmarks for these kinds of workloads. Has 
anyone ever done this kind of benchmark?
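
To be clear about what I have in mind: something simple like rados bench
against a pool backed by the HDD OSDs, once with the WAL/DB on flash and once
collocated (pool name, object size and runtime below are just placeholders):

    # Small-object writes for 60 s, keep the objects for a read test
    rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup

    # Sequential reads of the objects written above
    rados bench -p testpool 60 seq -t 16

    # Remove the benchmark objects afterwards
    rados -p testpool cleanup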


Thank you.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Benchmark WAL/DB on SSD and HDD for RGW RBD CephFS

2020-09-16 Thread Danni Setiawan
Yes, I agree that there are many knobs for fine-tuning Ceph performance. 
The problem is we don't have data on which workloads benefit most from 
the WAL/DB on SSD vs on the same spinning drive, and by how much. Does it really 
help in a cluster that is mostly for object storage/RGW? Or is it maybe just 
block storage/RBD workloads that benefit most?


IMHO, we need some cost-benefit analysis for this, because the cost of 
placing the WAL/DB on SSD is quite noticeable (multiple OSDs will fail when 
the SSD fails, and capacity is reduced).
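
For context, the layout I mean is the usual ceph-volume one where several HDD
OSDs put their block.db on partitions/LVs of a shared SSD/NVMe, e.g. (device
names are placeholders):

    # One HDD OSD with its RocksDB/WAL on a slice of a shared NVMe
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

If that NVMe dies, every OSD whose block.db lives on it is lost with it.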


Thanks.

On 16/09/20 14.45, Janne Johansson wrote:
On Wed, 16 Sep 2020 at 06:27, Danni Setiawan <danni.n.setia...@gmail.com> wrote:


Hi all,

I'm trying to find the performance penalty for HDD OSDs when using the
WAL/DB on a faster device (SSD/NVMe) vs the WAL/DB on the same device (HDD)
for different workloads (RBD, RGW with the index bucket in an SSD pool, and
CephFS with metadata in an SSD pool). I want to know if giving up a disk
slot for a WAL/DB device is worth it vs adding more OSDs.

Unfortunately I cannot find benchmarks for these kinds of workloads. Has
anyone ever done this kind of benchmark?


I think this is probably too vague and broad a question. If you ask
"will my cluster handle far more write iops if I have the WAL/DB (or journal)
on SSD/NVMe instead of on the same drive as the data", then almost everyone
will agree that yes, a flash WAL/DB will make your writes (and recoveries)
a lot quicker, since NVMe/SSD will do anything from 10x to 100x the number
of small writes per second compared to the best spinning HDDs. But how this
will affect any one single end-user experience behind S3 or CephFS, without
diving into a ton of implementation details like "how much RAM cache does
the MDS have for CephFS" and "how many RGWs and S3 streams are you using in
parallel to speed up S3/RGW operations", will be very hard to say in pure
numbers.
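
If you just want a rough feel for the device-level gap, a small sync-write
fio run on an idle HDD vs an SSD/NVMe already shows it (the target path is a
placeholder, and writing to a raw device destroys its data):

    # Small random sync writes, roughly the pattern the WAL/DB sees
    fio --name=dbtest --filename=/dev/sdX --rw=randwrite --bs=4k \
        --ioengine=libaio --iodepth=1 --direct=1 --sync=1 \
        --runtime=60 --time_based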

Also, even if flash devices are "only" used for speeding up writes, normal
clusters see a lot of mixed IO, so if writes theoretically take 0 ms, you get
a lot more free time to do reads on the HDDs, and reads can often be
accelerated with RAM caches in various places.


So, like any other storage system, if you put a flash device in front of the
spinners you will see improvements, especially for many small write ops, but
whether your use case consists of "copy these 100 10G images to this pool
every night" or "every hour we unzip the sources of a large program, checksum
the files and then clean the directory" will have a large impact on how much
flash helps your cluster.

Also, more boxes add more performance in more ways than just "more disk":
every extra CPU, every GB of RAM, every extra network port means the overall
performance of the cluster goes up by sharing the total load better. This
will not show up in simple single-threaded tests, but as you get 2-5-10-100
active clients doing IO it will be noticeable.

--
May the most significant bit of your life be positive.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io