[ceph-users] Write access delay after OSD & Mon lost

2020-10-06 Thread Mathieu Dupré
Hi everybody,
Our need is to do VM failover using a disk image over RBD to avoid data
loss. We want to limit the downtime as much as possible.
We have:
- Two hypervisors, each with a Ceph Monitor and a Ceph OSD.
- A third machine with a Ceph Monitor and a Ceph Manager.
The VMs are running under QEMU. The VM disks are on a "replicated" RBD pool
formed by the two OSDs.
Ceph version: Nautilus
Distribution: Yocto Zeus
The following test is performed: we electrically turn off one hypervisor (and 
therefore a Ceph Monitor and a Ceph OSD),
which causes its VMs to switch to the second hypervisor.
My main issue is that mounting a partition read-write is very slow in the
case of a failover (after the loss of an OSD and its monitor).
With failover we can write to the device after ~25 s:
[   25.609074] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
In a normal boot we can write to the device after ~4 s:
[    3.087412] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
I wasn't able to reduce this time by tweaking Ceph settings. I am wondering if 
someone could help me on that.
Here is our configuration.
ceph.conf:
[global]
fsid = fa7a17d1-5351-459e-bf0e-07e7edc9a625
mon initial members = hypervisor1,hypervisor2,observer
mon host = 192.168.217.131,192.168.217.132,192.168.217.133
public network = 192.168.217.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
osd pool default size = 2
osd pool default min size = 1
osd crush chooseleaf type = 1
mon osd adjust heartbeat grace = false
mon osd min down reporters = 1
[mon.hypervisor1]
host = hypervisor1
mon addr = 192.168.217.131:6789
[mon.hypervisor2]
host = hypervisor2
mon addr = 192.168.217.132:6789
[mon.observer]
host = observer
mon addr = 192.168.217.133:6789
[osd.0]
host = hypervisor1
public_addr = 192.168.217.131
cluster_addr = 192.168.217.131
[osd.1]
host = hypervisor2
public_addr = 192.168.217.132
cluster_addr = 192.168.217.13
# ceph config dump
WHO     MASK  LEVEL     OPTION                            VALUE  RO
global        advanced  mon_osd_adjust_down_out_interval  false
global        advanced  mon_osd_adjust_heartbeat_grace    false
global        advanced  mon_osd_down_out_interval         5
global        advanced  mon_osd_report_timeout            4
global        advanced  osd_beacon_report_interval        1
global        advanced  osd_heartbeat_grace               2
global        advanced  osd_heartbeat_interval            1
global        advanced  osd_mon_ack_timeout               1.00
global        advanced  osd_mon_heartbeat_interval        2
global        advanced  osd_mon_report_interval           3
Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Write access delay after OSD & Mon lost

2020-10-06 Thread Marc Roos



I think I do not understand you completely. How long does a live 
migration take? If I do virsh migrate with VMs on librbd it takes a few 
seconds. I guess this is mainly caused by copying the RAM to the other 
host. 
Any additional time this takes in the case of a host failure is related to 
timeout settings, failure detection, locks being released, etc.
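
For reference, a minimal live-migration sketch (the domain and host names are
made up; it assumes the disk is on shared RBD storage so only RAM has to be
copied):

# migrate the running guest "vm1" to hypervisor2 over SSH; the RBD-backed
# disk stays where it is, so only memory pages are transferred
virsh migrate --live vm1 qemu+ssh://hypervisor2/system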

 


-Original Message-
From: Mathieu Dupré [mailto:mathieu.du...@savoirfairelinux.com] 
Sent: dinsdag 6 oktober 2020 9:40
To: ceph-users@ceph.io
Cc: Bail, Eloi
Subject: [ceph-users] Write access delay after OSD & Mon lost

Hi everybody,
Our need is to do VM failover using a disk image over RBD to avoid data 
loss. We want to limit the downtime as much as possible.
We have:
- Two hypervisors, each with a Ceph Monitor and a Ceph OSD.
- A third machine with a Ceph Monitor and a Ceph Manager.
The VMs are running under QEMU. The VM disks are on a "replicated" RBD pool 
formed by the two OSDs.
Ceph version: Nautilus
Distribution: Yocto Zeus
The following test is performed: we electrically turn off one hypervisor 
(and therefore a Ceph Monitor and a Ceph OSD), which causes its VMs to 
switch to the second hypervisor.
My main issue is that mounting a partition read-write is very slow in the 
case of a failover (after the loss of an OSD and its monitor).
With failover we can write to the device after ~25 s:
[   25.609074] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
In a normal boot we can write to the device after ~4 s:
[    3.087412] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
I wasn't able to reduce this time by tweaking Ceph settings. I am 
wondering if someone could help me on that.
Here is our configuration.
ceph.conf:
[global]
fsid = fa7a17d1-5351-459e-bf0e-07e7edc9a625
mon initial members = hypervisor1,hypervisor2,observer
mon host = 192.168.217.131,192.168.217.132,192.168.217.133
public network = 192.168.217.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
osd pool default size = 2
osd pool default min size = 1
osd crush chooseleaf type = 1
mon osd adjust heartbeat grace = false
mon osd min down reporters = 1
[mon.hypervisor1]
host = hypervisor1
mon addr = 192.168.217.131:6789
[mon.hypervisor2]
host = hypervisor2
mon addr = 192.168.217.132:6789
[mon.observer]
host = observer
mon addr = 192.168.217.133:6789
[osd.0]
host = hypervisor1
public_addr = 192.168.217.131
cluster_addr = 192.168.217.131
[osd.1]
host = hypervisor2
public_addr = 192.168.217.132
cluster_addr = 192.168.217.13
# ceph config dump
WHO     MASK  LEVEL     OPTION                            VALUE  RO
global        advanced  mon_osd_adjust_down_out_interval  false
global        advanced  mon_osd_adjust_heartbeat_grace    false
global        advanced  mon_osd_down_out_interval         5
global        advanced  mon_osd_report_timeout            4
global        advanced  osd_beacon_report_interval        1
global        advanced  osd_heartbeat_grace               2
global        advanced  osd_heartbeat_interval            1
global        advanced  osd_mon_ack_timeout               1.00
global        advanced  osd_mon_heartbeat_interval        2
global        advanced  osd_mon_report_interval           3
Thanks
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Hi all,

We have a Ceph cluster which has been expanded from 10 to 16 nodes.
Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
Most disks (except NVMe's) are 16TB large.

The expansion to 16 nodes went OK, but we configured the system to
prevent automatic rebalancing towards the new disks (their weight was set to 0)
so we could control the expansion.

We started adding 6 disks last week (1 disk on each new node), which didn't
give a lot of issues.
When the Ceph status indicated the degraded PG recovery was almost finished,
we added 2 more disks on each node.

All seemed to go fine, till yesterday morning... IOs towards the system
were slowing down.

Diving into the nodes we could see that the OSD daemons are consuming the
CPU, resulting in load averages going near 10 (!).

Neither the RGWs nor the monitors nor other involved servers are having CPU
issues (except for the management server, which is fighting with Prometheus),
so the latency seems to be related to the OSD hosts.
All of the hosts are interconnected with 25 Gbit links; no bottlenecks
are reached on the network either.

Important piece of information: we are using erasure coding (6/3), and we
do have a lot of small files...
The current health detail indicates degraded data redundancy:
1192911/103387889228 objects degraded (1 pg degraded, 1 pg undersized).

Diving into the historic ops of an OSD we can see that the main latency is
found between the event "queued_for_pg" and "reached_pg". (Averaging +/- 3
secs)
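
For reference, these timings come from the OSD admin socket; a minimal
sketch, with osd.12 as an arbitrary example ID:

# dump recent ops with their per-event timestamps (run on the OSD's host)
ceph daemon osd.12 dump_historic_ops | grep -E 'queued_for_pg|reached_pg|duration'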

As the system load is quite high, I assume the systems are busy
recalculating the erasure code chunks for the new disks we've added (though
I'm not sure), but I was wondering how I can better fine-tune the system or
pinpoint the exact bottleneck.
Latency towards the disks doesn't seem to be an issue at first sight...

We are running Ceph 14.2.11

Who can give me some thoughts on how I can better pinpoint the bottleneck?

Thanks

Kristof
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Hi Anthony,

Thnx for the reply

Average values:
User: 3.5
Idle: 78.4
Wait: 20
System: 1.2
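
(Gathered with something along these lines; the sampling interval is arbitrary:)

# CPU utilisation breakdown, three 5-second samples
iostat -c 5 3
# or per core:
mpstat -P ALL 5 3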

/K.

Op di 6 okt. 2020 om 10:18 schreef Anthony D'Atri :

>
>
> >
> > Diving onto the nodes we could see that the OSD daemons are consuming the
> > CPU power, resulting in average CPU loads going near 10 (!).
>
>
> FWIW, the load average doesn’t really tell you much on a multi-core
> system.  I’ve run 24x SSD OSD nodes with load averages routinely >30 that
> hummed along just fine.
>
> What are the percentages for user/idle/wait/system ?
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Thanks to @Anthony:

Diving further I see that I probably was blinded by the CPU load...
I see that some disks are very slow (so my first observations were
incorrect), and the latency seen using iostat seems more or less the same
as what we see in the dump_historic_ops. (+ 3s for r_await)

So, it looks like a few OSDs are causing a bottleneck in the whole system.

I'm now wondering what my options are to improve the performance... The
main goal is to use the system again, and make sure write operations are
not affected.

- Setting the weight to 0 for the slow OSDs (temporarily)? This way the recovery
can go on but new data is not written to those disks (see the sketch below)?
- 

Still investigating...
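
A minimal sketch of that first option (osd.42 is an arbitrary example; the
weight can be restored afterwards, e.g. roughly 14.55 for a 16 TB drive):

# take osd.42 out of data placement without stopping the daemon;
# existing PGs will migrate off it and no new data lands on it
ceph osd crush reweight osd.42 0
# later, put it back
ceph osd crush reweight osd.42 14.55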
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Igor Fedotov

Hi Kristof,

are you seeing high (around 100%) utilization of the OSDs' disks (main or DB 
ones) along with slow ops?



Thanks,

Igor

On 10/6/2020 11:09 AM, Kristof Coucke wrote:

Hi all,

We have a Ceph cluster which has been expanded from 10 to 16 nodes.
Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
Most disks (except NVMe's) are 16TB large.

The expansion of 16 nodes went ok, but we've configured the system to
prevent auto balance towards the new disks (weight was set to 0) so we
could control the expansion.

We started adding 6 disks last week (1 disk on each new node) which didn't
give a lot of issues.
When the Ceph status indicated the PG degraded was almost finished, we've
added 2 disks on each node again.

All seemed to go fine, till yesterday morning... IOs towards the system
were slowing down.

Diving onto the nodes we could see that the OSD daemons are consuming the
CPU power, resulting in average CPU loads going near 10 (!).

The RGWs nor monitors nor other involved servers are having CPU issues
(except for the management server which is fighting with Prometheus), so
it's latency seems to be related to the ODS hosts.
All of the hosts are interconnected with 25Gbit connections, no bottlenecks
are reached on the network either.

Important piece of information: We are using erasure coding (6/3), and we
do have a lot of small files...
The current health detail indicates degraded health redundancy where
1192911/103387889228 objects are degraded. (1 pg degraded, 1 pg undersized).

Diving into the historic ops of an OSD we can see that the main latency is
found between the event "queued_for_pg" and "reached_pg". (Averaging +/- 3
secs)

As the system load is quite high I assume the systems are busy
recalculating the code chunks for using the new disks we've added (though
not sure), but I was wondering how I can better fine tune the system or
pinpoint the exact bottle neck.
Latency towards the disks doesn't seem an issue at first sight...

We are running Ceph 14.2.11

Who can give me some thoughts on how I can better pinpoint the bottle neck?

Thanks

Kristof
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Yes, some disks are spiking near 100%... The delay I see with the iostat
(r_await) seems to be synchronised with the delays between queued_for_pg
and reached_pg events.
The NVMe disks are not spiking, just the spinner disks.

I know the RocksDB is only partially on the NVMe. The read-ahead is also
128 KB (OS level) for the spinner disks. As we are dealing with smaller files,
this might also lead to a decrease in performance.

I'm still investigating, but I'm wondering if the system is also reading
from disk for finding the KV pairs.



Op di 6 okt. 2020 om 11:23 schreef Igor Fedotov :

> Hi Kristof,
>
> are you seeing high (around 100%) OSDs' disks (main or DB ones)
> utilization along with slow  ops?
>
>
> Thanks,
>
> Igor
>
> On 10/6/2020 11:09 AM, Kristof Coucke wrote:
> > Hi all,
> >
> > We have a Ceph cluster which has been expanded from 10 to 16 nodes.
> > Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
> > Most disks (except NVMe's) are 16TB large.
> >
> > The expansion of 16 nodes went ok, but we've configured the system to
> > prevent auto balance towards the new disks (weight was set to 0) so we
> > could control the expansion.
> >
> > We started adding 6 disks last week (1 disk on each new node) which
> didn't
> > give a lot of issues.
> > When the Ceph status indicated the PG degraded was almost finished, we've
> > added 2 disks on each node again.
> >
> > All seemed to go fine, till yesterday morning... IOs towards the system
> > were slowing down.
> >
> > Diving onto the nodes we could see that the OSD daemons are consuming the
> > CPU power, resulting in average CPU loads going near 10 (!).
> >
> > The RGWs nor monitors nor other involved servers are having CPU issues
> > (except for the management server which is fighting with Prometheus), so
> > it's latency seems to be related to the ODS hosts.
> > All of the hosts are interconnected with 25Gbit connections, no
> bottlenecks
> > are reached on the network either.
> >
> > Important piece of information: We are using erasure coding (6/3), and we
> > do have a lot of small files...
> > The current health detail indicates degraded health redundancy where
> > 1192911/103387889228 objects are degraded. (1 pg degraded, 1 pg
> undersized).
> >
> > Diving into the historic ops of an OSD we can see that the main latency
> is
> > found between the event "queued_for_pg" and "reached_pg". (Averaging +/-
> 3
> > secs)
> >
> > As the system load is quite high I assume the systems are busy
> > recalculating the code chunks for using the new disks we've added (though
> > not sure), but I was wondering how I can better fine tune the system or
> > pinpoint the exact bottle neck.
> > Latency towards the disks doesn't seem an issue at first sight...
> >
> > We are running Ceph 14.2.11
> >
> > Who can give me some thoughts on how I can better pinpoint the bottle
> neck?
> >
> > Thanks
> >
> > Kristof
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Write access delay after OSD & Mon lost

2020-10-06 Thread Maged Mokhtar

If an OSD is lost, it will be detected after

osd heartbeat grace = 20   +
osd heartbeat interval = 5

i.e. 25 sec by default, which is what you see. During this time client I/O 
will block; after this the OSD is flagged as down and a new OSD map is 
issued, which the client will use to re-direct the blocked I/O.


You can lower it, but 25 sec is a reasonable failover time. The risk of 
lowering it is that down detection can become over-sensitive and get 
triggered during heavy load, which can lead to cluster instability. 
Everyone would like quicker failover, but the default values 
were set for a reason.
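
For illustration only, a sketch of lowering these centrally (the values are
arbitrary; as said above, going too low risks false down reports under load):

# lower the failure-detection window cluster-wide (illustrative values only)
ceph config set osd osd_heartbeat_interval 3
ceph config set osd osd_heartbeat_grace 10
# confirm what a given daemon is actually running with (on its host)
ceph daemon osd.0 config get osd_heartbeat_grace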


Cheers /Maged

On 06/10/2020 09:40, Mathieu Dupré wrote:

Hi everybody,
Our need is to do VM failover using a disk image over RBD to avoid data 
loss. We want to limit the downtime as much as possible.
We have:
- Two hypervisors, each with a Ceph Monitor and a Ceph OSD.
- A third machine with a Ceph Monitor and a Ceph Manager.
The VMs are running under QEMU. The VM disks are on a "replicated" RBD pool 
formed by the two OSDs.
Ceph version: Nautilus
Distribution: Yocto Zeus
The following test is performed: we electrically turn off one hypervisor (and 
therefore a Ceph Monitor and a Ceph OSD),
which causes its VMs to switch to the second hypervisor.
My main issue is that mounting a partition read-write is very slow in the 
case of a failover (after the loss of an OSD and its monitor).
With failover we can write to the device after ~25 s:
[   25.609074] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
In a normal boot we can write to the device after ~4 s:
[    3.087412] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
I wasn't able to reduce this time by tweaking Ceph settings. I am wondering if 
someone could help me on that.
Here is our configuration.
ceph.conf:
[global]
fsid = fa7a17d1-5351-459e-bf0e-07e7edc9a625
mon initial members = hypervisor1,hypervisor2,observer
mon host = 192.168.217.131,192.168.217.132,192.168.217.133
public network = 192.168.217.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
osd pool default size = 2
osd pool default min size = 1
osd crush chooseleaf type = 1
mon osd adjust heartbeat grace = false
mon osd min down reporters = 1
[mon.hypervisor1]
host = hypervisor1
mon addr = 192.168.217.131:6789
[mon.hypervisor2]
host = hypervisor2
mon addr = 192.168.217.132:6789
[mon.observer]
host = observer
mon addr = 192.168.217.133:6789
[osd.0]
host = hypervisor1
public_addr = 192.168.217.131
cluster_addr = 192.168.217.131
[osd.1]
host = hypervisor2
public_addr = 192.168.217.132
cluster_addr = 192.168.217.13
# ceph config dump
WHO     MASK  LEVEL     OPTION                            VALUE  RO
global        advanced  mon_osd_adjust_down_out_interval  false
global        advanced  mon_osd_adjust_heartbeat_grace    false
global        advanced  mon_osd_down_out_interval         5
global        advanced  mon_osd_report_timeout            4
global        advanced  osd_beacon_report_interval        1
global        advanced  osd_heartbeat_grace               2
global        advanced  osd_heartbeat_interval            1
global        advanced  osd_mon_ack_timeout               1.00
global        advanced  osd_mon_heartbeat_interval        2
global        advanced  osd_mon_report_interval           3
Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Another strange thing is going on:

No client software is using the system any longer, so we would expect that
all IOs are related to the recovery (fixing of the degraded PG).
However, the disks that are reaching high IO are not members of the PGs
that are being fixed.

So, something is heavily using the disks, but I can't immediately find the
process. I've read that there can be old client processes that keep
connecting to an OSD to retrieve data for a specific PG while that PG is
no longer available on that disk.


Op di 6 okt. 2020 om 11:41 schreef Kristof Coucke :

> Yes, some disks are spiking near 100%... The delay I see with the iostat
> (r_await) seems to be synchronised with the delays between queued_for_pg
> and reached_pg events.
> The NVMe disks are not spiking, just the spinner disks.
>
> I know the rocksdb is only partial on the NVMe. The read-ahead is also
> 128kb (os level) (for spinner disks). As we are dealing with smaller files,
> this might also lead to a decrease of the performance.
>
> I'm still investigating, but I'm wondering if the system is also reading
> from disk for finding the KV pairs.
>
>
>
> Op di 6 okt. 2020 om 11:23 schreef Igor Fedotov :
>
>> Hi Kristof,
>>
>> are you seeing high (around 100%) OSDs' disks (main or DB ones)
>> utilization along with slow  ops?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 10/6/2020 11:09 AM, Kristof Coucke wrote:
>> > Hi all,
>> >
>> > We have a Ceph cluster which has been expanded from 10 to 16 nodes.
>> > Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
>> > Most disks (except NVMe's) are 16TB large.
>> >
>> > The expansion of 16 nodes went ok, but we've configured the system to
>> > prevent auto balance towards the new disks (weight was set to 0) so we
>> > could control the expansion.
>> >
>> > We started adding 6 disks last week (1 disk on each new node) which
>> didn't
>> > give a lot of issues.
>> > When the Ceph status indicated the PG degraded was almost finished,
>> we've
>> > added 2 disks on each node again.
>> >
>> > All seemed to go fine, till yesterday morning... IOs towards the system
>> > were slowing down.
>> >
>> > Diving onto the nodes we could see that the OSD daemons are consuming
>> the
>> > CPU power, resulting in average CPU loads going near 10 (!).
>> >
>> > The RGWs nor monitors nor other involved servers are having CPU issues
>> > (except for the management server which is fighting with Prometheus), so
>> > it's latency seems to be related to the ODS hosts.
>> > All of the hosts are interconnected with 25Gbit connections, no
>> bottlenecks
>> > are reached on the network either.
>> >
>> > Important piece of information: We are using erasure coding (6/3), and
>> we
>> > do have a lot of small files...
>> > The current health detail indicates degraded health redundancy where
>> > 1192911/103387889228 objects are degraded. (1 pg degraded, 1 pg
>> undersized).
>> >
>> > Diving into the historic ops of an OSD we can see that the main latency
>> is
>> > found between the event "queued_for_pg" and "reached_pg". (Averaging
>> +/- 3
>> > secs)
>> >
>> > As the system load is quite high I assume the systems are busy
>> > recalculating the code chunks for using the new disks we've added
>> (though
>> > not sure), but I was wondering how I can better fine tune the system or
>> > pinpoint the exact bottle neck.
>> > Latency towards the disks doesn't seem an issue at first sight...
>> >
>> > We are running Ceph 14.2.11
>> >
>> > Who can give me some thoughts on how I can better pinpoint the bottle
>> neck?
>> >
>> > Thanks
>> >
>> > Kristof
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Consul as load balancer

2020-10-06 Thread Szabo, Istvan (Agoda)
Hi,


Has anybody tried Consul as a load balancer?

Any experience?


Thank you


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS user mapping

2020-10-06 Thread René Bartsch
Hi,

is there any documentation about mapping usernames, user-ids,
groupnames and group-ids between hosts sharing the same CephFS storage?

Thanx for any hint,

Renne

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Massive Mon DB Size with noout on 14.2.11

2020-10-06 Thread Marc Roos
 
Ok thanks, very clear, I am also indeed within this range.


-Original Message-
Subject: Re: [ceph-users] Re: Massive Mon DB Size with noout on 14.2.11

The important metric is the difference between these two values:

# ceph report | grep osdmap | grep committed
report 3324953770
    "osdmap_first_committed": 3441952,
    "osdmap_last_committed": 3442452,

The mon stores osdmaps on disk, and trims the older versions whenever 
the PGs are clean. Trimming brings the osdmap_first_committed to be 
closer to osdmap_last_committed.
In a cluster with no PGs backfilling or recovering, the mon should trim 
that difference to be within 500-750 epochs.

If there are any PGs backfilling or recovering, then the mon will not 
trim beyond the osdmap epoch when the pools were clean.

So if you are accumulating gigabytes of data in the mon dir, it suggests 
that you have unclean PGs/Pools.
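
A quick sketch for checking both the epoch gap and the resulting store size
(assuming the default mon data path):

# gap between first and last committed osdmap epochs
ceph report 2>/dev/null | grep -E 'osdmap_(first|last)_committed'
# on a monitor host: on-disk size of the mon store
du -sh /var/lib/ceph/mon/*/store.db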

Cheers, dan




On Fri, Oct 2, 2020 at 4:14 PM Marc Roos  
wrote:
>
>
> Does this also count if your cluster is not healthy because of errors 
> like '2 pool(s) have no replicas configured'
> I sometimes use these pools for testing, they are empty.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Igor Fedotov
I presume that this might be caused by massive KV data removal which was 
initiated after (or during) the data rebalance. We've seen multiple complaints 
about RocksDB's performance being negatively affected by pool/PG removal, and 
I expect data rebalance might suffer from the same...


You might want to run a manual DB compaction using ceph-kvstore-tool for 
every affected OSD to try to work around the issue. This would only help 
temporarily if data removal is still ongoing, though.
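
A sketch of one way to do that per OSD (the ID and data path are examples,
and the OSD has to be stopped while the offline compaction runs):

systemctl stop ceph-osd@12
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
systemctl start ceph-osd@12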



On 10/6/2020 12:41 PM, Kristof Coucke wrote:
Yes, some disks are spiking near 100%... The delay I see with the 
iostat (r_await) seems to be synchronised with the delays between 
queued_for_pg and reached_pg events.

The NVMe disks are not spiking, just the spinner disks.

I know the rocksdb is only partial on the NVMe. The read-ahead is also 
128kb (os level) (for spinner disks). As we are dealing with smaller 
files, this might also lead to a decrease of the performance.


Can you share the amount of DB data spilled over to spinners? You can 
learn this from "bluefs" section in performance counters dump...
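
For reference, a sketch of pulling that from a running OSD (osd.12 is an
arbitrary example; non-zero slow_used_bytes means DB data has spilled over
to the slow device):

ceph daemon osd.12 perf dump | grep -A 20 '"bluefs"' | grep -E 'db_used_bytes|slow_used_bytes'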



I'm still investigating, but I'm wondering if the system is also 
reading from disk for finding the KV pairs.




Op di 6 okt. 2020 om 11:23 schreef Igor Fedotov >:


Hi Kristof,

are you seeing high (around 100%) OSDs' disks (main or DB ones)
utilization along with slow  ops?


Thanks,

Igor

On 10/6/2020 11:09 AM, Kristof Coucke wrote:
> Hi all,
>
> We have a Ceph cluster which has been expanded from 10 to 16 nodes.
> Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
> Most disks (except NVMe's) are 16TB large.
>
> The expansion of 16 nodes went ok, but we've configured the
system to
> prevent auto balance towards the new disks (weight was set to 0)
so we
> could control the expansion.
>
> We started adding 6 disks last week (1 disk on each new node)
which didn't
> give a lot of issues.
> When the Ceph status indicated the PG degraded was almost
finished, we've
> added 2 disks on each node again.
>
> All seemed to go fine, till yesterday morning... IOs towards the
system
> were slowing down.
>
> Diving onto the nodes we could see that the OSD daemons are
consuming the
> CPU power, resulting in average CPU loads going near 10 (!).
>
> The RGWs nor monitors nor other involved servers are having CPU
issues
> (except for the management server which is fighting with
Prometheus), so
> it's latency seems to be related to the ODS hosts.
> All of the hosts are interconnected with 25Gbit connections, no
bottlenecks
> are reached on the network either.
>
> Important piece of information: We are using erasure coding
(6/3), and we
> do have a lot of small files...
> The current health detail indicates degraded health redundancy where
> 1192911/103387889228 objects are degraded. (1 pg degraded, 1 pg
undersized).
>
> Diving into the historic ops of an OSD we can see that the main
latency is
> found between the event "queued_for_pg" and "reached_pg".
(Averaging +/- 3
> secs)
>
> As the system load is quite high I assume the systems are busy
> recalculating the code chunks for using the new disks we've
added (though
> not sure), but I was wondering how I can better fine tune the
system or
> pinpoint the exact bottle neck.
> Latency towards the disks doesn't seem an issue at first sight...
>
> We are running Ceph 14.2.11
>
> Who can give me some thoughts on how I can better pinpoint the
bottle neck?
>
> Thanks
>
> Kristof
> ___
> ceph-users mailing list -- ceph-users@ceph.io

> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Igor Fedotov


On 10/6/2020 1:04 PM, Kristof Coucke wrote:

Another strange thing is going on:

No client software is using the system any longer, so we would expect 
that all IOs are related to the recovery (fixing of the degraded PG).
However, the disks that are reaching high IO are not a member of the 
PGs that are being fixed.


So, something is heavily using the disk, but I can't find the process 
immediately. I've read something that there can be old client 
processes that keep on connecting to an OSD for retrieving data for a 
specific PG while that PG is no longer available on that disk.




I bet it's rather PG removal happening in background


Op di 6 okt. 2020 om 11:41 schreef Kristof Coucke 
mailto:kristof.cou...@gmail.com>>:


Yes, some disks are spiking near 100%... The delay I see with the
iostat (r_await) seems to be synchronised with the delays between
queued_for_pg and reached_pg events.
The NVMe disks are not spiking, just the spinner disks.

I know the rocksdb is only partial on the NVMe. The read-ahead is
also 128kb (os level) (for spinner disks). As we are dealing with
smaller files, this might also lead to a decrease of the performance.

I'm still investigating, but I'm wondering if the system is also
reading from disk for finding the KV pairs.



Op di 6 okt. 2020 om 11:23 schreef Igor Fedotov mailto:ifedo...@suse.de>>:

Hi Kristof,

are you seeing high (around 100%) OSDs' disks (main or DB ones)
utilization along with slow  ops?


Thanks,

Igor

On 10/6/2020 11:09 AM, Kristof Coucke wrote:
> Hi all,
>
> We have a Ceph cluster which has been expanded from 10 to 16
nodes.
> Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
> Most disks (except NVMe's) are 16TB large.
>
> The expansion of 16 nodes went ok, but we've configured the
system to
> prevent auto balance towards the new disks (weight was set
to 0) so we
> could control the expansion.
>
> We started adding 6 disks last week (1 disk on each new
node) which didn't
> give a lot of issues.
> When the Ceph status indicated the PG degraded was almost
finished, we've
> added 2 disks on each node again.
>
> All seemed to go fine, till yesterday morning... IOs towards
the system
> were slowing down.
>
> Diving onto the nodes we could see that the OSD daemons are
consuming the
> CPU power, resulting in average CPU loads going near 10 (!).
>
> The RGWs nor monitors nor other involved servers are having
CPU issues
> (except for the management server which is fighting with
Prometheus), so
> it's latency seems to be related to the ODS hosts.
> All of the hosts are interconnected with 25Gbit connections,
no bottlenecks
> are reached on the network either.
>
> Important piece of information: We are using erasure coding
(6/3), and we
> do have a lot of small files...
> The current health detail indicates degraded health
redundancy where
> 1192911/103387889228 objects are degraded. (1 pg degraded, 1
pg undersized).
>
> Diving into the historic ops of an OSD we can see that the
main latency is
> found between the event "queued_for_pg" and "reached_pg".
(Averaging +/- 3
> secs)
>
> As the system load is quite high I assume the systems are busy
> recalculating the code chunks for using the new disks we've
added (though
> not sure), but I was wondering how I can better fine tune
the system or
> pinpoint the exact bottle neck.
> Latency towards the disks doesn't seem an issue at first
sight...
>
> We are running Ceph 14.2.11
>
> Who can give me some thoughts on how I can better pinpoint
the bottle neck?
>
> Thanks
>
> Kristof
> ___
> ceph-users mailing list -- ceph-users@ceph.io

> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Is there a way I can check whether this process is the one causing the
performance issues?


Op di 6 okt. 2020 om 13:05 schreef Igor Fedotov :

>
> On 10/6/2020 1:04 PM, Kristof Coucke wrote:
>
> Another strange thing is going on:
>
> No client software is using the system any longer, so we would expect that
> all IOs are related to the recovery (fixing of the degraded PG).
> However, the disks that are reaching high IO are not a member of the PGs
> that are being fixed.
>
> So, something is heavily using the disk, but I can't find the process
> immediately. I've read something that there can be old client processes
> that keep on connecting to an OSD for retrieving data for a specific PG
> while that PG is no longer available on that disk.
>
>
> I bet it's rather PG removal happening in background
>
>
> Op di 6 okt. 2020 om 11:41 schreef Kristof Coucke <
> kristof.cou...@gmail.com>:
>
>> Yes, some disks are spiking near 100%... The delay I see with the iostat
>> (r_await) seems to be synchronised with the delays between queued_for_pg
>> and reached_pg events.
>> The NVMe disks are not spiking, just the spinner disks.
>>
>> I know the rocksdb is only partial on the NVMe. The read-ahead is also
>> 128kb (os level) (for spinner disks). As we are dealing with smaller files,
>> this might also lead to a decrease of the performance.
>>
>> I'm still investigating, but I'm wondering if the system is also reading
>> from disk for finding the KV pairs.
>>
>>
>>
>> Op di 6 okt. 2020 om 11:23 schreef Igor Fedotov :
>>
>>> Hi Kristof,
>>>
>>> are you seeing high (around 100%) OSDs' disks (main or DB ones)
>>> utilization along with slow  ops?
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>> On 10/6/2020 11:09 AM, Kristof Coucke wrote:
>>> > Hi all,
>>> >
>>> > We have a Ceph cluster which has been expanded from 10 to 16 nodes.
>>> > Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
>>> > Most disks (except NVMe's) are 16TB large.
>>> >
>>> > The expansion of 16 nodes went ok, but we've configured the system to
>>> > prevent auto balance towards the new disks (weight was set to 0) so we
>>> > could control the expansion.
>>> >
>>> > We started adding 6 disks last week (1 disk on each new node) which
>>> didn't
>>> > give a lot of issues.
>>> > When the Ceph status indicated the PG degraded was almost finished,
>>> we've
>>> > added 2 disks on each node again.
>>> >
>>> > All seemed to go fine, till yesterday morning... IOs towards the system
>>> > were slowing down.
>>> >
>>> > Diving onto the nodes we could see that the OSD daemons are consuming
>>> the
>>> > CPU power, resulting in average CPU loads going near 10 (!).
>>> >
>>> > The RGWs nor monitors nor other involved servers are having CPU issues
>>> > (except for the management server which is fighting with Prometheus),
>>> so
>>> > it's latency seems to be related to the ODS hosts.
>>> > All of the hosts are interconnected with 25Gbit connections, no
>>> bottlenecks
>>> > are reached on the network either.
>>> >
>>> > Important piece of information: We are using erasure coding (6/3), and
>>> we
>>> > do have a lot of small files...
>>> > The current health detail indicates degraded health redundancy where
>>> > 1192911/103387889228 objects are degraded. (1 pg degraded, 1 pg
>>> undersized).
>>> >
>>> > Diving into the historic ops of an OSD we can see that the main
>>> latency is
>>> > found between the event "queued_for_pg" and "reached_pg". (Averaging
>>> +/- 3
>>> > secs)
>>> >
>>> > As the system load is quite high I assume the systems are busy
>>> > recalculating the code chunks for using the new disks we've added
>>> (though
>>> > not sure), but I was wondering how I can better fine tune the system or
>>> > pinpoint the exact bottle neck.
>>> > Latency towards the disks doesn't seem an issue at first sight...
>>> >
>>> > We are running Ceph 14.2.11
>>> >
>>> > Who can give me some thoughts on how I can better pinpoint the bottle
>>> neck?
>>> >
>>> > Thanks
>>> >
>>> > Kristof
>>> > ___
>>> > ceph-users mailing list -- ceph-users@ceph.io
>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS user mapping

2020-10-06 Thread 胡 玮文
Hi,

I’m facing this issue, too. I haven’t found any satisfying mapping solution so 
far. Now I’m considering deploying FreeIPA to unify the uid and gid on every 
host. CephFS does not store user names and group names in the file system as 
far as I know.

> On Oct 6, 2020, at 18:49, René Bartsch  
> wrote:
> 
> Hi,
> 
> is there any documentation about mapping usernames, user-ids,
> groupnames and group-ids between hosts sharing the same CephFS storage?
> 
> Thanx for any hint,
> 
> Renne
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS user mapping

2020-10-06 Thread thoralf schulze
hi rene,

On 10/6/20 12:48 PM, René Bartsch wrote:
> is there any documentation about mapping usernames, user-ids,
> groupnames and group-ids between hosts sharing the same CephFS storage?

i guess this is a bit outside of the scope of ceph … as with every
distributed environment, basically you'll have to make sure that
numerical uids/gids get resolved to identical user/group names on every
host involved. an appropriate name service switch configuration with a
suitable backend (pam_ldap, winbind, …) is the way to go.

hth,
t.



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS user mapping

2020-10-06 Thread Robert Sander
Hi,

On 06.10.20 12:48, René Bartsch wrote:

> is there any documentation about mapping usernames, user-ids,
> groupnames and group-ids between hosts sharing the same CephFS storage?

CephFS only records the numeric user ID and group ID in the
directory entry. It is up to the client to map those onto a user name and
group name.

What you use for consistent mappings between your CephFS clients is up
to you. It could be NIS, libnss-ldap, winbind (Active Directory) or any
other method that keeps the passwd and group files in sync.
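
In practice that looks something like this (the path and uid are made-up
examples); the numeric IDs are identical on every client, only the local
name resolution differs:

# numeric owner/group are what CephFS actually stores
ls -ln /mnt/cephfs/project/file.txt
# the names shown here depend on the local passwd/NSS configuration
ls -l /mnt/cephfs/project/file.txt
getent passwd 1001    # how uid 1001 resolves on this particular host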

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Igor Fedotov
Unfortunately, currently available Ceph releases lack any means to 
monitor KV data removal. The only way is to set debug_bluestore to 20 
(for a short period of time, e.g. 1 min) and inspect the OSD log for 
_remove/_do_remove/_omap_clear calls. Plenty of them within the 
inspected period means ongoing removals.


A weak proof of the hypothesis would be a non-zero "numpg_removing" 
performance counter...
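
A sketch of both checks on a single OSD (osd.12 and the default log path are
examples; remember to turn the debug level back down):

# quick hint: number of PGs this OSD is currently deleting
ceph daemon osd.12 perf dump | grep numpg_removing
# raise bluestore logging for about a minute, then lower it again
ceph daemon osd.12 config set debug_bluestore 20/20
sleep 60
ceph daemon osd.12 config set debug_bluestore 1/5
# count removal-related calls in the inspected window
grep -cE '_do_remove|_omap_clear' /var/log/ceph/ceph-osd.12.log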



On 10/6/2020 2:06 PM, Kristof Coucke wrote:
Is there a way that I can check if this process is causing performance 
issues?

Can I check somehow if this process is causing the issue?


Op di 6 okt. 2020 om 13:05 schreef Igor Fedotov >:



On 10/6/2020 1:04 PM, Kristof Coucke wrote:

Another strange thing is going on:

No client software is using the system any longer, so we would
expect that all IOs are related to the recovery (fixing of the
degraded PG).
However, the disks that are reaching high IO are not a member of
the PGs that are being fixed.

So, something is heavily using the disk, but I can't find the
process immediately. I've read something that there can be old
client processes that keep on connecting to an OSD for retrieving
data for a specific PG while that PG is no longer available on
that disk.



I bet it's rather PG removal happening in background



Op di 6 okt. 2020 om 11:41 schreef Kristof Coucke
mailto:kristof.cou...@gmail.com>>:

Yes, some disks are spiking near 100%... The delay I see with
the iostat (r_await) seems to be synchronised with the delays
between queued_for_pg and reached_pg events.
The NVMe disks are not spiking, just the spinner disks.

I know the rocksdb is only partial on the NVMe. The
read-ahead is also 128kb (os level) (for spinner disks). As
we are dealing with smaller files, this might also lead to a
decrease of the performance.

I'm still investigating, but I'm wondering if the system is
also reading from disk for finding the KV pairs.



Op di 6 okt. 2020 om 11:23 schreef Igor Fedotov
mailto:ifedo...@suse.de>>:

Hi Kristof,

are you seeing high (around 100%) OSDs' disks (main or DB
ones)
utilization along with slow  ops?


Thanks,

Igor

On 10/6/2020 11:09 AM, Kristof Coucke wrote:
> Hi all,
>
> We have a Ceph cluster which has been expanded from 10
to 16 nodes.
> Each node has between 14 and 16 OSDs of which 2 are
NVMe disks.
> Most disks (except NVMe's) are 16TB large.
>
> The expansion of 16 nodes went ok, but we've configured
the system to
> prevent auto balance towards the new disks (weight was
set to 0) so we
> could control the expansion.
>
> We started adding 6 disks last week (1 disk on each new
node) which didn't
> give a lot of issues.
> When the Ceph status indicated the PG degraded was
almost finished, we've
> added 2 disks on each node again.
>
> All seemed to go fine, till yesterday morning... IOs
towards the system
> were slowing down.
>
> Diving onto the nodes we could see that the OSD daemons
are consuming the
> CPU power, resulting in average CPU loads going near 10
(!).
>
> The RGWs nor monitors nor other involved servers are
having CPU issues
> (except for the management server which is fighting
with Prometheus), so
> it's latency seems to be related to the ODS hosts.
> All of the hosts are interconnected with 25Gbit
connections, no bottlenecks
> are reached on the network either.
>
> Important piece of information: We are using erasure
coding (6/3), and we
> do have a lot of small files...
> The current health detail indicates degraded health
redundancy where
> 1192911/103387889228 objects are degraded. (1 pg
degraded, 1 pg undersized).
>
> Diving into the historic ops of an OSD we can see that
the main latency is
> found between the event "queued_for_pg" and
"reached_pg". (Averaging +/- 3
> secs)
>
> As the system load is quite high I assume the systems
are busy
> recalculating the code chunks for using the new disks
we've added (though
> not sure), but I was wondering how I can better fine
tune the system or
> pinpoint the exact bottle neck.
> Latency 

[ceph-users] Re: Consul as load balancer

2020-10-06 Thread Szabo, Istvan (Agoda)
Good to know, thank you. The reason I'm thinking of Consul is that the user would 
just request the endpoint from Consul, so I could remove the single load balancer 
bottleneck and they would go directly to the RGW.


From: Janne Johansson 
Sent: Tuesday, October 6, 2020 1:52 PM
To: Szabo, Istvan (Agoda)
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Consul as load balancer


Den tis 6 okt. 2020 kl 08:37 skrev Szabo, Istvan (Agoda) 
mailto:istvan.sz...@agoda.com>>:
>
> Hi,
> Is there anybody tried consul as a load balancer?
> Any experience?

For rgw, load balancing is quite simple, and I guess almost any LB would work.
The only major thing we have hit is that for AWS4 auth, you need to make sure 
that the requests to the backends* actually use the hostname sent by the 
client, but apart from that, I can't think of a LB that can handle http(s) that 
would not work for S3/RGW.

*) some LBs, when they see a pool of backends as a list of ips like 10.1.2.1, 
10.1.2.2, 10.1.2.3 will make the inner request against http://10.1.2.3:80 
instead of http://mybucket.s3.example.com and since the request hostname is a 
part of the hash used for auth, the rgw will calculate it against 10.1.2.3 but 
the client will of course use 
mybucket.s3.example.com.
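
A quick way to see the effect against a single backend (using the made-up
addresses and names from above): send a request straight to one RGW but
present the client-facing hostname, which is what a well-behaved LB should
be forwarding:

# plain unsigned request; signed AWS4 requests additionally require that
# the Host header seen by RGW matches the name the client signed
curl -s -H 'Host: mybucket.s3.example.com' http://10.1.2.3:80/ | head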

--
May the most significant bit of your life be positive.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Janne Johansson
Den tis 6 okt. 2020 kl 11:13 skrev Kristof Coucke :

> I'm now wondering what my options are to improve the performance... The
> main goal is to use the system again, and make sure write operations are
> not affected.
> - Putting weight on 0 for the slow OSDs (temporary)? This way they recovery
> can go on but new files are not written to that disk?
> - 
>

Yes, this sounds like a good idea.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Ok, I did the compact on 1 osd.
The utilization is back to normal, so that's good... Thumbs up to you guys!
Though, one thing I want to get out of the way before adapting the other
OSDs:
When I now get the RocksDb stats, my L1, L2 and L3 are gone:

db_statistics {
"rocksdb_compaction_statistics": "",
"": "",
"": "** Compaction Stats [default] **",
"": "LevelFiles   Size Score Read(GB)  Rn(GB) Rnp1(GB)
Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec)
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
"":
"",
"": "  L0  1/0   968.45 KB   0.2  0.0 0.0  0.0
0.0  0.0   0.0   1.0  0.0105.1  0.01  0.00
10.009   0  0",
"": "  L4   1557/0   98.10 GB   0.4  0.0 0.0  0.0   0.0
 0.0   0.0   0.0  0.0  0.0  0.00  0.00
00.000   0  0",
"": " Sum   1558/0   98.10 GB   0.0  0.0 0.0  0.0   0.0
 0.0   0.0   1.0  0.0105.1  0.01  0.00
10.009   0  0",
"": " Int  0/00.00 KB   0.0  0.0 0.0  0.0   0.0
 0.0   0.0   1.0  0.0105.1  0.01  0.00
10.009   0  0",
"": "",
"": "** Compaction Stats [default] **",
"": "PriorityFiles   Size Score Read(GB)  Rn(GB) Rnp1(GB)
Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec)
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
"":
"---",
"": "User  0/00.00 KB   0.0  0.0 0.0  0.0   0.0
 0.0   0.0   0.0  0.0105.1  0.01  0.00
10.009   0  0",
"": "Uptime(secs): 0.3 total, 0.3 interval",
"": "Flush(GB): cumulative 0.001, interval 0.001",
"": "AddFile(GB): cumulative 0.000, interval 0.000",
"": "AddFile(Total Files): cumulative 0, interval 0",
"": "AddFile(L0 Files): cumulative 0, interval 0",
"": "AddFile(Keys): cumulative 0, interval 0",
"": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB
read, 0.00 MB/s read, 0.0 seconds",
"": "Interval compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB read,
0.00 MB/s read, 0.0 seconds",
"": "Stalls(count): 0 level0_slowdown, 0
level0_slowdown_with_compaction, 0 level0_numfiles, 0
level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0
slowdown for pending_compaction_bytes, 0 memtable_compaction, 0
memtable_slowdown, interval 0 total count",
"": "",
"": "** File Read Latency Histogram By Level [default] **",
"": "** Level 0 read latency histogram (micros):",
"": "Count: 5 Average: 69.2000  StdDev: 85.92",
"": "Min: 0  Median: 1.5000  Max: 201",
"": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9: 201.00
P99.99: 201.00",
"": "--",
"": "[   0,   1 ]2  40.000%  40.000% ",
"": "(   1,   2 ]1  20.000%  60.000% ",
"": "( 110, 170 ]1  20.000%  80.000% ",
"": "( 170, 250 ]1  20.000% 100.000% ",
"": "",
"": "** Level 4 read latency histogram (micros):",
"": "Count: 4664 Average: 0.6895  StdDev: 0.82",
"": "Min: 0  Median: 0.5258  Max: 27",
"": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45 P99.99:
13.83",
"": "--",
"": "[   0,   1 ] 4435  95.090%  95.090%
###",
"": "(   1,   2 ]  149   3.195%  98.285% #",
"": "(   2,   3 ]   55   1.179%  99.464% ",
"": "(   3,   4 ]   12   0.257%  99.721% ",
"": "(   4,   6 ]8   0.172%  99.893% ",
"": "(   6,  10 ]3   0.064%  99.957% ",
"": "(  10,  15 ]2   0.043% 100.000% ",
"": "(  22,  34 ]1   0.021% 100.021% ",
"": "",
"": "",
"": "** DB Stats **",
"": "Uptime(secs): 0.3 total, 0.3 interval",
"": "Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes
per commit group, ingest: 0.00 GB, 0.00 MB/s",
"": "Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written:
0.00 GB, 0.00 MB/s",
"": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent",
"": "Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per
commit group, ingest: 0.00 MB, 0.00 MB/s",
"": "Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written:
0.00 MB, 0.00 MB/s",
"": "Interval stall: 00:00:0.000 H:M:S, 0.0 percent"
}

We use the NVMe's to store the 

[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Danni Setiawan
We had a similar issue last week. We have sluggish disks (10 TB SAS in 
RAID 0 mode) in half of the nodes, which affected the performance of the 
cluster. These disks caused high CPU usage and showed very high latency. 
It turned out there was a *patrol read* process from the RAID card running 
automatically every week. When we stopped patrol read, everything was 
normal again.
We are also running Ceph 14.2.11. We didn't have this issue with the 
previous Ceph version and never changed the patrol read settings.
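
For LSI/Broadcom controllers managed with MegaCli the check looks roughly
like this (the syntax differs per vendor tool, so treat it as illustrative):

# show the current patrol read state and schedule
MegaCli64 -AdpPR -Info -aALL
# disable automatic patrol read on all adapters
MegaCli64 -AdpPR -Dsbl -aALL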


Thanks.

On 06/10/20 17.04, Kristof Coucke wrote:

Another strange thing is going on:

No client software is using the system any longer, so we would expect that
all IOs are related to the recovery (fixing of the degraded PG).
However, the disks that are reaching high IO are not a member of the PGs
that are being fixed.

So, something is heavily using the disk, but I can't find the process
immediately. I've read something that there can be old client processes
that keep on connecting to an OSD for retrieving data for a specific PG
while that PG is no longer available on that disk.


Op di 6 okt. 2020 om 11:41 schreef Kristof Coucke 
:
Yes, some disks are spiking near 100%... The delay I see with the iostat
(r_await) seems to be synchronised with the delays between queued_for_pg
and reached_pg events.
The NVMe disks are not spiking, just the spinner disks.

I know the rocksdb is only partial on the NVMe. The read-ahead is also
128kb (os level) (for spinner disks). As we are dealing with smaller files,
this might also lead to a decrease of the performance.

I'm still investigating, but I'm wondering if the system is also reading
from disk for finding the KV pairs.



Op di 6 okt. 2020 om 11:23 schreef Igor Fedotov :


Hi Kristof,

are you seeing high (around 100%) OSDs' disks (main or DB ones)
utilization along with slow  ops?


Thanks,

Igor

On 10/6/2020 11:09 AM, Kristof Coucke wrote:

Hi all,

We have a Ceph cluster which has been expanded from 10 to 16 nodes.
Each node has between 14 and 16 OSDs of which 2 are NVMe disks.
Most disks (except NVMe's) are 16TB large.

The expansion of 16 nodes went ok, but we've configured the system to
prevent auto balance towards the new disks (weight was set to 0) so we
could control the expansion.

We started adding 6 disks last week (1 disk on each new node) which

didn't

give a lot of issues.
When the Ceph status indicated the PG degraded was almost finished,

we've

added 2 disks on each node again.

All seemed to go fine, till yesterday morning... IOs towards the system
were slowing down.

Diving onto the nodes we could see that the OSD daemons are consuming

the

CPU power, resulting in average CPU loads going near 10 (!).

The RGWs nor monitors nor other involved servers are having CPU issues
(except for the management server which is fighting with Prometheus), so
it's latency seems to be related to the ODS hosts.
All of the hosts are interconnected with 25Gbit connections, no

bottlenecks

are reached on the network either.

Important piece of information: We are using erasure coding (6/3), and

we

do have a lot of small files...
The current health detail indicates degraded health redundancy where
1192911/103387889228 objects are degraded. (1 pg degraded, 1 pg

undersized).

Diving into the historic ops of an OSD we can see that the main latency

is

found between the event "queued_for_pg" and "reached_pg". (Averaging

+/- 3

secs)

As the system load is quite high I assume the systems are busy
recalculating the code chunks for using the new disks we've added

(though

not sure), but I was wondering how I can better fine tune the system or
pinpoint the exact bottle neck.
Latency towards the disks doesn't seem an issue at first sight...

We are running Ceph 14.2.11

Who can give me some thoughts on how I can better pinpoint the bottle

neck?

Thanks

Kristof
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Igor Fedotov
I've seen similar reports after manual compactions as well. But it looks 
like a presentation bug in RocksDB to me.


You can check if all the data is spilled over (as it ought to be for L4) 
in bluefs section of OSD perf counters dump...



On 10/6/2020 3:18 PM, Kristof Coucke wrote:

Ok, I did the compact on 1 osd.
The utilization is back to normal, so that's good... Thumbs up to you 
guys!
Though, one thing I want to get out of the way before adapting the 
other OSDs:

When I now get the RocksDb stats, my L1, L2 and L3 are gone:

db_statistics {
    "rocksdb_compaction_statistics": "",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) 
Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) 
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
    "": 
"",
    "": "  L0      1/0   968.45 KB   0.2      0.0     0.0  0.0       
0.0      0.0       0.0   1.0      0.0    105.1  0.01              0.00 
        1    0.009       0      0",
    "": "  L4   1557/0   98.10 GB   0.4      0.0     0.0  0.0       
0.0      0.0       0.0   0.0      0.0      0.0  0.00              0.00 
        0    0.000       0      0",
    "": " Sum   1558/0   98.10 GB   0.0      0.0     0.0  0.0       
0.0      0.0       0.0   1.0      0.0    105.1  0.01              0.00 
        1    0.009       0      0",
    "": " Int      0/0    0.00 KB   0.0      0.0     0.0  0.0       
0.0      0.0       0.0   1.0      0.0    105.1  0.01              0.00 
        1    0.009       0      0",

    "": "",
    "": "** Compaction Stats [default] **",
    "": "Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) 
Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) 
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
    "": 
"---",
    "": "User      0/0    0.00 KB   0.0      0.0     0.0  0.0       
0.0      0.0       0.0   0.0      0.0    105.1  0.01              0.00 
        1    0.009       0      0",

    "": "Uptime(secs): 0.3 total, 0.3 interval",
    "": "Flush(GB): cumulative 0.001, interval 0.001",
    "": "AddFile(GB): cumulative 0.000, interval 0.000",
    "": "AddFile(Total Files): cumulative 0, interval 0",
    "": "AddFile(L0 Files): cumulative 0, interval 0",
    "": "AddFile(Keys): cumulative 0, interval 0",
    "": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write, 0.00 
GB read, 0.00 MB/s read, 0.0 seconds",
    "": "Interval compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB 
read, 0.00 MB/s read, 0.0 seconds",
    "": "Stalls(count): 0 level0_slowdown, 0 
level0_slowdown_with_compaction, 0 level0_numfiles, 0 
level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 
0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 
memtable_slowdown, interval 0 total count",

    "": "",
    "": "** File Read Latency Histogram By Level [default] **",
    "": "** Level 0 read latency histogram (micros):",
    "": "Count: 5 Average: 69.2000  StdDev: 85.92",
    "": "Min: 0  Median: 1.5000  Max: 201",
    "": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9: 201.00 
P99.99: 201.00",

    "": "--",
    "": "[       0,       1 ]        2  40.000%  40.000% ",
    "": "(       1,       2 ]        1  20.000%  60.000% ",
    "": "(     110,     170 ]        1  20.000%  80.000% ",
    "": "(     170,     250 ]        1  20.000% 100.000% ",
    "": "",
    "": "** Level 4 read latency histogram (micros):",
    "": "Count: 4664 Average: 0.6895  StdDev: 0.82",
    "": "Min: 0  Median: 0.5258  Max: 27",
    "": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45 
P99.99: 13.83",

    "": "--",
    "": "[       0,       1 ]     4435  95.090%  95.090% 
###",

    "": "(       1,       2 ]      149   3.195%  98.285% #",
    "": "(       2,       3 ]       55   1.179%  99.464% ",
    "": "(       3,       4 ]       12   0.257%  99.721% ",
    "": "(       4,       6 ]        8   0.172%  99.893% ",
    "": "(       6,      10 ]        3   0.064%  99.957% ",
    "": "(      10,      15 ]        2   0.043% 100.000% ",
    "": "(      22,      34 ]        1   0.021% 100.021% ",
    "": "",
    "": "",
    "": "** DB Stats **",
    "": "Uptime(secs): 0.3 total, 0.3 interval",
    "": "Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 
writes per commit group, ingest: 0.00 GB, 0.00 MB/s",
    "": "Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, 
written: 0.00 GB, 0.00 MB/s",

    "": "Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent"

[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Kristof Coucke
Hi Igor and Stefan,

Everything seems okay, so we'll now create a script to automate this on all
the nodes and we will also review the monitoring possibilities.
Thanks for your help, it was a time saver.

Does anyone know if this issue is better handled in the newer versions or
if this is planned in an upcoming release?

My best regards,

Kristof

On Tue 6 Oct 2020 at 14:36, Igor Fedotov wrote:

> I've seen similar reports after manual compactions as well. But it looks
> like a presentation bug in RocksDB to me.
>
> You can check if all the data is spilled over (as it ought to be for L4)
> in bluefs section of OSD perf counters dump...
>
>
> On 10/6/2020 3:18 PM, Kristof Coucke wrote:
>
> Ok, I did the compact on 1 osd.
> The utilization is back to normal, so that's good... Thumbs up to you guys!
> Though, one thing I want to get out of the way before adapting the other
> OSDs:
> When I now get the RocksDb stats, my L1, L2 and L3 are gone:
>
> db_statistics {
> "rocksdb_compaction_statistics": "",
> "": "",
> "": "** Compaction Stats [default] **",
> "": "LevelFiles   Size Score Read(GB)  Rn(GB) Rnp1(GB)
> Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec)
> CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
> "":
> "",
> "": "  L0  1/0   968.45 KB   0.2  0.0 0.0  0.0
> 0.0  0.0   0.0   1.0  0.0105.1  0.01  0.00
> 10.009   0  0",
> "": "  L4   1557/0   98.10 GB   0.4  0.0 0.0  0.0
> 0.0  0.0   0.0   0.0  0.0  0.0  0.00  0.00
> 00.000   0  0",
> "": " Sum   1558/0   98.10 GB   0.0  0.0 0.0  0.0
> 0.0  0.0   0.0   1.0  0.0105.1  0.01  0.00
> 10.009   0  0",
> "": " Int  0/00.00 KB   0.0  0.0 0.0  0.0
> 0.0  0.0   0.0   1.0  0.0105.1  0.01  0.00
> 10.009   0  0",
> "": "",
> "": "** Compaction Stats [default] **",
> "": "PriorityFiles   Size Score Read(GB)  Rn(GB) Rnp1(GB)
> Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec)
> CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
> "":
> "---",
> "": "User  0/00.00 KB   0.0  0.0 0.0  0.0
> 0.0  0.0   0.0   0.0  0.0105.1  0.01  0.00
> 10.009   0  0",
> "": "Uptime(secs): 0.3 total, 0.3 interval",
> "": "Flush(GB): cumulative 0.001, interval 0.001",
> "": "AddFile(GB): cumulative 0.000, interval 0.000",
> "": "AddFile(Total Files): cumulative 0, interval 0",
> "": "AddFile(L0 Files): cumulative 0, interval 0",
> "": "AddFile(Keys): cumulative 0, interval 0",
> "": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB
> read, 0.00 MB/s read, 0.0 seconds",
> "": "Interval compaction: 0.00 GB write, 2.84 MB/s write, 0.00 GB
> read, 0.00 MB/s read, 0.0 seconds",
> "": "Stalls(count): 0 level0_slowdown, 0
> level0_slowdown_with_compaction, 0 level0_numfiles, 0
> level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0
> slowdown for pending_compaction_bytes, 0 memtable_compaction, 0
> memtable_slowdown, interval 0 total count",
> "": "",
> "": "** File Read Latency Histogram By Level [default] **",
> "": "** Level 0 read latency histogram (micros):",
> "": "Count: 5 Average: 69.2000  StdDev: 85.92",
> "": "Min: 0  Median: 1.5000  Max: 201",
> "": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9: 201.00
> P99.99: 201.00",
> "": "--",
> "": "[   0,   1 ]2  40.000%  40.000% ",
> "": "(   1,   2 ]1  20.000%  60.000% ",
> "": "( 110, 170 ]1  20.000%  80.000% ",
> "": "( 170, 250 ]1  20.000% 100.000% ",
> "": "",
> "": "** Level 4 read latency histogram (micros):",
> "": "Count: 4664 Average: 0.6895  StdDev: 0.82",
> "": "Min: 0  Median: 0.5258  Max: 27",
> "": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45 P99.99:
> 13.83",
> "": "--",
> "": "[   0,   1 ] 4435  95.090%  95.090%
> ###",
> "": "(   1,   2 ]  149   3.195%  98.285% #",
> "": "(   2,   3 ]   55   1.179%  99.464% ",
> "": "(   3,   4 ]   12   0.257%  99.721% ",
> "": "(   4,   6 ]8   0.172%  99.893% ",
> "": "( 

[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Igor Fedotov
I'm working on improving PG removal in master, see: 
https://github.com/ceph/ceph/pull/37496


Hopefully this will help in case of "cleanup after rebalancing" issue 
which you presumably had.



On 10/6/2020 4:24 PM, Kristof Coucke wrote:

Hi Igor and Stefan,

Everything seems okay, so we'll now create a script to automate this 
on all the nodes and we will also review the monitoring possibilities.

Thanks for your help, it was a time saver.

Does anyone know if this issue is better handled in the newer versions 
or if this is planned in an upcoming release?


My best regards,

Kristof

On Tue 6 Oct 2020 at 14:36, Igor Fedotov wrote:


I've seen similar reports after manual compactions as well. But it
looks like a presentation bug in RocksDB to me.

You can check if all the data is spilled over (as it ought to be
for L4) in bluefs section of OSD perf counters dump...


On 10/6/2020 3:18 PM, Kristof Coucke wrote:

Ok, I did the compact on 1 osd.
The utilization is back to normal, so that's good... Thumbs up to
you guys!
Though, one thing I want to get out of the way before adapting
the other OSDs:
When I now get the RocksDb stats, my L1, L2 and L3 are gone:

db_statistics {
    "rocksdb_compaction_statistics": "",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Level    Files   Size     Score Read(GB)  Rn(GB)
Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s)
Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
    "":

"",
    "": "  L0      1/0   968.45 KB   0.2      0.0 0.0      0.0  
    0.0      0.0       0.0   1.0  0.0    105.1      0.01        
     0.00         1  0.009       0      0",
    "": "  L4   1557/0   98.10 GB   0.4      0.0 0.0      0.0    
  0.0      0.0       0.0   0.0  0.0      0.0      0.00          
   0.00         0  0.000       0      0",
    "": " Sum   1558/0   98.10 GB   0.0      0.0 0.0      0.0    
  0.0      0.0       0.0   1.0  0.0    105.1      0.01          
   0.00         1  0.009       0      0",
    "": " Int      0/0    0.00 KB   0.0      0.0 0.0      0.0    
  0.0      0.0       0.0   1.0  0.0    105.1      0.01          
   0.00         1  0.009       0      0",
    "": "",
    "": "** Compaction Stats [default] **",
    "": "Priority    Files   Size     Score Read(GB)  Rn(GB)
Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s)
Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
    "":

"---",
    "": "User      0/0    0.00 KB   0.0      0.0 0.0      0.0    
  0.0      0.0       0.0   0.0  0.0    105.1      0.01          
   0.00         1  0.009       0      0",
    "": "Uptime(secs): 0.3 total, 0.3 interval",
    "": "Flush(GB): cumulative 0.001, interval 0.001",
    "": "AddFile(GB): cumulative 0.000, interval 0.000",
    "": "AddFile(Total Files): cumulative 0, interval 0",
    "": "AddFile(L0 Files): cumulative 0, interval 0",
    "": "AddFile(Keys): cumulative 0, interval 0",
    "": "Cumulative compaction: 0.00 GB write, 2.84 MB/s write,
0.00 GB read, 0.00 MB/s read, 0.0 seconds",
    "": "Interval compaction: 0.00 GB write, 2.84 MB/s write,
0.00 GB read, 0.00 MB/s read, 0.0 seconds",
    "": "Stalls(count): 0 level0_slowdown, 0
level0_slowdown_with_compaction, 0 level0_numfiles, 0
level0_numfiles_with_compaction, 0 stop for
pending_compaction_bytes, 0 slowdown for
pending_compaction_bytes, 0 memtable_compaction, 0
memtable_slowdown, interval 0 total count",
    "": "",
    "": "** File Read Latency Histogram By Level [default] **",
    "": "** Level 0 read latency histogram (micros):",
    "": "Count: 5 Average: 69.2000  StdDev: 85.92",
    "": "Min: 0  Median: 1.5000  Max: 201",
    "": "Percentiles: P50: 1.50 P75: 155.00 P99: 201.00 P99.9:
201.00 P99.99: 201.00",
    "": "--",
    "": "[       0,       1 ]        2  40.000%  40.000% ",
    "": "(       1,       2 ]        1  20.000%  60.000% ",
    "": "(     110,     170 ]        1  20.000%  80.000% ",
    "": "(     170,     250 ]        1  20.000% 100.000% ",
    "": "",
    "": "** Level 4 read latency histogram (micros):",
    "": "Count: 4664 Average: 0.6895  StdDev: 0.82",
    "": "Min: 0  Median: 0.5258  Max: 27",
    "": "Percentiles: P50: 0.53 P75: 0.79 P99: 2.61 P99.9: 6.45
P99.99: 13.83",
    "": "

[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Stefan Kooman
On 2020-10-06 13:05, Igor Fedotov wrote:
> 
> On 10/6/2020 1:04 PM, Kristof Coucke wrote:
>> Another strange thing is going on:
>>
>> No client software is using the system any longer, so we would expect
>> that all IOs are related to the recovery (fixing of the degraded PG).
>> However, the disks that are reaching high IO are not a member of the
>> PGs that are being fixed.
>>
>> So, something is heavily using the disk, but I can't find the process
>> immediately. I've read something that there can be old client
>> processes that keep on connecting to an OSD for retrieving data for a
>> specific PG while that PG is no longer available on that disk.
>>
>>
> I bet it's rather PG removal happening in background

^^ This, and probably the accompanying RocksDB housekeeping that goes with
it, as merely removing PGs shouldn't be too big a deal on its own.
Especially with very small files (and a lot of them) you probably have a
lot of OMAP / META data (ceph osd df will tell you).
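
A quick way to see that (the OMAP and META columns are the ones to watch,
assuming a recent Nautilus build):

  ceph osd df tree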

If that's indeed the case then there is a (way) quicker option to get
out of this situation: offline compacting of the OSDs. This process
happens orders of magnitude faster than when the OSDs are still online.

To check if this hypothesis is true: are the OSD servers under CPU
stress where the PGs were located previously (and not the new hosts)?

Offline compaction per host:

systemctl stop ceph-osd.target

for osd in `ls /var/lib/ceph/osd/`; do (ceph-kvstore-tool bluestore-kv
/var/lib/ceph/osd/$osd compact &);done
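
Once the compactions have finished, presumably something like:

  systemctl start ceph-osd.target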

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Stefan Kooman
On 2020-10-06 14:18, Kristof Coucke wrote:
> Ok, I did the compact on 1 osd.
> The utilization is back to normal, so that's good... Thumbs up to you guys!

We learned the hard way, but happy to spot the issue and share the info.

> Though, one thing I want to get out of the way before adapting the other
> OSDs:
> When I now get the RocksDb stats, my L1, L2 and L3 are gone:

I guess that they have all been merged now. On an OSD with L0 and L1 I see
L0 disappear after a compact. After a restart, recovery, and then dumping
stats again, it's there again. So yeah, it gets created automatically.
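
For completeness: the stats quoted in this thread come from the OSD admin
socket; a sketch, assuming the dump_objectstore_kv_stats command is available
on these BlueStore OSDs:

  ceph daemon osd.0 dump_objectstore_kv_stats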

> 
> We use the NVMe's to store the RocksDb, but with the spillover towards
> the spinning drives.
> L4 is intended to be stored on the spinning drives... 
> Will the other levels be created automatically?

Yeah pretty sure that's how RocksDB works, and tested that (see above).

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops on OSDs

2020-10-06 Thread Stefan Kooman
On 2020-10-06 15:27, Igor Fedotov wrote:
> I'm working on improving PG removal in master, see:
> https://github.com/ceph/ceph/pull/37496
> 
> Hopefully this will help in case of "cleanup after rebalancing" issue
> which you presumably had.

That would be great. Does the offline compaction with the
ceph-kvstore-tool follow a completely different removal procedure, and is
that why it's that much faster?

@Kristof:
https://github.com/ceph/ceph/commit/93e4c56ecc13560e0dad69aaa67afc3ca053fb4c
is a commit by Wido that would help to enable compaction at OSD (re)start.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph iSCSI Performance

2020-10-06 Thread DHilsbos
Mark;

Are you suggesting some other means to configure iSCSI targets with Ceph?

If so, how do I configure for non-tcmu?

The iSCSI clients are not RBD aware, and I can't really make them RBD aware.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: Monday, October 5, 2020 3:40 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph iSCSI Performance

I don't have super recent results, but we do have some test data from 
last year looking at kernel rbd, rbd-nbd, rbd+tcmu, fuse, etc:


https://docs.google.com/spreadsheets/d/1oJZ036QDbJQgv2gXts1oKKhMOKXrOI2XLTkvlsl9bUs/edit?usp=sharing


Generally speaking going through the tcmu layer was slower than kernel 
rbd or librbd directly (sometimes by quite a bit!).  There was also more 
client side CPU usage per unit performance as well (which makes sense 
since there's additional work being done).  You may be able to get some 
of that performance back with more clients as I do remember there being 
some issues with iodepth and tcmu. The only setup that I remember being 
slower at the time though was rbd-fuse which I don't think is even 
really maintained.


Mark


On 10/5/20 4:43 PM, dhils...@performair.com wrote:
> All;
>
> I've finally gotten around to setting up iSCSI gateways on my primary 
> production cluster, and performance is terrible.
>
> We're talking 1/4 to 1/3 of our current solution.
>
> I see no evidence of network congestion on any involved network link.  I see 
> no evidence CPU or memory being a problem on any involved server (MON / OSD / 
> gateway /client).
>
> What can I look at to tune this, preferably on the iSCSI gateways?
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International, Inc.
> dhils...@performair.com
> www.PerformAir.com
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph iSCSI Performance

2020-10-06 Thread Mark Nelson

Hi Dominic,


If you can't use kernel rbd I think you'll probably have to deal with 
the higher overhead and lower performance with the tcmu solution.  It's 
possible there might be some things you can tweak at the tcmu layer that 
will improve things, but when I looked at it there simply seemed to be a 
lot of extra work being done to do the translation. YMMV.



Mark


On 10/6/20 12:49 PM, dhils...@performair.com wrote:

Mark;

Are you suggesting some other means to configure iSCSI targets with Ceph?

If so, how do configure for non-tcmu?

The iSCSI clients are not RBD aware, and I can't really make them RBD aware.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com



-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Monday, October 5, 2020 3:40 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph iSCSI Performance

I don't have super recent results, but we do have some test data from
last year looking at kernel rbd, rbd-nbd, rbd+tcmu, fuse, etc:


https://docs.google.com/spreadsheets/d/1oJZ036QDbJQgv2gXts1oKKhMOKXrOI2XLTkvlsl9bUs/edit?usp=sharing


Generally speaking going through the tcmu layer was slower than kernel
rbd or librbd directly (sometimes by quite a bit!).  There was also more
client side CPU usage per unit performance as well (which makes sense
since there's additional work being done).  You may be able to get some
of that performance back with more clients as I do remember there being
some issues with iodepth and tcmu. The only setup that I remember being
slower at the time though was rbd-fuse which I don't think is even
really maintained.


Mark


On 10/5/20 4:43 PM, dhils...@performair.com wrote:

All;

I've finally gotten around to setting up iSCSI gateways on my primary 
production cluster, and performance is terrible.

We're talking 1/4 to 1/3 of our current solution.

I see no evidence of network congestion on any involved network link.  I see no 
evidence CPU or memory being a problem on any involved server (MON / OSD / 
gateway /client).

What can I look at to tune this, preferably on the iSCSI gateways?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International, Inc.
dhils...@performair.com
www.PerformAir.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph iSCSI Performance

2020-10-06 Thread Mark Nelson
To be honest I don't really remember, those tests were from a while ago. 
:)  I'm guessing I probably was getting higher throughput with 32 vs 16 
in some of the test cases but didn't need to go up to 64 at that time.  
This was all before various work we've done in bluestore over the past 
year that's improved performance quite a bit in our rbd tests.
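
For context, the fio invocations for these librbd tests look roughly like the
following (pool/image names and runtime are illustrative, not the exact
parameters from those runs):

  fio --name=rbd-randwrite \
      --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
      --runtime=60 --time_based --group_reporting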



Mark


On 10/5/20 6:08 PM, Tecnología CHARNE.NET wrote:

Mark, Why do you use io_depth=32 in fio parameters?

Is there any reason for not choose 16 or 64?

Thanks in advance!



I don't have super recent results, but we do have some test data from last year 
looking at kernel rbd, rbd-nbd, rbd+tcmu, fuse, etc:


https://docs.google.com/spreadsheets/d/1oJZ036QDbJQgv2gXts1oKKhMOKXrOI2XLTkvlsl9bUs/edit?usp=sharing


Generally speaking going through the tcmu layer was slower than kernel rbd or 
librbd directly (sometimes by quite a bit!).  There was also more client side 
CPU usage per unit performance as well (which makes sense since there's 
additional work being done).  You may be able to get some of that performance 
back with more clients as I do remember there being some issues with iodepth 
and tcmu. The only setup that I remember being slower at the time though was 
rbd-fuse which I don't think is even really maintained.


Mark


On 10/5/20 4:43 PM, dhils...@performair.com wrote:

All;

I've finally gotten around to setting up iSCSI gateways on my primary 
production cluster, and performance is terrible.

We're talking 1/4 to 1/3 of our current solution.

I see no evidence of network congestion on any involved network link.  I see no 
evidence CPU or memory being a problem on any involved server (MON / OSD / 
gateway /client).

What can I look at to tune this, preferably on the iSCSI gateways?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International, Inc.
dhils...@performair.com
www.PerformAir.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pool pgp_num not updated

2020-10-06 Thread Mac Wynkoop
Hi everyone,

I'm seeing a similar issue here. Any ideas on this?
Mac Wynkoop,



On Sun, Sep 6, 2020 at 11:09 PM norman  wrote:

> Hi guys,
>
> When I update the pg_num of a pool, I found it not worked(no
> rebalanced), anyone know the reason? Pool's info:
>
> pool 21 'openstack-volumes-rs' replicated size 3 min_size 2 crush_rule
> 21 object_hash rjenkins pg_num 1024 pgp_num 512 pgp_num_target 1024
> autoscale_mode warn last_change 85103 lfor 82044/82044/82044 flags
> hashpspool,nodelete,selfmanaged_snaps stripe_width 0 application rbd
>  removed_snaps [1~1e6,1e8~300,4e9~18,502~3f,542~11,554~1a,56f~1d7]
> pool 22 'openstack-vms-rs' replicated size 3 min_size 2 crush_rule 22
> object_hash rjenkins pg_num 512 pgp_num 512 pg_num_target 256
> pgp_num_target 256 autoscale_mode warn last_change 84769 lfor 0/0/55294
> flags hashpspool,nodelete,selfmanaged_snaps stripe_width 0 application rbd
>
> The pgp_num_target is set, but pgp_num not set.
>
> I have scale out new OSDs and is backfilling before setting the value,
> is it the reason?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pool pgp_num not updated

2020-10-06 Thread Marc Roos
pg_num and pgp_num need to be the same, not?

3.5.1. Set the Number of PGs

To set the number of placement groups in a pool, you must specify the 
number of placement groups at the time you create the pool. See Create a 
Pool for details. Once you set placement groups for a pool, you can 
increase the number of placement groups (but you cannot decrease the 
number of placement groups). To increase the number of placement groups, 
execute the following:

ceph osd pool set {pool-name} pg_num {pg_num}

Once you increase the number of placement groups, you must also increase 
the number of placement groups for placement (pgp_num) before your 
cluster will rebalance. The pgp_num should be equal to the pg_num. To 
increase the number of placement groups for placement, execute the 
following:

ceph osd pool set {pool-name} pgp_num {pgp_num}

https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs
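
A quick way to check where a pool currently stands (the pool name is just an
example):

  ceph osd pool get rbd pg_num
  ceph osd pool get rbd pgp_num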

-Original Message-
To: norman
Cc: ceph-users
Subject: [ceph-users] Re: pool pgp_num not updated

Hi everyone,

I'm seeing a similar issue here. Any ideas on this?
Mac Wynkoop,



On Sun, Sep 6, 2020 at 11:09 PM norman  wrote:

> Hi guys,
>
> When I update the pg_num of a pool, I found it not worked(no 
> rebalanced), anyone know the reason? Pool's info:
>
> pool 21 'openstack-volumes-rs' replicated size 3 min_size 2 crush_rule
> 21 object_hash rjenkins pg_num 1024 pgp_num 512 pgp_num_target 1024 
> autoscale_mode warn last_change 85103 lfor 82044/82044/82044 flags 
> hashpspool,nodelete,selfmanaged_snaps stripe_width 0 application rbd
>  removed_snaps 
> [1~1e6,1e8~300,4e9~18,502~3f,542~11,554~1a,56f~1d7]
> pool 22 'openstack-vms-rs' replicated size 3 min_size 2 crush_rule 22 
> object_hash rjenkins pg_num 512 pgp_num 512 pg_num_target 256 
> pgp_num_target 256 autoscale_mode warn last_change 84769 lfor 
> 0/0/55294 flags hashpspool,nodelete,selfmanaged_snaps stripe_width 0 
> application rbd
>
> The pgp_num_target is set, but pgp_num not set.
>
> I have scale out new OSDs and is backfilling before setting the value, 

> is it the reason?
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph iSCSI Performance

2020-10-06 Thread Maged Mokhtar


You can try PetaSAN (www.petasan.org); we use the rbd backend by SUSE. It
works out of the box.


/Maged

On 06/10/2020 19:49, dhils...@performair.com wrote:

Mark;

Are you suggesting some other means to configure iSCSI targets with Ceph?

If so, how do configure for non-tcmu?

The iSCSI clients are not RBD aware, and I can't really make them RBD aware.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com



-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Monday, October 5, 2020 3:40 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph iSCSI Performance

I don't have super recent results, but we do have some test data from
last year looking at kernel rbd, rbd-nbd, rbd+tcmu, fuse, etc:


https://docs.google.com/spreadsheets/d/1oJZ036QDbJQgv2gXts1oKKhMOKXrOI2XLTkvlsl9bUs/edit?usp=sharing


Generally speaking going through the tcmu layer was slower than kernel
rbd or librbd directly (sometimes by quite a bit!).  There was also more
client side CPU usage per unit performance as well (which makes sense
since there's additional work being done).  You may be able to get some
of that performance back with more clients as I do remember there being
some issues with iodepth and tcmu. The only setup that I remember being
slower at the time though was rbd-fuse which I don't think is even
really maintained.


Mark


On 10/5/20 4:43 PM, dhils...@performair.com wrote:

All;

I've finally gotten around to setting up iSCSI gateways on my primary 
production cluster, and performance is terrible.

We're talking 1/4 to 1/3 of our current solution.

I see no evidence of network congestion on any involved network link.  I see no 
evidence CPU or memory being a problem on any involved server (MON / OSD / 
gateway /client).

What can I look at to tune this, preferably on the iSCSI gateways?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International, Inc.
dhils...@performair.com
www.PerformAir.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Wipe an Octopus install

2020-10-06 Thread Samuel Taylor Liston
Wondering if anyone knows or has put together a way to wipe an Octopus 
install?  I’ve looked for documentation on the process, but if it exists, I 
haven’t found it yet.  I’m going through some test installs - working through 
the ins and outs of cephadm and containers and would love an easy way to tear 
things down and start over.
In previous releases managed through ceph-deploy there were three very 
convenient commands that nuked the world.  I am looking for something as 
complete for Octopus.
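
For reference, the ceph-deploy commands I'm alluding to were the
purge/purgedata/forgetkeys trio; the closest cephadm analogue might be
something like rm-cluster, though I haven't verified that. Treat the lines
below as a sketch rather than a tested recipe (hostname and fsid are
placeholders):

  # ceph-deploy era, per host:
  ceph-deploy purge <hostname>
  ceph-deploy purgedata <hostname>
  ceph-deploy forgetkeys

  # cephadm (Octopus): remove everything for a given cluster fsid on a host
  cephadm rm-cluster --fsid <fsid> --force
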
Thanks,
 
Sam Liston (sam.lis...@utah.edu)
==
Center for High Performance Computing - Univ. of Utah
155 S. 1452 E. Rm 405
Salt Lake City, Utah 84112 (801)232-6932
==



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io