If using jumbo frames, also ensure that they're consistently enabled on all OS 
instances and network devices. 
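
A quick way to spot mismatches is to compare the MTU that each host actually
reports. Below is a minimal sketch (Python, Linux sysfs assumed); run it on
every OSD and client node, e.g. via ssh or ansible, and compare the output:

#!/usr/bin/env python3
# Minimal sketch: print the MTU of every local network interface so the
# values can be compared across hosts. Assumes Linux /sys/class/net.
import pathlib

for iface in sorted(pathlib.Path("/sys/class/net").iterdir()):
    mtu_file = iface / "mtu"
    if mtu_file.exists():  # skip non-interface entries such as bonding_masters
        print(f"{iface.name}: MTU {mtu_file.read_text().strip()}")

Keep in mind that switches and routers in the path need matching (or larger)
MTUs as well, which a host-side check like this cannot see.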

> On May 16, 2024, at 09:30, Frank Schilder <fr...@dtu.dk> wrote:
> 
> This is a long shot: if you are using Octopus, you might be hit by this 
> pglog-dup problem: 
> https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't 
> mention slow peering explicitly in the blog, but it's also a consequence, 
> because the up+acting OSDs need to go through the PG log during peering.
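> 
> A rough way to check whether a PG is affected is to dump its log with 
> ceph-objectstore-tool (with the OSD stopped) and count the dup entries. A 
> minimal sketch, assuming the tool's JSON output contains a "dups" array 
> somewhere (the exact layout varies between releases, hence the recursive 
> search); the data path and pgid below are just examples:
> 
> import json
> import subprocess
> 
> OSD_PATH = "/var/lib/ceph/osd/ceph-7"   # example OSD data path
> PGID = "6.8"                            # example PG id
> 
> # Dump the PG log as JSON (the OSD must be stopped for offline access).
> out = subprocess.check_output([
>     "ceph-objectstore-tool", "--data-path", OSD_PATH,
>     "--pgid", PGID, "--op", "log"])
> doc = json.loads(out)
> 
> def count_dups(node):
>     """Recursively look for 'dups' lists anywhere in the JSON and sum them."""
>     if isinstance(node, dict):
>         if isinstance(node.get("dups"), list):
>             return len(node["dups"])
>         return sum(count_dups(v) for v in node.values())
>     if isinstance(node, list):
>         return sum(count_dups(v) for v in node)
>     return 0
> 
> print(f"pg {PGID}: {count_dups(doc)} pg_log dup entries")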
> 
> We are also using Octopus and I'm not sure we have ever seen slow ops 
> caused by peering alone. It usually happens when a disk cannot handle the 
> load during peering. We, unfortunately, have disks that show random latency 
> spikes (a firmware update is pending). You can try to monitor the I/O 
> latencies of your drives during peering and look for anything that sticks 
> out. People on this list have reported quite bad results for certain 
> infamous NVMe brands. If you state your model numbers, someone might 
> recognize them.
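> 
> For a quick look, you can sample /proc/diskstats around a peering event and 
> compute the average time per completed request, roughly what iostat's await 
> column shows. A minimal sketch (the device name is an example; field offsets 
> follow the Linux diskstats format):
> 
> import time
> 
> DEV = "nvme0n1"        # example device to watch
> INTERVAL = 1.0         # seconds between samples
> 
> def sample(dev):
>     # /proc/diskstats fields: [3] reads completed, [6] ms spent reading,
>     #                         [7] writes completed, [10] ms spent writing
>     with open("/proc/diskstats") as f:
>         for line in f:
>             parts = line.split()
>             if parts[2] == dev:
>                 ios = int(parts[3]) + int(parts[7])
>                 ms = int(parts[6]) + int(parts[10])
>                 return ios, ms
>     raise ValueError(f"device {dev} not found")
> 
> prev_ios, prev_ms = sample(DEV)
> while True:
>     time.sleep(INTERVAL)
>     ios, ms = sample(DEV)
>     d_ios, d_ms = ios - prev_ios, ms - prev_ms
>     print(f"{DEV}: {d_ios} IOs, avg latency {(d_ms / d_ios if d_ios else 0.0):.2f} ms")
>     prev_ios, prev_ms = ios, ms
> 
> Latency spikes that line up with the slow-peering timestamps point at the 
> drive rather than at Ceph itself.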
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: 서민우 <smw940...@gmail.com>
> Sent: Thursday, May 16, 2024 7:39 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Please discuss about Slow Peering
> 
> Env:
> - OS: Ubuntu 20.04
> - Ceph Version: Octopus 15.0.0.1
> - OSD Disk: 2.9TB NVMe
> - BlockStorage (Replication 3)
> 
> Symptom:
> - Peering is very slow when an OSD node comes back up. Peering speed varies
> from PG to PG, and some PGs can take as long as 10 seconds, yet nothing is
> logged during those 10 seconds.
> - I also checked the effect on client VMs: slow MySQL queries occur at the
> same time.
> 
> Below are the Ceph OSD logs for both the best and the worst case.
> 
> Best Peering Case (0.5 Seconds)
> 2024-04-11T15:32:44.693+0900 7f108b522700  1 osd.7 pg_epoch: 27368 pg[6.8]
> state<Start>: transitioning to Primary
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11],
> acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
> state<Start>: transitioning to Primary
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
> start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11],
> acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting
> 
> Worst Peering Case (11.6 Seconds)
> 2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
> state<Start>: transitioning to Stray
> 2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
> start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
> 2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
> state<Start>: transitioning to Stray
> 2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
> start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting
> 2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
> state<Start>: transitioning to Stray
> 2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
> start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
> 
> *I wish to know:*
> - Why do some PGs take 10 seconds until peering finishes?
> - Why is the Ceph log quiet during peering?
> - Is this behavior expected in Ceph?
> 
> *And please give some advice:*
> - Is there any way to improve peering speed?
> - Or is there a way to avoid affecting clients while peering occurs?
> 
> P.S.
> - I observed the same symptoms in the following environments:
> -> Octopus, Reef, cephadm, ceph-ansible
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io