Hello Ceph community,
Last week I was increasing the number of PGs in a pool used for RBD, in an
attempt to reach 1024 PGs (from 128 PGs). I raised pg_num in increments of
32 each time, and after the new placement groups were created I triggered
the data rebalance by setting the pgp_num parameter to match.
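For context, each increment was essentially the following pair of commands
(the pool name "rbd" and the target value 160 are only illustrative; 160
would be the first step up from 128):

  # bump pg_num by 32 and wait for the new PGs to finish creating
  ceph osd pool set rbd pg_num 160
  ceph -s          # watch until no PGs remain in the "creating" state

  # then let data rebalance onto the new PGs
  ceph osd pool set rbd pgp_num 160
  ceph -s          # watch backfill/recovery until the cluster is HEALTH_OK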
Everything was fine until the pool reached ~400 PGs. Up to 414 PGs, the
cluster interrupted client I/O for approximately 10 seconds while creating
each batch of 32 new PGs, which was acceptable for the SLA we try to meet.
Beyond 414 PGs that interruption grew longer, reaching 40 seconds, and our
virtual machines saw roughly a minute of downtime, along with hundreds of
blocked ops in the Ceph log.
I would like to understand why the client I/O interruption got longer once
the pool had more PGs. I've been unable to figure that out from the
documentation or the mailing list archives.
Some info about the cluster:
* number of OSDs: 24 (the cluster started with 6 OSDs)
* 3 OSD nodes.
* 3 monitors.
* version: Jewel 10.2.10
* OSD backend disks: HDD
* OSD journal disks: SSD
Let me know if you need any further information, and thanks in advance.
Kind regards to you all.
--
Fernando Cid O.
Ingeniero de Operaciones
AltaVoz S.A.
http://www.altavoz.net
Viña del Mar, Valparaiso:
2 Poniente 355 of 53
+56 32 276 8060
Providencia, Santiago:
Antonio Bellet 292 of 701
+56 2 585 4264