Hi all,

We are currently in the process of enlarging our bobtail cluster by adding
OSDs. We have 12 disks per machine and we create one OSD per disk, adding
them one by one as recommended. The only thing we don't do is start each
OSD with a small weight and increase it gradually; all weights are 1.
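
The gradual approach we skip would, as far as we understand it, look
something like this (osd id illustrative):

  # start the new osd at a low crush weight and step it up
  # once each round of backfill settles
  ceph osd crush reweight osd.48 0.2
  ceph osd crush reweight osd.48 0.5
  ceph osd crush reweight osd.48 1.0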

In this scenario both rbd and radosgw become unresponsive, but only for
the first two minutes after a new OSD is added. After that small hiccup we
see some pgs in states like active+remapped+wait_backfill,
active+remapped+backfilling, active+recovery_wait+remapped and
active+degraded+remapped+backfilling, and everything works fine. After a
few hours of backfilling and recovery all pgs become active+clean and we
add the next OSD.
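
We follow the progress with the usual commands, roughly:

  ceph -s                        # overall health and recovery counters
  ceph pg dump | grep backfill   # pgs still backfilling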

But sometimes that small hiccup lasts much longer than a few minutes. At
those times the status shows some pgs stuck in active and some stuck in
peering. Looking at the pg dump, all of those active or peering pgs sit on
the same two OSDs and are unable to make progress. At this stage rbd
performs poorly and radosgw stalls completely. Only after restarting one
of those two OSDs do the pgs start to backfill and clients resume their
operations.
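
We spot the stuck pgs and the two common OSDs with something like this
(osd id illustrative):

  ceph pg dump_stuck inactive    # pgs stuck peering
  ceph pg dump_stuck unclean     # pgs stuck active but not clean
  service ceph restart osd.12    # the restart that gets them moving again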

Since this is a live cluster we don't want to wait too long, so we usually
restart the OSD in a hurry. That's why I cannot currently provide status
or pg query outputs. We have some logs, but I don't know what to look for
or whether they are verbose enough.
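
Next time it happens we will try to grab the outputs first and raise
verbosity on the suspect OSDs, e.g. (osd/pg ids illustrative):

  ceph -s > status.txt                 # cluster status at the time
  ceph pg 3.14 query > pg-3.14.json    # detailed state of one stuck pg
  ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 1'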

Could this be some kind of known issue? If not, where should I look to get
an idea of what's happening when it occurs?

Thanks in advance

-- 
erdem agaoglu