Hi,

we have a Ceph cluster with:
- 12 OSDs on 6 physical nodes, 64 GB RAM
- each OSD has a 6 TB spinning disk and a 10 GB journal in RAM (tmpfs) [1]
- 3 redundant copies
- 25% space usage so far
- ceph 0.94.2.
- store data via radosgw, using sharded bucket indexes (64 shards).
- 500 PGs per node, as we plan to scale the number of nodes without adding
more pools in the future (rough per-OSD arithmetic right after this list).
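
As a sanity check on the PG numbers above, this is only a rough sketch of
how I read them; it assumes that "500 PGs per node" means PG copies hosted
per node and that all pools have size 3:

    # Rough PG arithmetic; assumes "500 PGs per node" means PG copies hosted
    # per node and that every pool has size 3.
    nodes = 6
    osds_per_node = 2
    replicas = 3
    pg_copies_per_node = 500

    total_pg_copies = pg_copies_per_node * nodes                    # 3000
    logical_pgs = total_pg_copies // replicas                       # ~1000
    pg_copies_per_osd = total_pg_copies // (nodes * osds_per_node)  # ~250

    print(logical_pgs, pg_copies_per_osd)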

We currently have a constant write load: about 60 PUTs per second of small
objects, usually a few KB, occasionally up to a few MB.
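
For context, the load is plain S3 PUTs against radosgw; a minimal synthetic
equivalent looks roughly like the sketch below (endpoint, credentials and
bucket name are placeholders, not our real ones):

    # Synthetic approximation of our PUT load (placeholder endpoint,
    # credentials and bucket name).
    import os
    import time
    import boto3

    s3 = boto3.client("s3",
                      endpoint_url="http://rgw.example.local:7480",
                      aws_access_key_id="ACCESS_KEY",
                      aws_secret_access_key="SECRET_KEY")

    while True:
        body = os.urandom(4 * 1024)          # usually a few KB
        s3.put_object(Bucket="test-bucket",
                      Key="obj-%.6f" % time.time(),
                      Body=body)
        time.sleep(1.0 / 60)                 # ~60 PUTs per second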

If I restart an OSD, most operations seem to get stuck for up to several
minutes, until the OSD is done recovering.
(noout is set, but I understand that does not matter here, since the OSD is
down for less than 5 minutes anyway.)
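
To quantify the stall, something like the sketch below can be run against
the restarted OSD's admin socket (the OSD id is a placeholder and the output
field names may differ slightly between versions):

    # Poll ops in flight on one OSD via the admin socket while it rejoins,
    # to see how long requests stay blocked (osd.0 is a placeholder).
    import json
    import subprocess
    import time

    while True:
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.0", "dump_ops_in_flight"])
        ops = json.loads(out.decode()).get("ops", [])
        print(time.strftime("%H:%M:%S"), len(ops), "ops in flight")
        time.sleep(1)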

Most of the "slow operation" messages had the following reasons:
- currently waiting for rw locks
- currently waiting for missing object
- currently waiting for degraded object

and the requests themselves were:
- [call rgw.bucket_prepare_op] ... ondisk+write+known_if_redirected
- [call rgw.bucket_complete_op] ... ondisk+write+known_if_redirected

operating mostly on the bucket index shard objects.
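
For reference, the reasons can be tallied from an OSD log roughly like this;
the log path is a placeholder and the regexp assumes the usual
"slow request ... currently ..." wording:

    # Tally the "currently ..." reasons of slow requests in one OSD log
    # (log path is a placeholder; wording assumed from our 0.94 logs).
    import re
    from collections import Counter

    reasons = Counter()
    with open("/var/log/ceph/ceph-osd.3.log") as log:
        for line in log:
            m = re.search(r"slow request .*currently (.+)$", line)
            if m:
                reasons[m.group(1).strip()] += 1

    for reason, count in reasons.most_common():
        print(count, reason)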

The monitors and gateways look completely unloaded.
The OSDs, on the other hand, are under heavy disk I/O: the average disk
write completion time is around 300 ms and the disk I/O utilization is
around 50%.
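
For reference, the write completion time and utilization can be sampled
straight from /proc/diskstats; a sketch, with the device name and interval
as placeholders:

    # Sample per-interval write completion time (await) and utilization for
    # one disk from /proc/diskstats (device and interval are placeholders).
    import time

    DEV = "sdb"
    INTERVAL = 10  # seconds

    def sample(dev):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    # writes completed, ms spent writing, ms spent doing I/O
                    return int(fields[7]), int(fields[10]), int(fields[12])
        raise ValueError("device not found: %s" % dev)

    w0, wt0, io0 = sample(DEV)
    time.sleep(INTERVAL)
    w1, wt1, io1 = sample(DEV)

    writes = w1 - w0
    write_await_ms = (wt1 - wt0) / float(writes) if writes else 0.0
    util_pct = (io1 - io0) / (INTERVAL * 1000.0) * 100
    print("write await: %.1f ms, util: %.1f%%" % (write_await_ms, util_pct))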

It looks to me like the storage layer needs to be improved (a RAID
controller with a big write-back cache, maybe?).
However, I do not understand exactly what is going wrong here.
I would expect the operations to keep being served as before, writing either
to the primary or to the replicas, with the PGs recovering in the
background.
Do you have any ideas?
What path would you follow to understand what the problem is?
I am happy to provide more logs if that helps.

Thanks in advance for any help,
Ludovico

[1] We had to disable filestore_fadvise, otherwise two threads per OSD
would get stuck at 100% CPU moving pages from RAM (presumably the journal)
to swap.
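
For completeness, it can be turned off at runtime roughly like this, and
persisted under [osd] in ceph.conf as "filestore fadvise = false"; the
sketch assumes the admin keyring is available on the node:

    # Disable filestore fadvise on all OSDs at runtime (a sketch; assumes
    # admin credentials; the same option is also set in ceph.conf).
    import subprocess

    subprocess.check_call(
        ["ceph", "tell", "osd.*", "injectargs", "--filestore_fadvise=false"])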