I noticed that my RGW replication wasn't doing anything. The radosgw-agent
logs were full of:
2014-03-26T11:29:01.802 18698:ERROR:root:Could not retrieve region map from destination
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/cli.py", line 262, in main
    region_map = client.get_region_map(dest_conn)
<snip>
I tried to get a list of buckets in the slave zone, and it hangs. The
simplest operation I can run that hangs is:
radosgw-admin --name=client.radosgw.ceph1c regions list
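If it would help, I can re-run that with client-side debugging turned up. I assume something like this would show where it blocks:

radosgw-admin --name=client.radosgw.ceph1c regions list --debug-rgw=20 --debug-ms=1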
This cluster has 2 nodes. Watching ceph -w, I see 2 OSDs get kicked out
of the cluster. OSDs 4 and 8 are both flapping. An excerpt from ceph -w:
2014-03-26 11:24:33.582757 osd.9 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.838374 secs
2014-03-26 11:24:33.582762 osd.9 [WRN] slow request 30.838374 seconds old, received at 2014-03-26 11:24:02.744331: osd_op(client.41087064.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:24:35.583064 osd.9 [WRN] 2 slow requests, 1 included below; oldest blocked for > 32.838703 secs
2014-03-26 11:24:35.583068 osd.9 [WRN] slow request 30.012828 seconds old, received at 2014-03-26 11:24:05.570206: osd_op(client.41087080.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:28:10.612684 osd.9 [WRN] 4 slow requests, 1 included below; oldest blocked for > 247.868328 secs
2014-03-26 11:28:10.612692 osd.9 [WRN] slow request 240.276922 seconds old, received at 2014-03-26 11:24:10.335737: osd_op(client.41087104.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:31:08.198483 mon.0 [INF] osd.4 marked itself down
2014-03-26 11:31:08.266406 mon.0 [INF] osdmap e8271: 16 osds: 15 up, 16 in
2014-03-26 11:31:21.294695 mon.0 [INF] osd.4 10.193.0.6:6812/2215 boot
2014-03-26 11:33:55.856801 mon.0 [INF] osd.8 marked down after no pg stats for 900.157048seconds
2014-03-26 11:33:55.894864 mon.0 [INF] osdmap e8275: 16 osds: 15 up, 16 in
2014-03-26 11:47:30.952634 mon.0 [INF] osd.8 out (down for 815.058447)
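All of the blocked requests are watch ops on the notify.0 object in pool 9. To see which PG and OSDs that object maps to, I believe something like this works (I still need to look up which pool has id 9, hence the placeholder):

ceph osd dump | grep "^pool 9 "          # find the name of pool 9
ceph osd map <pool-9-name> notify.0      # shows the pgid and the up/acting OSDs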
I set the noout flag.
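For the archives, that was just:

ceph osd set noout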
This is a production slave zone. I can play with the cluster as long as
replication is off. By playing with OSDs 4 and 8, I found that I can
make the cluster work fine with OSD 4 up and OSD 8 down. If I start
OSD 8 and stop OSD 4, RGW is still hung. If I start both, RGW is still
hung.
I saw some mailing list posts where somebody had a similar issue, and
fixing the time solved it. I verified that my times are correct on both
machines.
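The clock check was nothing fancier than something like this on each node, to confirm the clocks agree and ntpd is in sync:

date
ntpq -p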
I haven't seen any XFS issues, but just to be sure, I unmounted both
devices and ran xfs_check. No errors were reported.
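Roughly, for each of the two OSDs (the mount point is the standard one; the data device is a placeholder):

stop ceph-osd id=4
umount /var/lib/ceph/osd/ceph-4
xfs_check /dev/<osd-4-data-device>    # no errors reported
mount /dev/<osd-4-data-device> /var/lib/ceph/osd/ceph-4
start ceph-osd id=4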
Since all the easy things appear to be in order, I bumped up logging on
both nodes:
[osd]
osd journal size = 6144
osd max backfills = 1
osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64
osd recovery op priority = 1
debug osd = 5
debug filestore = 5
debug journal = 1
debug monc = 20
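Since OSDs 4 and 8 were started fresh for this test, they picked the new settings up at start time. For a running OSD, I believe the same levels could be injected on the fly with something like:

ceph tell osd.4 injectargs '--debug-osd 5 --debug-filestore 5 --debug-journal 1 --debug-monc 20'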
At 14:14:00, I started OSD 4 and waited for ceph -w to stabilize. CPU
usage was normal.
At 14:15:10, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get. It
returned successfully.
At 14:16:00, I started OSD 8, and waited for ceph -w to stabilize. CPU
usage started out normal, but went to 100% before 14:16:40.
At 14:17:25, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.
regions list hung, so I killed it. At 14:18:15, I stopped ceph-osd id=8.
At 14:18:45, I ran radosgw-admin --name=client.radosgw.ceph1c regions
list && radosgw-admin --name=client.radosgw.ceph1c regionmap get. It
returned successfully.
At 14:19:10, I stopped ceph-osd id=4.
Looking at the logs, nothing jumps out at me between 14:16:00 and
14:16:40. The logs are busy, since OSD 8 is backfilling.
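If anyone wants to skim that same window, narrowing the logs down is just something like:

grep ' 14:16:' /var/log/ceph/ceph-osd.8.log | less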
I don't think it's related, but I'd like to point out that all of the
RGW pools have snapshots:
root@ceph1c:/var/log/ceph# rados lssnap -p .us-central-1.rgw.root
3 2014-03-15T02:00:02 2014.03.15 02:00:02
4 2014-03-16T02:00:01 2014.03.16 02:00:02
5 2014-03-17T02:00:01 2014.03.17 02:00:01
6 2014-03-18T02:00:01 2014.03.18 02:00:01
7 2014-03-19T02:00:01 2014.03.19 02:00:01
8 2014-03-20T02:00:01 2014.03.20 02:00:01
9 2014-03-21T02:00:01 2014.03.21 02:00:01
10 2014-03-22T02:00:01 2014.03.22 02:00:01
11 2014-03-23T02:00:02 2014.03.23 02:00:02
12 2014-03-24T02:00:02 2014.03.24 02:00:02
13 2014-03-25T02:00:02 2014.03.25 02:00:02
14 2014-03-26T02:00:01 2014.03.26 02:00:01
12 snaps
That's only in my slave zone. The master zone (which is working fine)
has no snapshots.
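For reference, that's easy to spot-check across all of the RGW pools (assuming they all start with a dot, which they do here) with something like:

for p in $(rados lspools | grep '^\.'); do echo "== $p =="; rados lssnap -p "$p"; done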
For now, I've removed the debugging, started OSD 4, and started
radosgw-agent. OSD 8 remains off.
There are 43 PGs that live on OSDs 4 and 8:
root@ceph1c:/var/log/ceph# ceph pg dump | fgrep '[4]' | cut -d"." -f1 | sort | uniq -c
36 11
2 13
1 14
1 18
1 19
1 20
1 21
root@ceph1c:/var/log/ceph# ceph osd dump | grep pool | egrep 'pool (11|13|14|18|19|20|21)'
pool 11 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 8014 owner 0 flags hashpspool
pool 13 '.us-central-1.log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8007 owner 0 flags hashpspool
pool 14 '.us-central-1.intent-log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8006 owner 0 flags hashpspool
pool 18 '.us-central-1.users.swift' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8010 owner 0 flags hashpspool
pool 19 '.us-central-1.users.uid' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8011 owner 0 flags hashpspool
pool 20 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 207 owner 0 flags hashpspool
pool 21 '.rgw.control' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 208 owner 0 flags hashpspool
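If it would help, I can also grab the full peering state for any of these PGs, along the lines of:

ceph pg dump | fgrep '[4]' | awk '{print $1}'    # the affected pgids (first column)
ceph pg <pgid> query                             # full state for one of them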
The two OSD logs from 14:14:00 to 14:19:10 can be found at
OSD 4: https://cd.centraldesktop.com/p/eAAAAAAADL0vAAAAAABMSio
OSD 8: https://cd.centraldesktop.com/p/eAAAAAAADL05AAAAAFAVWRg
Any suggestions or ideas are appreciated.
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com