Hi, I got it.

ceph health detail

HEALTH_WARN 3 pgs peering; 3 pgs stuck inactive; 5 pgs stuck unclean; recovery 64/38277874 degraded (0.000%)
pg 5.df9 is stuck inactive for 138669.746512, current state peering, last acting [87,2,151]
pg 5.a82 is stuck inactive for 138638.121867, current state peering, last acting [151,87,42]
pg 5.80d is stuck inactive for 138621.069523, current state peering, last acting [151,47,87]
pg 5.df9 is stuck unclean for 138669.746761, current state peering, last acting [87,2,151]
pg 5.ae2 is stuck unclean for 139479.810499, current state active, last acting [87,151,28]
pg 5.7b6 is stuck unclean for 139479.693271, current state active, last acting [87,105,2]
pg 5.a82 is stuck unclean for 139479.713859, current state peering, last acting [151,87,42]
pg 5.80d is stuck unclean for 139479.800820, current state peering, last acting [151,47,87]
pg 5.df9 is peering, acting [87,2,151]
pg 5.a82 is peering, acting [151,87,42]
pg 5.80d is peering, acting [151,47,87]
recovery 64/38277874 degraded (0.000%)
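(For reference, and assuming these subcommands exist in your release, the same information can be pulled without grepping the health output; the pg id below is just the first one from my listing:)

  # list only the PGs that are stuck, instead of parsing 'ceph health detail'
  ceph pg dump_stuck inactive
  ceph pg dump_stuck unclean

  # dump the full peering/recovery state of one stuck PG
  ceph pg 5.df9 query

  # show the OSD set the PG currently maps to
  ceph pg map 5.df9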
osd pg query for 5.df9:

{ "state": "peering",
  "up": [87, 2, 151],
  "acting": [87, 2, 151],
  "info": {
    "pgid": "5.df9",
    "last_update": "119454'58844953",
    "last_complete": "119454'58844953",
    "log_tail": "119454'58843952",
    "last_backfill": "MAX",
    "purged_snaps": "[]",
    "history": {
      "epoch_created": 365,
      "last_epoch_started": 119456,
      "last_epoch_clean": 119456,
      "last_epoch_split": 117806,
      "same_up_since": 119458,
      "same_interval_since": 119458,
      "same_primary_since": 119458,
      "last_scrub": "119442'58732630",
      "last_scrub_stamp": "2013-06-29 20:02:24.817352",
      "last_deep_scrub": "119271'57224023",
      "last_deep_scrub_stamp": "2013-06-23 02:04:49.654373",
      "last_clean_scrub_stamp": "2013-06-29 20:02:24.817352"},
    "stats": {
      "version": "119454'58844953",
      "reported": "119458'42382189",
      "state": "peering",
      "last_fresh": "2013-06-30 20:35:29.489826",
      "last_change": "2013-06-30 20:35:28.469854",
      "last_active": "2013-06-30 20:33:24.126599",
      "last_clean": "2013-06-30 20:33:24.126599",
      "last_unstale": "2013-06-30 20:35:29.489826",
      "mapping_epoch": 119455,
      "log_start": "119454'58843952",
      "ondisk_log_start": "119454'58843952",
      "created": 365,
      "last_epoch_clean": 365,
      "parent": "0.0",
      "parent_split_bits": 0,
      "last_scrub": "119442'58732630",
      "last_scrub_stamp": "2013-06-29 20:02:24.817352",
      "last_deep_scrub": "119271'57224023",
      "last_deep_scrub_stamp": "2013-06-23 02:04:49.654373",
      "last_clean_scrub_stamp": "2013-06-29 20:02:24.817352",
      "log_size": 135341,
      "ondisk_log_size": 135341,
      "stats_invalid": "0",
      "stat_sum": {
        "num_bytes": 1010563373,
        "num_objects": 3099,
        "num_object_clones": 0,
        "num_object_copies": 0,
        "num_objects_missing_on_primary": 0,
        "num_objects_degraded": 0,
        "num_objects_unfound": 0,
        "num_read": 302,
        "num_read_kb": 0,
        "num_write": 32264,
        "num_write_kb": 798650,
        "num_scrub_errors": 0,
        "num_objects_recovered": 8235,
        "num_bytes_recovered": 2085653757,
        "num_keys_recovered": 249061471},
      "stat_cat_sum": {},
      "up": [87, 2, 151],
      "acting": [87, 2, 151]},
    "empty": 0,
    "dne": 0,
    "incomplete": 0,
    "last_epoch_started": 119454},
  "recovery_state": [
    { "name": "Started\/Primary\/Peering\/GetLog",
      "enter_time": "2013-06-30 20:35:28.545478",
      "newest_update_osd": 2},
    { "name": "Started\/Primary\/Peering",
      "enter_time": "2013-06-30 20:35:28.469841",
      "past_intervals": [
        { "first": 119453,
          "last": 119454,
          "maybe_went_rw": 1,
          "up": [87, 2, 151],
          "acting": [87, 2, 151]},
        { "first": 119455,
          "last": 119457,
          "maybe_went_rw": 1,
          "up": [2, 151],
          "acting": [2, 151]}],
      "probing_osds": [2, 87, 151],
      "down_osds_we_would_probe": [],
      "peering_blocked_by": []},
    { "name": "Started",
      "enter_time": "2013-06-30 20:35:28.469765"}]}

For other PGs: https://www.dropbox.com/s/q5iv8lwzecioy3d/pg_query.tar.tz

--
Regards
Dominik
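(Aside, not advice given in this thread: in the query above "peering_blocked_by" is empty and the PG is sitting in Started/Primary/Peering/GetLog, so a common generic next step is to force the acting primary to re-peer and see whether the PG gets unstuck. osd.87 is taken from the query above; the restart syntax assumes a sysvinit-style install:)

  # mark the acting primary down; it re-asserts itself and the PGs re-peer,
  # without the daemon being stopped
  ceph osd down 87

  # or restart the daemon on its host
  service ceph restart osd.87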
2013/6/30 Andrey Korolyov <and...@xdel.ru>:
> That's not a loop as it looks, sorry - I had reproduced the issue many
> times and there is no such CPU-eating behavior in most cases; only the
> locked PGs are present. Also, I can report the return of the 'wrong down
> mark' bug, at least for the 0.61.4 tag. For the first issue I'll send a
> link to a core dump as soon as I am able to reproduce it on my test env;
> the second one is tied to 100% disk utilization, so I'm not sure whether
> that behavior is right or wrong.
>
> On Sat, Jun 29, 2013 at 1:28 AM, Sage Weil <s...@inktank.com> wrote:
>> On Sat, 29 Jun 2013, Andrey Korolyov wrote:
>>> There is almost the same problem with a 0.61 cluster, at least with the
>>> same symptoms. It can be reproduced quite easily: remove an OSD and then
>>> mark it as out, and with quite high probability one of its neighbors will
>>> get stuck at the end of the peering process with a couple of peering PGs
>>> whose primary copy is on it. That OSD process seems to be stuck in some
>>> kind of lock, eating exactly 100% of one core.
>>
>> Which version?
>> Can you attach with gdb and get a backtrace to see what it is chewing on?
>>
>> Thanks!
>> sage
>>
>>>
>>> On Thu, Jun 13, 2013 at 8:42 PM, Gregory Farnum <g...@inktank.com> wrote:
>>> > On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron <szi...@gmail.com> wrote:
>>> >> Hi, sorry for the late response.
>>> >>
>>> >> https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view
>>> >>
>>> >> Logs are in the attachment and on Google Drive, from today.
>>> >>
>>> >> https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view
>>> >>
>>> >> We hit this problem again today; the new logs on Google Drive carry
>>> >> today's date.
>>> >>
>>> >> Strangely, the problematic osd.71 uses about 10-15% more space than
>>> >> the other OSDs in the cluster.
>>> >>
>>> >> Today, within one hour, osd.71 failed 3 times in the mon log, and after
>>> >> the third failure recovery got stuck and many 500 errors appeared in
>>> >> the HTTP layer on top of rgw. When it is stuck, restarting osd.71,
>>> >> osd.23 and osd.108 (all from the stuck PG) helps, but I also ran a
>>> >> repair on this OSD, just in case.
>>> >>
>>> >> My theory is that the rgw object index lives on this PG, or that one of
>>> >> the OSDs in this PG has a problem with its local filesystem or the
>>> >> drive below it (the RAID controller reports nothing), but I do not see
>>> >> any problem in the system.
>>> >>
>>> >> How can we find out which PG/OSD holds the object index of an rgw
>>> >> bucket?
>>> >
>>> > You can find the location of any named object by grabbing the OSD map
>>> > from the cluster and using the osdmaptool: "osdmaptool <mapfile>
>>> > --test-map-object <objname> --pool <poolid>".
>>> >
>>> > You're not providing any context for your issue though, so we really
>>> > can't help. What symptoms are you observing?
>>> > -Greg
>>> > Software Engineer #42 @ http://inktank.com | http://ceph.com

--
Regards
Dominik

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
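(A sketch of the two diagnostics suggested in the quoted thread, assuming a standard cuttlefish-era install; the bucket index object name and pool id are placeholders, not values taken from this thread:)

  # Sage's request: attach gdb to the ceph-osd that is spinning at 100% and grab a backtrace
  gdb -p $(pidof ceph-osd)     # with several OSDs per host, substitute the PID of the busy one
  (gdb) thread apply all bt    # dump backtraces for every thread
  (gdb) detach
  (gdb) quit

  # Greg's osdmaptool suggestion, applied to an rgw bucket index object
  ceph osd getmap -o /tmp/osdmap
  osdmaptool /tmp/osdmap --test-map-object .dir.<bucket-marker> --pool <index-pool-id>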