Well, this is embarrassing.

After a week of working on this, the PG finally created last night. The only thing that changed in the past two days was that I ran ceph osd unset noscrub and ceph osd unset nodeep-scrub. I had disabled both scrubs in the hope that backfilling would finish faster.
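(For anyone searching later, toggling those flags is just the standard set/unset commands:

    ceph osd set noscrub
    ceph osd set nodeep-scrub      # this is what I had run earlier to disable scrubbing
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub    # this is what I ran to turn scrubbing back on
)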


I only had the default logging level, and the OSD logs don't show anything interesting.
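If it happens again, I'll turn up the OSD debug levels first. Something along these lines should do it (these are the usual knobs; I haven't actually run this here):

    ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'
    # and back down to the defaults afterwards:
    ceph tell osd.* injectargs '--debug-osd 0/5 --debug-ms 0/5'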


Looking at ceph.log around the time the incomplete PG went away... it's complicated. 4 of my 16 OSDs were kicked out for being unresponsive. They stayed down and out for about 4 hours, until I restarted them. At some point during that window, the incomplete PG went away, without ever going into creating.
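(Finding and restarting them was basically the following; the OSD id is a placeholder and the exact init command depends on how the daemons were deployed:

    ceph osd tree | grep down        # list which OSDs are down/out
    # then on each affected host, e.g. for a hypothetical osd.7:
    service ceph start osd.7         # sysvinit; or "start ceph-osd id=7" under upstart
)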


I've had a lot of problems with OSDs being marked down and out in this cluster. It started with some extreme OSD slowness caused by kswapd problems. I think I have that under control now. But during that time, the OSDs thrashed themselves so hard that some of them aren't stable anymore. I still have two OSDs marked out in this cluster. If I mark them in, as soon as they start backfilling they start using 100% CPU, and the other OSDs complain that they're not responding to heartbeats. So far, the 14 OSDs that are IN are remapping fine. If remapping completes, I plan to zap those two OSDs and re-add them, roughly as sketched below.
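The zap-and-re-add I have in mind is the usual remove/re-create sequence, roughly like this (osd.5 and /dev/sdX are placeholders, and the ceph-disk steps assume the OSDs were deployed with ceph-disk/ceph-deploy):

    ceph osd out 5
    # stop the daemon on its host, then remove it from the cluster:
    ceph osd crush remove osd.5
    ceph auth del osd.5
    ceph osd rm 5
    # wipe the disk and prepare it as a fresh OSD:
    ceph-disk zap /dev/sdX
    ceph-disk prepare /dev/sdX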





On 4/16/14 22:10, Gregory Farnum wrote:
Do you have any logging running on those OSDs? I'm going to need to
get somebody else to look at this, but if we could check the probe
messages being sent that might be helpful.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Apr 15, 2014 at 4:36 PM, Craig Lewis <cle...@centraldesktop.com> wrote:
http://pastebin.com/ti1VYqfr

I assume the problem is at the very end:
           "probing_osds": [
                 0,
                 2,
                 3,
                 4,
                 11,
                 13],
           "down_osds_we_would_probe": [],
           "peering_blocked_by": []},


OSDs 3, 4, and 11 have been UP and IN for hours.  OSDs 0, 2, and 13 have
been UP and IN since the problems started, but they never complete probing.
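(A quick way to double-check their up/in state, for what it's worth, is just the standard osd dump filtered to those OSDs, assuming GNU grep:

    ceph osd dump | egrep '^osd\.(0|2|3|4|11|13) '
)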




Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com


On 4/15/14 16:07, Gregory Farnum wrote:

What are the results of "ceph pg 11.483 query"?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Apr 15, 2014 at 4:01 PM, Craig Lewis <cle...@centraldesktop.com>
wrote:

I have 1 incomplete PG.  The data is gone, but I can upload it again.  I
just need to make the cluster start working so I can upload it.

I've read a bunch of mailing list posts, and found ceph pg force_create_pg.
Except it doesn't work.

I run:
root@ceph1c:/var/lib/ceph/osd# ceph pg force_create_pg 11.483
pg 11.483 now creating, ok

The incomplete PG switches to creating.  It sits in creating for a while,
then flips back to incomplete:
2014-04-15 15:06:11.876535 mon.0 [INF] pgmap v5719605: 2592 pgs: 2586
active+clean, 1 incomplete, 5 active+clean+scrubbing+deep; 15086 GB data,
27736 GB used, 28127 GB / 55864 GB avail
2014-04-15 15:06:13.899681 mon.0 [INF] pgmap v5719606: 2592 pgs: 1 creating,
2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
used, 28127 GB / 55864 GB avail
2014-04-15 15:06:14.965676 mon.0 [INF] pgmap v5719607: 2592 pgs: 1 creating,
2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
used, 28127 GB / 55864 GB avail
2014-04-15 15:06:15.995570 mon.0 [INF] pgmap v5719608: 2592 pgs: 1 creating,
2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
used, 28127 GB / 55864 GB avail
2014-04-15 15:06:17.019972 mon.0 [INF] pgmap v5719609: 2592 pgs: 1 creating,
2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
used, 28127 GB / 55864 GB avail
2014-04-15 15:06:18.048487 mon.0 [INF] pgmap v5719610: 2592 pgs: 1 creating,
2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
used, 28127 GB / 55864 GB avail
2014-04-15 15:06:19.093757 mon.0 [INF] pgmap v5719611: 2592 pgs: 2586
active+clean, 1 incomplete, 5 active+clean+scrubbing+deep; 15086 GB data,
27736 GB used, 28127 GB / 55864 GB avail
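For what it's worth, an easier way to watch just this PG flip between creating and incomplete (instead of scanning the full pgmap lines above) is something like:

    ceph -w | grep 11.483
    # or poll its one-line status:
    watch -n 5 'ceph pg dump pgs_brief | grep ^11.483'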

I'm on:
root@ceph0c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

root@ceph0c:~# uname -a
Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12 UTC
2014 x86_64 x86_64 x86_64 GNU/Linux

root@ceph0c:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.4 LTS"

--

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



