Hello!
I've been using Ceph for a long time, mostly for network CephFS storage, since 
before the Argonaut release!  It's been working very well for me.  Yes, I've had 
some power outages before and asked a few questions on this list, and they got 
resolved happily!  Thank you all!
Not sure why, but we've been having quite a few power outages lately.  Ceph 
appeared to keep running OK through them, so I was pretty happy and didn't 
think much of it... until yesterday.  When I started to move some videos to 
CephFS, Ceph decided it was full even though df showed only 54% utilization!  
Then I looked, and some of the OSDs were down (only 3 at that point)!
I am running a pretty simple Ceph configuration: one machine named MDS1 running 
the MDS and mon, and two OSD machines named OSD1 and OSD2, each with five 2TB 
HDDs and one SSD for the journal.
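In case the layout matters, the relevant bits of my ceph.conf look roughly like 
this (the fsid and address below are placeholders, not the real ones):

[global]
fsid = 00000000-0000-0000-0000-000000000000    # placeholder
mon_initial_members = MDS1
mon_host = 10.0.0.1                            # placeholder address
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

[mds.MDS1]
host = MDS1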
At the time I was running jewel 10.2.2.  I looked at some of the downed OSDs' 
log files and googled the errors... they appeared to be tied to version 10.2.2, 
so I upgraded everything to 10.2.9.  Well, that didn't solve my problems.. =P  
While I was looking into all of this, there was another power outage!  D'oh!  I 
may need to invest in a UPS or something... Until then, all of the down OSDs 
had been on OSD2, but this time OSD1 took a hit!  It couldn't boot because 
osd.0's filesystem was damaged... I ran xfs_repair -L /dev/sdb1 as the command 
line suggested, was able to mount it again (phew), rebooted... and then 
/dev/sdb1 was no longer accessible!  Noooo!!!
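For what it's worth, the repair attempt was roughly this (the mount point is 
the default jewel path, so treat the exact paths as approximate):

# on OSD1, with the osd.0 daemon stopped
xfs_repair -L /dev/sdb1
mount /dev/sdb1 /var/lib/ceph/osd/ceph-0   # worked once, then /dev/sdb1 vanished after the reboot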
So this is what I have today!  I am a bit concerned, as half of the OSDs are 
down, and osd.0 doesn't look good at all...

# ceph osd tree
ID WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 16.24478 root default
-2  8.12239     host OSD1
 1  1.95250         osd.1      up  1.00000          1.00000
 0  1.95250         osd.0    down        0          1.00000
 7  0.31239         osd.7      up  1.00000          1.00000
 6  1.95250         osd.6      up  1.00000          1.00000
 2  1.95250         osd.2      up  1.00000          1.00000
-3  8.12239     host OSD2
 3  1.95250         osd.3    down        0          1.00000
 4  1.95250         osd.4    down        0          1.00000
 5  1.95250         osd.5    down        0          1.00000
 8  1.95250         osd.8    down        0          1.00000
 9  0.31239         osd.9      up  1.00000          1.00000
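I can also pull daemon status and journal output from the down OSDs if that 
would help, e.g. something like this on OSD2 (assuming the standard systemd 
units):

systemctl status ceph-osd@3
journalctl -u ceph-osd@3 --since yesterday
# and likewise for osd.4, osd.5, osd.8, and osd.0 on OSD1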
This looked a lot better before that last extra power outage... =(  I can't 
mount osd.0's partition anymore!

# ceph health
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 44 pgs backfill_toofull; 80 pgs backfill_wait; 122 pgs degraded; 6 pgs down; 8 pgs inconsistent; 6 pgs peering; 2 pgs recovering; 18 pgs recovery_wait; 16 pgs stale; 122 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 159 pgs stuck unclean; 102 pgs stuck undersized; 102 pgs undersized; 1 requests are blocked > 32 sec; recovery 1803466/4503980 objects degraded (40.042%); recovery 692976/4503980 objects misplaced (15.386%); recovery 147/2251990 unfound (0.007%); 1 near full osd(s); 54 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set
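(On that last warning: I haven't touched the sortbitwise flag yet.  I assume 
that would be the command below, but I didn't want to flip flags mid-recovery 
without asking first.)

ceph osd set sortbitwise    # not run yet, just noting it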
Each of the down OSDs is showing a different failure signature.
I've uploaded OSD logs with debug osd = 20, debug filestore = 20, and debug ms 
= 20 (set in ceph.conf roughly as shown after the links).  You can find them at 
the links below.  Let me know if there is a preferred way to share these!

https://drive.google.com/open?id=0By7YztAJNGUWQXItNzVMR281Snc (ceph-osd.3.log)
https://drive.google.com/open?id=0By7YztAJNGUWYmJBb3RvLVdSQWc (ceph-osd.4.log)
https://drive.google.com/open?id=0By7YztAJNGUWaXhRMlFOajN6M1k (ceph-osd.5.log)
https://drive.google.com/open?id=0By7YztAJNGUWdm9BWFM5a3ExOFE (ceph-osd.8.log)
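The debug levels were bumped roughly like this in the [osd] section of 
ceph.conf on the OSD hosts before retrying the daemons:

[osd]
debug osd = 20
debug filestore = 20
debug ms = 20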
So, how does this look?  Can this be fixed? =)  If so, please let me know how.  
I used to take backups, but once the data grew too big I wasn't able to 
anymore... and I would like to get most of it back if I can.  Please let me 
know if you need more info!
Thank you!
Regards,
Hong