Re: [ceph-users] Power outages!!! help!

hjcho616 Fri, 01 Sep 2017 21:10:56 -0700

Just realized there is a file called superblock in the ceph directory.  The 
superblock files of ceph-1 and ceph-2 are identical, and those of ceph-6 and 
ceph-7 are identical, but the two groups differ from each other.  When I 
originally created the OSDs, I created ceph-0 through 5.  Can the superblock 
file be copied over from ceph-1 to ceph-0?
Hmm.. it appears to be doing something in the background even though osd.0 is 
down.  ceph health output is changing!

# ceph health
HEALTH_ERR 40 pgs are stuck inactive for more than 300 seconds; 14 pgs 
backfill_wait; 21 pgs degraded; 10 pgs down; 2 pgs inconsistent; 10 pgs 
peering; 3 pgs recovering; 2 pgs recovery_wait; 30 pgs stale; 21 pgs stuck 
degraded; 10 pgs stuck inactive; 30 pgs stuck stale; 45 pgs stuck unclean; 16 
pgs stuck undersized; 16 pgs undersized; 2 requests are blocked > 32 sec; 
recovery 221826/2473662 objects degraded (8.968%); recovery 254711/2473662 
objects misplaced (10.297%); recovery 103/2251966 unfound (0.005%); 7 scrub 
errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag 
is not set

Regards,
Hong
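
Since the question here is whether the health output is changing, one way to see what moved is to compare two `ceph health` summaries state by state. The following is only a minimal sketch: the helper names `parse_health` and `changed` are made up for illustration, and it assumes the one-line summary format pasted in this thread (semicolon-separated items of the form "N pgs <state>"). It only looks at the PG counts and ignores the recovery, request and scrub items.

```python
import re

def parse_health(summary):
    """Return (status, {pg_state: count}) from a one-line `ceph health` summary."""
    text = " ".join(summary.split())      # undo mail line wrapping
    status, _, rest = text.partition(" ")
    pgs = {}
    for item in rest.split(";"):
        # Matches e.g. "14 pgs backfill_wait" or "40 pgs are stuck inactive ..."
        m = re.match(r"(\d+) ?pgs (?:are )?(.+)$", item.strip())
        if m:
            pgs[m.group(2)] = int(m.group(1))
    return status, pgs

def changed(old, new):
    """PG states whose count differs between two parsed summaries."""
    return {state: (old.get(state, 0), new.get(state, 0))
            for state in sorted(set(old) | set(new))
            if old.get(state, 0) != new.get(state, 0)}

if __name__ == "__main__":
    # Shortened excerpts of the two summaries posted in this thread.
    _, before = parse_health("HEALTH_ERR 6 pgs down; 108 pgs degraded; 16 pgs stale")
    _, after = parse_health("HEALTH_ERR 10 pgs down; 21 pgs degraded; 30 pgs stale")
    for state, (was, now) in changed(before, after).items():
        print(f"{state}: {was} -> {now}")
```

Feeding it the full summaries from 28 August and 1 September would list every PG state whose count moved between the two posts.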


    On Friday, September 1, 2017 10:37 PM, hjcho616 <[email protected]> wrote:
 

 Tried connecting recovered osd.  Looks like some of the files in the 
lost+found are super blocks.  Below is the log.  What can I do about this?

2017-09-01 22:27:27.634228 7f68837e5800  0 set uid:gid to 1001:1001 (ceph:ceph)
2017-09-01 22:27:27.634245 7f68837e5800  0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 5432
2017-09-01 22:27:27.635456 7f68837e5800  0 pidfile_write: ignore empty --pid-file
2017-09-01 22:27:27.646849 7f68837e5800  0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
2017-09-01 22:27:27.647077 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-09-01 22:27:27.647080 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-09-01 22:27:27.647091 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice is supported
2017-09-01 22:27:27.678937 7f68837e5800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-09-01 22:27:27.679044 7f68837e5800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf
2017-09-01 22:27:27.680718 7f68837e5800  1 leveldb: Recovering log #28054
2017-09-01 22:27:27.804501 7f68837e5800  1 leveldb: Delete type=0 #28054
2017-09-01 22:27:27.804579 7f68837e5800  1 leveldb: Delete type=3 #28053
2017-09-01 22:27:35.586725 7f68837e5800  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-09-01 22:27:35.587689 7f68837e5800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 9998729216 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-09-01 22:27:35.589631 7f68837e5800  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 18: 9998729216 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-09-01 22:27:35.590041 7f68837e5800  1 filestore(/var/lib/ceph/osd/ceph-0) upgrade
2017-09-01 22:27:35.590149 7f68837e5800 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find #-1:7b3f43c4:::osd_superblock:0# in index: (2) No such file or directory
2017-09-01 22:27:35.590158 7f68837e5800 -1 osd.0 0 OSD::init() : unable to read osd superblock
2017-09-01 22:27:35.590547 7f68837e5800  1 journal close /var/lib/ceph/osd/ceph-0/journal
2017-09-01 22:27:35.611595 7f68837e5800 -1  ** ERROR: osd init failed: (22) Invalid argument

Recovered drive is mounted on /var/lib/ceph/osd/ceph-0.

# df
Filesystem      1K-blocks      Used  Available Use% Mounted on
udev                10240         0      10240   0% /dev
tmpfs             1584780      9172    1575608   1% /run
/dev/sda1        15247760   9319048    5131120  65% /
tmpfs             3961940         0    3961940   0% /dev/shm
tmpfs                5120         0       5120   0% /run/lock
tmpfs             3961940         0    3961940   0% /sys/fs/cgroup
/dev/sdb1      1952559676 634913968 1317645708  33% /var/lib/ceph/osd/ceph-0
/dev/sde1      1952559676 640365952 1312193724  33% /var/lib/ceph/osd/ceph-6
/dev/sdd1      1952559676 712018768 1240540908  37% /var/lib/ceph/osd/ceph-2
/dev/sdc1      1952559676 755827440 1196732236  39% /var/lib/ceph/osd/ceph-1
/dev/sdf1       312417560  42538060  269879500  14% /var/lib/ceph/osd/ceph-7
tmpfs              792392         0     792392   0% /run/user/0

# cd /var/lib/ceph/osd/ceph-0
# ls
activate.monmap  current  journal_uuid  magic          superblock  whoami
active           fsid     keyring       ready          sysvinit
ceph_fsid        journal  lost+found    store_version  type

Regards,
Hong

    On Friday, September 1, 2017 2:59 PM, hjcho616 <[email protected]> wrote:
 

 Found the partition, wasn't able to mount the partition right away... Did a 
xfs_repair on that drive.  
Got bunch of messages like this.. =(

entry "100000a89fd.00000000__head_AE319A25__0" in shortform directory 845908970 references non-existent inode 605294241
junking entry "100000a89fd.00000000__head_AE319A25__0" in directory inode 845908970

Was able to mount.  lost+found has lots of files there. =P  Running du seems to 
show OK files in current directory.

Will it be safe to attach this one back to the cluster?  Is there a way to 
specify to use this drive if the data is missing? =)  Or am I being paranoid?  
Just plug it? =)

Regards,
Hong

    On Friday, September 1, 2017 9:01 AM, hjcho616 <[email protected]> wrote:
 

 Looks like it has been rescued... Only 1 error as we saw before in the smart 
log!

# ddrescue -f /dev/sda /dev/sdc ./rescue.log
GNU ddrescue 1.21
Press Ctrl-C to interrupt
     ipos:    1508 GB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:    1508 GB, non-scraped:        0 B,  average rate:  88985 kB/s
non-tried:        0 B,     errsize:     4096 B,      run time:  6h 14m 40s
  rescued:    2000 GB,      errors:        1,  remaining time:         n/a
percent rescued:  99.99%      time since last successful read:         39s
Finished

Still missing partition in the new drive. =P  I found this util called testdisk 
for broken partition tables.  Will try that tonight. =P

Regards,
Hong
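
To put the ddrescue numbers above in proportion, here is a small worked calculation. It is a sketch under stated assumptions: the function name `unread_summary` is hypothetical, "2000 GB" is taken as decimal gigabytes exactly as printed (the real size is rounded in the output), and a 512-byte logical sector is assumed, which the thread does not confirm.

```python
def unread_summary(errsize_bytes, disk_bytes, sector_bytes=512):
    """Express ddrescue's errsize as a sector count and a rescued percentage."""
    return {
        # sector_bytes=512 is an assumption about the drive, not from the thread
        "sectors": errsize_bytes // sector_bytes,
        "percent_rescued": 100.0 * (1 - errsize_bytes / disk_bytes),
    }

if __name__ == "__main__":
    # errsize and rescued size as printed in the ddrescue output above
    summary = unread_summary(errsize_bytes=4096, disk_bytes=2000 * 10**9)
    print(summary["sectors"], "sectors unread")
    print(f"{summary['percent_rescued']:.7f}% rescued")
```

Under those assumptions the unread area is a single 4096-byte block, so the "99.99%" in the status line appears to be a display limit rather than the real shortfall.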
 

    On Wednesday, August 30, 2017 9:18 AM, Ronny Aasen 
<[email protected]> wrote:
 

  On 30.08.2017 15:32, Steve Taylor wrote:
  
 
I'm not familiar with dd_rescue, but I've just been reading about it. I'm not 
seeing any features that would be beneficial in this scenario that aren't also 
available in dd. What specific features give it "really a far better chance of 
restoring a copy of your disk" than dd? I'm always interested in learning about 
new recovery tools. 
 i see i wrote dd_rescue from old habit, but the package one should use on 
debian is gddrescue, also called GNU ddrescue. 
 
 this page has some details on the differences between dd and the ddrescue 
variants. 
 http://www.toad.com/gnu/sysadmin/index.html#ddrescue
 
 kind regards
 Ronny Aasen
 
 
 
     
Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799

If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.

  
 On Tue, 2017-08-29 at 21:49 +0200, Willem Jan Withagen wrote: 
 On 29-8-2017 19:12, Steve Taylor wrote:

Hong,

Probably your best chance at recovering any data without special, expensive, 
forensic procedures is to perform a dd from /dev/sdb to somewhere else large 
enough to hold a full disk image and attempt to repair that. You'll want to 
use 'conv=noerror' with your dd command since your disk is failing. Then you 
could either re-attach the OSD from the new source or attempt to retrieve 
objects from the filestore on it.

Like somebody else already pointed out: in problem cases like this disk, use 
dd_rescue. It has really a far better chance of restoring a copy of your disk.

--WjW

I have actually done this before by creating an RBD that matches the disk 
size, performing the dd, running xfs_repair, and eventually adding it back to 
the cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for 
repair only, but I'm happy to report that it worked flawlessly in my case. I 
was able to weight the OSD to 0, offload all of its data, then remove it for a 
full recovery, at which point I just deleted the RBD.

The possibilities afforded by Ceph inception are endless. ☺

On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote:

Rule of thumb with batteries is:
- the closer to “proper temperature” you run them, the more life you get out 
  of them
- the more the battery is overpowered for your application, the longer it will 
  survive.

Get yourself an LSI 94** controller and use it as HBA and you will be fine. 
But get MORE DRIVES !!!!! …

On 28 Aug 2017, at 23:10, hjcho616 <[email protected]> wrote:

Thank you Tomasz and Ronny.  I'll have to order some hdd soon and try these 
out.  Car battery idea is nice!  I may try that.. =)  Do they last longer?  
Ones that fit the UPS original battery spec didn't last very long... part of 
the reason why I gave up on them.. =P  My wife probably won't like the idea of 
car battery hanging out though ha!

The OSD1 (one with mostly ok OSDs, except that smart failure) motherboard 
doesn't have any additional SATA connectors available.  Would it be safe to 
add another OSD host?

Regards,
Hong

On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz <[email protected]> wrote:

Sorry for being brutal … anyway
1. get the battery for UPS (a car battery will do as well, I’ve modded one UPS 
   in the past with a truck battery and it was working like a charm :D)
2. get spare drives and put those in because your cluster CAN NOT get out of 
   error due to lack of space
3. follow the advice of Ronny Aasen on how to recover data from hard drives
4. get cooling to drives or you will lose more!

On 28 Aug 2017, at 22:39, hjcho616 <[email protected]> wrote:

Tomasz,

Those machines are behind a surge protector.  Doesn't appear to be a good 
one!  I do have a UPS... but it is my fault... no battery.  Power was pretty 
reliable for a while... and UPS was just beeping every chance it had, 
disrupting some sleep.. =P  So running on surge protector only.  I am running 
this in home environment.  So far, HDD failures have been very rare for this 
environment. =)  It just doesn't get loaded as much!  I am not sure what to 
expect, seeing that "unfound" and just a feeling of possibility of maybe 
getting OSD back made me excited about it. =)  Thanks for letting me know what 
should be the priority.  I just lack experience and knowledge in this. =)  
Please do continue to guide me through this.

Thank you for the decode of those smart messages!  I do agree that it looks 
like it is on its way out.  I would like to know how to get a good portion of 
it back if possible. =)

I think I just set the size and min_size to 1.

# ceph osd lspools
0 data,1 metadata,2 rbd,
# ceph osd pool set rbd size 1
set pool 2 size to 1
# ceph osd pool set rbd min_size 1
set pool 2 min_size to 1

Seems to be doing some backfilling work.

# ceph health
HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 2 pgs 
backfill_toofull; 74 pgs backfill_wait; 3 pgs backfilling; 108 pgs degraded; 6 
pgs down; 6 pgs inconsistent; 6 pgs peering; 7 pgs recovery_wait; 16 pgs 
stale; 108 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 130 
pgs stuck unclean; 101 pgs stuck undersized; 101 pgs undersized; 1 requests 
are blocked > 32 sec; recovery 1790657/4502340 objects degraded (39.772%); 
recovery 641906/4502340 objects misplaced (14.257%); recovery 147/2251990 
unfound (0.007%); 50 scrub errors; mds cluster is degraded; no legacy OSD 
present but 'sortbitwise' flag is not set

Regards,
Hong

On Monday, August 28, 2017 4:18 PM, Tomasz Kusmierz <[email protected]> wrote:

So to decode few things about your disk:

  1 Raw_Read_Error_Rate     0x002f  100  100  051    Pre-fail Always      -      3737 read errors and only one sector marked as pending - fun disk :/

181 Program_Fail_Cnt_Total  0x0022  099  099  000    Old_age Always      -      35325174
So firmware has quite few bugs, that’s nice

191 G-Sense_Error_Rate      0x0022  100  100  000    Old_age Always      -      2855
disk was thrown around while operational, even more nice.

194 Temperature_Celsius     0x0002  047  041  000    Old_age Always      -      53 (Min/Max 15/59)
if your disk passes 50 you should not consider using it, high temperatures 
demagnetise plate layer and you will see more errors in very near future.

197 Current_Pending_Sector  0x0032  100  100  000    Old_age Always      -      1
as mentioned before :)

200 Multi_Zone_Error_Rate   0x002a  100  100  000    Old_age Always      -      4222
your heads keep missing tracks … bent ? I don’t even know how to comment here.

generally fun drive you’ve got there … rescue as much as you can and throw it 
away !!!
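
The checks described in the decode above can be sketched in a few lines. This is an illustration only: `parse_attr` and `warnings` are made-up helper names, the input is assumed to be attribute rows in the ten-column layout quoted in this thread, and the 50 °C limit is the rule of thumb given above rather than a vendor specification.

```python
def parse_attr(line):
    """Parse one SMART attribute row into its id, name and integer raw value."""
    fields = line.split(None, 9)          # raw value is the 10th column
    return {
        "id": int(fields[0]),
        "name": fields[1],
        # the raw column may carry extras such as "(Min/Max 15/59)"
        "raw": int(fields[9].split()[0]),
    }

def warnings(lines, max_temp_c=50):
    """Return names of attributes that trip the rules of thumb from the thread."""
    flagged = []
    for line in lines:
        attr = parse_attr(line)
        if attr["name"] == "Temperature_Celsius" and attr["raw"] > max_temp_c:
            flagged.append(attr["name"])
        elif attr["name"] == "Current_Pending_Sector" and attr["raw"] > 0:
            flagged.append(attr["name"])
    return flagged

if __name__ == "__main__":
    rows = [
        "194 Temperature_Celsius    0x0002  047  041  000    Old_age Always      -      53 (Min/Max 15/59)",
        "197 Current_Pending_Sector  0x0032  100  100  000    Old_age Always      -      1",
    ]
    print(warnings(rows))
```

Both rows quoted in this thread would be flagged: the drive reports 53 °C against the 50 °C rule, and one sector is pending reallocation.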


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
  