Now the problem is that Ceph has marked two disks out because scrub has failed on them (I think it is not a disk fault but a consequence of the mark-complete). How can I:

- disable scrub
- put the two disks back in?

(A rough sketch of the commands I have in mind follows.)
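Something like the following, assuming the two OSDs were only marked down/out and not removed from the cluster (the <id> placeholders stand for the two affected OSDs, whose numbers I still have to confirm):

    # stop all scrubbing cluster-wide until things are stable (remember to unset later)
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # check which OSDs are currently down/out
    ceph osd tree

    # restart the affected OSD daemons on their hosts (init-system dependent), e.g.
    #   systemctl start ceph-osd@<id>    or    service ceph start osd.<id>

    # mark them back in
    ceph osd in <id>

    # once the cluster is healthy again
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub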
I will wait for the end of recovery anyway, to be sure it really works again.

On Wed, 29 Jun 2016 at 11:16, Mario Giammarco <mgiamma...@gmail.com> wrote:

> In fact I am worried because:
>
> 1) Ceph runs under Proxmox, and Proxmox may decide to reboot a server if it
> is not responding
> 2) probably a server was rebooted while Ceph was reconstructing
> 3) even using max=3 does not help
>
> Anyway, this is the "unofficial" procedure that I am using, much simpler
> than the blog post:
>
> 1) find the host where the pg is
> 2) stop Ceph on that host
> 3) ceph-objectstore-tool --pgid 1.98 --op mark-complete --data-path
> /var/lib/ceph/osd/ceph-9 --journal-path /var/lib/ceph/osd/ceph-9/journal
> 4) start Ceph
> 5) watch it finally reconstruct
>
> On Wed, 29 Jun 2016 at 11:11, Oliver Dzombic <i...@ip-interactive.de> wrote:
>
>> Hi,
>>
>> removing ONE disk while your replication is 2 is no problem.
>>
>> You don't need to wait a single second to replace or remove it. It is
>> not used anyway and is out/down, so from Ceph's point of view it does
>> not exist.
>>
>> ----------------
>>
>> But as Christian already told you, what we see now fits a scenario
>> where you lost the OSD and either you did something, or something else
>> happened, but the data were never recovered.
>>
>> Either because another OSD was broken, or because you did something.
>>
>> Maybe, because of the "too many PGs per OSD (307 > max 300)", Ceph never
>> recovered.
>>
>> What I can see from http://pastebin.com/VZD7j2vN is that
>> OSDs 5, 13, 9, 0, 6, 2, 3, and maybe others, are the OSDs holding the
>> incomplete data.
>>
>> That is 7 OSDs out of 10. So something happened to those OSDs or to the
>> data on them, and that has nothing to do with a single disk failing.
>>
>> Something else must have happened.
>>
>> And as Christian already wrote: you will have to go back through your
>> logs to the point where things went wrong.
>>
>> Because the failure of a single OSD, no matter what your replication
>> size is, can (normally) not harm the consistency of 7 other OSDs, i.e.
>> 70% of your total cluster.
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Address:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402, Amtsgericht Hanau
>> Managing director: Oliver Dzombic
>>
>> Tax no.: 35 236 3622 1
>> VAT ID: DE274086107
>>
>> On 29.06.2016 at 10:56, Mario Giammarco wrote:
>> > Yes, I removed it from CRUSH because it was broken. I waited 24 hours
>> > to see if Ceph would heal itself. Then I removed the disk completely
>> > (it was broken...) and waited another 24 hours. Then I started getting
>> > worried.
>> > Are you saying that I should not remove a broken disk from the
>> > cluster? Were 24 hours not enough?
>> >
>> > On Wed, 29 Jun 2016 at 10:53, Zoltan Arnold Nagy
>> > <zol...@linux.vnet.ibm.com> wrote:
>> >
>> > Just losing one disk doesn't automagically delete it from CRUSH,
>> > but in the output you had 10 disks listed, so there must be
>> > something else going on - did you delete the disk from the crush
>> > map as well?
>> >
>> > Ceph waits by default 300 seconds (AFAIK) to mark an OSD out, after
>> > which it will start to recover.
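For reference, the automatic out-marking mentioned above is governed by the mon_osd_down_out_interval option (300 seconds by default, as noted) and can be suppressed during planned maintenance with the noout flag; a minimal sketch (mon.0 is a placeholder for one of your monitors, and the config query has to run on that monitor's host):

    # keep down OSDs from being marked out automatically during maintenance
    ceph osd set noout
    # ... reboot / repair ...
    ceph osd unset noout

    # inspect the currently configured interval on a monitor host
    ceph daemon mon.0 config get mon_osd_down_out_interval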
>> >> On 29 Jun 2016, at 10:42, Mario Giammarco <mgiamma...@gmail.com> wrote:
>> >>
>> >> I thank you for your reply, so I can add my experience:
>> >>
>> >> 1) the other time this happened to me I had a cluster with
>> >> min_size=2 and size=3 and the problem was the same. That time I
>> >> set min_size=1 to recover the pool but it did not help. So I do
>> >> not understand what the advantage of keeping three copies is when
>> >> Ceph can decide to discard all three.
>> >> 2) I started with 11 HDDs. One hard disk failed. Ceph waited
>> >> forever for the hard disk to come back, but the disk is really
>> >> completely broken, so I followed the procedure to actually delete
>> >> it from the cluster. Ceph still did not recover.
>> >> 3) I have 307 PGs per OSD, more than the 300 limit, but that is
>> >> because I had 11 HDDs and now only 10. I will add more HDDs after
>> >> I repair the pool.
>> >> 4) I have reduced the monitors to 3.
>> >>
>> >> On Wed, 29 Jun 2016 at 10:25, Christian Balzer <ch...@gol.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> On Wed, 29 Jun 2016 06:02:59 +0000 Mario Giammarco wrote:
>> >>
>> >> > pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> >> That "size 2" above is the root cause of all your woes.
>> >> The default replication size is 3 for a reason, and while I do run
>> >> pools with a replication of 2, they are either HDD RAIDs or
>> >> extremely trustworthy and well-monitored SSDs.
>> >>
>> >> That said, something more than a single HDD failure must have
>> >> happened here; you should check the logs and retrace all the steps
>> >> you took after that OSD failed.
>> >>
>> >> You said there were 11 HDDs, and your first ceph -s output showed:
>> >> ---
>> >> osdmap e10182: 10 osds: 10 up, 10 in
>> >> ---
>> >> And your crush map states the same.
>> >>
>> >> So how and WHEN did you remove that OSD?
>> >> My suspicion would be that it was removed before recovery was
>> >> complete.
>> >>
>> >> Also, as I think was mentioned before, 7 mons are overkill; 3-5
>> >> would be a saner number.
>> >>
>> >> Christian
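For reference, the replication of an existing pool can be checked and raised with the standard pool commands; a sketch, using the pool name rbd from the output quoted above (raising size to 3 triggers a full re-replication and needs enough free space and a healthy cluster):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2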
>> >> > rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool
>> >> > stripe_width 0
>> >> > removed_snaps [1~3]
>> >> > pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> >> > rjenkins pg_num 512 pgp_num 512 last_change 9314 flags hashpspool
>> >> > stripe_width 0
>> >> > removed_snaps [1~3]
>> >> > pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> >> > rjenkins pg_num 512 pgp_num 512 last_change 10537 flags hashpspool
>> >> > stripe_width 0
>> >> > removed_snaps [1~3]
>> >> >
>> >> > ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR
>> >> >  5 1.81000 1.00000   1857G  984G  872G  53.00 0.86
>> >> >  6 1.81000 1.00000   1857G 1202G  655G  64.73 1.05
>> >> >  2 1.81000 1.00000   1857G 1158G  698G  62.38 1.01
>> >> >  3 1.35999 1.00000   1391G  906G  485G  65.12 1.06
>> >> >  4 0.89999 1.00000    926G  702G  223G  75.88 1.23
>> >> >  7 1.81000 1.00000   1857G 1063G  793G  57.27 0.93
>> >> >  8 1.81000 1.00000   1857G 1011G  846G  54.44 0.88
>> >> >  9 0.89999 1.00000    926G  573G  352G  61.91 1.01
>> >> >  0 1.81000 1.00000   1857G 1227G  629G  66.10 1.07
>> >> > 13 0.45000 1.00000    460G  307G  153G  66.74 1.08
>> >> >    TOTAL            14846G 9136G 5710G  61.54
>> >> > MIN/MAX VAR: 0.86/1.23  STDDEV: 6.47
>> >> >
>> >> > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> >> >
>> >> > http://pastebin.com/SvGfcSHb
>> >> > http://pastebin.com/gYFatsNS
>> >> > http://pastebin.com/VZD7j2vN
>> >> >
>> >> > I do not understand why I/O on the ENTIRE cluster is blocked when
>> >> > only a few pgs are incomplete.
>> >> >
>> >> > Many thanks,
>> >> > Mario
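For reference, a few read-only commands that should show which PGs are stuck and which OSDs the blocked requests are waiting on (pg 1.98 is only the example id used elsewhere in this thread):

    ceph health detail          # lists the incomplete/down pgs and the OSDs with blocked requests
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    ceph pg 1.98 query          # peering state, including which down OSDs it would need to probe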
>> >> > On Tue, 28 Jun 2016 at 19:34, Stefan Priebe - Profihost AG
>> >> > <s.pri...@profihost.ag> wrote:
>> >> >
>> >> > > And ceph health detail
>> >> > >
>> >> > > Stefan
>> >> > >
>> >> > > Excuse my typo, sent from my mobile phone.
>> >> > >
>> >> > > On 28.06.2016 at 19:28, Oliver Dzombic <i...@ip-interactive.de> wrote:
>> >> > >
>> >> > > Hi Mario,
>> >> > >
>> >> > > please give some more details:
>> >> > >
>> >> > > Please post the output of:
>> >> > >
>> >> > > ceph osd pool ls detail
>> >> > > ceph osd df
>> >> > > ceph --version
>> >> > >
>> >> > > ceph -w for 10 seconds ( use http://pastebin.com/ please )
>> >> > >
>> >> > > ceph osd crush dump ( also pastebin please )
>> >> > >
>> >> > > --
>> >> > > Mit freundlichen Gruessen / Best regards
>> >> > >
>> >> > > Oliver Dzombic
>> >> > > IP-Interactive
>> >> > >
>> >> > > mailto:i...@ip-interactive.de
>> >> > >
>> >> > > On 28.06.2016 at 18:59, Mario Giammarco wrote:
>> >> > >
>> >> > > Hello,
>> >> > >
>> >> > > this is the second time this has happened to me; I hope that
>> >> > > someone can explain what I can do.
>> >> > >
>> >> > > Proxmox Ceph cluster with 8 servers, 11 HDDs. min_size=1, size=2.
>> >> > >
>> >> > > One HDD goes down due to bad sectors.
>> >> > >
>> >> > > Ceph recovers, but it ends up with:
>> >> > >
>> >> > > cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
>> >> > >  health HEALTH_WARN
>> >> > >         3 pgs down
>> >> > >         19 pgs incomplete
>> >> > >         19 pgs stuck inactive
>> >> > >         19 pgs stuck unclean
>> >> > >         7 requests are blocked > 32 sec
>> >> > >  monmap e11: 7 mons at
>> >> > >         {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,
>> >> > >         2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,
>> >> > >         4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,
>> >> > >         6=192.168.0.207:6789/0}
>> >> > >         election epoch 722, quorum 0,1,2,3,4,5,6 1,4,2,0,3,5,6
>> >> > >  osdmap e10182: 10 osds: 10 up, 10 in
>> >> > >  pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143 kobjects
>> >> > >        9136 GB used, 5710 GB / 14846 GB avail
>> >> > >        1005 active+clean
>> >> > >          16 incomplete
>> >> > >           3 down+incomplete
>> >> > >
>> >> > > Unfortunately "7 requests are blocked" means no virtual machine
>> >> > > can boot, because Ceph has stopped I/O.
>> >> > >
>> >> > > I can accept losing some data, but not ALL data!
>> >> > >
>> >> > > Can you help me please?
>> >> > >
>> >> > > Thanks,
>> >> > > Mario
>> >>
>> >> --
>> >> Christian Balzer        Network/Systems Engineer
>> >> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> >> http://www.gol.com/
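For completeness, the "unofficial" mark-complete procedure described at the top of this thread, written out as a sketch. The pg id and OSD paths are the ones quoted in the thread, the service commands depend on the init system, and mark-complete tells Ceph to accept whatever that OSD holds for the pg, so it can silently lose data and is a last resort:

    # 1) find which OSDs/host currently hold the incomplete pg
    ceph pg map 1.98

    # 2) stop the OSD daemon on that host
    systemctl stop ceph-osd@9        # or: service ceph stop osd.9

    # 3) mark the pg complete in that OSD's store
    ceph-objectstore-tool --pgid 1.98 --op mark-complete \
        --data-path /var/lib/ceph/osd/ceph-9 \
        --journal-path /var/lib/ceph/osd/ceph-9/journal

    # 4) start the OSD again and watch recovery
    systemctl start ceph-osd@9       # or: service ceph start osd.9
    ceph -w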
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com