After making that setting, the pg appeared to start peering, but then it actually changed the primary OSD to osd.100 and went incomplete again. Perhaps it did that because another OSD had more data? I presume I need to set that value on each OSD the pg hops to.
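In case it helps anyone following along, this is the sequence I'm planning to try next. It's only a sketch: I'm assuming injectargs will apply osd_find_best_info_ignore_history_les without a restart (if not, the same setting in ceph.conf plus an OSD restart should do it), and the OSD ids below are just whatever is in my current acting set.

    # see which OSDs the pg currently maps to (osd.100 showed up here)
    ceph pg map 12.7a1

    # set the flag on every OSD in the up/acting set
    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les=1'
    ceph tell osd.100 injectargs '--osd_find_best_info_ignore_history_les=1'

    # kick the current primary so the pg re-peers while the flag is in effect
    ceph osd down 100

    # once the pg goes active+clean, turn the flag back off everywhere
    ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les=0'
    ceph tell osd.100 injectargs '--osd_find_best_info_ignore_history_les=0'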
-Ben

On Tue, Mar 8, 2016 at 10:39 AM, David Zafman <dzaf...@redhat.com> wrote:
>
> Ben,
>
> I haven't looked at everything in your message, but pg 12.7a1 has lost data because of writes that went only to osd.73. The way to recover this is to force recovery to ignore this fact and go with whatever data you have on the remaining OSDs.
>
> I assume that having min_size 1, having multiple nodes fail while clients continued to write, and then permanently losing osd.73 caused this.
>
> You should TEMPORARILY set the osd_find_best_info_ignore_history_les config variable to 1 on osd.36 and then mark it down (ceph osd down), so it will rejoin, re-peer and mark the pg active+clean. Don't forget to set osd_find_best_info_ignore_history_les back to 0.
>
> Later you should fix your crush map. See http://docs.ceph.com/docs/master/rados/operations/crush-map/
>
> The wrong placements make you vulnerable to a single host failure taking out multiple copies of an object.
>
> David
>
>
> On 3/7/16 9:41 PM, Ben Hines wrote:
>
> Howdy,
>
> I was hoping someone could help me recover a couple of pgs which are causing problems in my cluster. If we aren't able to resolve this soon, we may have to just destroy them and lose some data. Recovery has so far been unsuccessful. Data loss would probably cause some here to reconsider Ceph as something we'll stick with long term, so I'd love to recover it.
>
> Ceph 9.2.1. I have 4 (well, 3 now) pgs which are incomplete + stuck peering after a disk failure.
>
> pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
> pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
> pg 10.4f query: https://gist.github.com/benh57/44bdd2a19ea667d920ab
> ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7
>
> - The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when it went down; the pg was 'down + peering'. It was marked lost.
> - After marking 73 lost, the new primary still wants to peer and flips between peering and incomplete.
> - Noticed '73' still shows in the pg query output for the bad pgs. (Maybe I need to bring back an osd with the same name?)
> - Noticed that the new primary got set to an osd (osd-77) which was on the same node as osd-76, which had all the data. Figuring 77 couldn't peer with 36 because it was on the same node, I set 77 out; 36 became primary and 76 became one of the replicas. No change.
>
> Startup logs of the primaries of the bad pgs (12.7a1, 10.4f) with 'debug osd = 20, debug filestore = 30, debug ms = 1' (large files):
>
> osd 36 (12.7a1) startup log: https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
> osd 6 (10.4f) startup log: https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log
>
> Some other notes:
>
> - Searching for OSDs which had data in 12.7a1_head, I found that osd-76 has 12G, but primary osd-36 has 728M. Another OSD which is out (100) also has a copy of the data. Even running a pg repair does not pick up the data from 76; the pg remains stuck peering.
>
> - One of the pgs was part of a pool which was no longer needed (the unused radosgw .rgw.control pool, with one 0kb object in it). Per previous steps discussed here for a similar failure, I attempted these recovery steps on it to see if they would work for the others:
>
> -- The failed osd disk only mounts read-only, which causes ceph-objectstore-tool to fail to export, so I exported it from a seemingly good copy on another osd.
> -- stopped all osds
> -- exported the pg with objectstore-tool from an apparently good OSD
> -- removed the pg from all osds which had it using objectstore-tool
> -- imported the pg into an out osd, osd-100
>
> Importing pgid 4.95
> Write 4/88aa5c95/notify.2/head
> Import successful
>
> -- Force recreated the pg on the cluster:
> ceph pg force_create_pg 4.95
> -- brought up all osds
> -- new pg 4.95 primary gets set to osd-99 + osd-64, 0 objects
>
> However, the object doesn't sync to the pg from osd-100; instead osd-64 tells osd-100 to remove it:
>
> 2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch 0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
> 2016-03-05 15:44:22.858174 7fc004168700 7 osd.100 68025 handle_pg_remove from osd.64 on 1 pgs
> 2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025 require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
> 2016-03-05 15:44:22.858188 7fc004168700 5 osd.100 68025 queue_pg_for_deletion: 4.95
> 2016-03-05 15:44:22.858228 7fc004168700 15 osd.100 68025 project_pg_history 4.95 from 68025 to 68025, start ec=76 les/c/f 62655/62611/0 66982/67983/66982
>
> Not wanting this to happen to my needed data from the other pgs, I didn't try this procedure with those pgs. After this procedure osd-100 does get listed in 'pg query' as 'might_have_unfound', but ceph apparently decides not to use it and the active osd sends a remove.
>
> Output of 'ceph pg 4.95 query' after these recovery steps: https://gist.github.com/benh57/fc9a847cd83f4d5e4dcf
>
>
> Quite possibly related:
>
> I am occasionally noticing some incorrectness in 'ceph osd tree'. It seems my crush map thinks some osds are on the wrong hosts. I wonder if this is why peering is failing?
> (example)
> -5  9.04999     host cld-mtl-006
> 12  1.81000         osd.12   up  1.00000  1.00000
> 13  1.81000         osd.13   up  1.00000  1.00000
> 14  1.81000         osd.14   up  1.00000  1.00000
> 94  1.81000         osd.94   up  1.00000  1.00000
> 26  1.81000         osd.26   up  0.86775  1.00000
>
> ^^ This host only has 4 osds on it! osd.26 is actually running over on cld-mtl-004! Restarting 26 fixed the map.
> osd.42 (out) was also in the wrong place in 'osd tree'. The tree says it's on cld-mtl-013; it's actually on cld-mtl-024.
> - Fixing these issues caused a large re-balance, so 'ceph health detail' is a bit dirty right now, but you can see the stuck pgs:
> ceph health detail:
>
> - I wonder if these incorrect crush maps caused ceph to put some data on the wrong osds, resulting in a peering failure later when the map repaired itself?
> - How does ceph determine what node an OSD is on? That process may be periodically failing due to some issue. (dns?)
> - Perhaps if I enable the 'allow peer to same host' setting, the cluster could repair? Then I could turn it off again.
>
> Any assistance is appreciated!
>
> -Ben
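P.S. For the crush map fix David suggested, this is roughly what I intend to do; treat it as a sketch, since the weight and host names below are just taken from my own 'ceph osd tree' output. My understanding (worth confirming) is that each OSD re-registers its own crush location at startup based on its hostname, via 'ceph osd crush create-or-move', which would explain why restarting osd.26 moved it back under the right host.

    # dump and decompile the current crush map to inspect the host buckets
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # after editing crushmap.txt so every osd sits under its real host:
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # or move a single misplaced osd directly, e.g.:
    ceph osd crush create-or-move osd.26 1.81 root=default host=cld-mtl-004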
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com