After making that setting, the pg appeared to start peering, but then it
actually changed the primary OSD to osd.100 and went incomplete again.
Perhaps it did that because another OSD had more data? I presume I need to
set that value on each osd the pg hops to.
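
To see where it hops, I've been checking the current acting primary before
repeating the injectargs/down sequence from David's reply below (rough
example; the pg id is just one of my stuck ones):

    ceph pg map 12.7a1                         # prints the up and acting sets
    ceph pg 12.7a1 query | grep -m1 '"state"'  # current pg state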

-Ben

On Tue, Mar 8, 2016 at 10:39 AM, David Zafman <dzaf...@redhat.com> wrote:

>
> Ben,
>
> I haven't looked at everything in your message, but pg 12.7a1 has lost data
> because of writes that went only to osd.73.  The way to recover this is to
> force recovery to ignore this fact and go with whatever data you have on
> the remaining OSDs.
> I assume this was caused by having min_size 1, multiple nodes failing while
> clients continued to write, and then permanently losing osd.73.
>
> You should TEMPORARILY set osd_find_best_info_ignore_history_les config
> variable to 1 on osd.36 and then mark it down (ceph osd down), so it will
> rejoin, re-peer and mark the pg active+clean.  Don't forget to set
> osd_find_best_info_ignore_history_les
> back to 0.
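>
> Roughly (adjust as needed; if injectargs doesn't apply this option at
> runtime on 9.2.1, set it in ceph.conf for osd.36 and restart that daemon
> instead):
>
>   ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les=1'
>   ceph osd down 36
>   ceph pg 12.7a1 query | grep -m1 '"state"'   # watch for active+clean
>   ceph tell osd.36 injectargs '--osd_find_best_info_ignore_history_les=0'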
>
>
> Later you should fix your crush map.  See
> http://docs.ceph.com/docs/master/rados/operations/crush-map/
>
> The wrong placements make you vulnerable to a single host failure taking
> out multiple copies of an object.
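>
> For example, if osd.26 really belongs on cld-mtl-004, something like this
> would move it in the crush map (the weight is taken from your 'osd tree'
> paste and I'm assuming the root bucket is named 'default'; expect some
> rebalancing afterwards):
>
>   ceph osd crush set osd.26 1.81 root=default host=cld-mtl-004
>   ceph osd tree    # verify the new placement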
>
> David
>
>
> On 3/7/16 9:41 PM, Ben Hines wrote:
>
> Howdy,
>
> I was hoping someone could help me recover a couple of pgs which are causing
> problems in my cluster. If we aren't able to resolve this soon, we may have
> to just destroy them and lose some data. Recovery has so far been
> unsuccessful. Data loss would probably cause some people here to reconsider
> Ceph as something we'll stick with long term, so I'd love to recover it.
>
> Ceph 9.2.1. I have 4 (well, 3 now) pgs which are incomplete + stuck peering
> after a disk failure.
>
> pg 12.7a1 query: https://gist.github.com/benh57/ba4f96103e1f6b3b7a4d
> pg 12.7b query: https://gist.github.com/benh57/8db0bfccc5992b9ca71a
> pg 10.4f query:  https://gist.github.com/benh57/44bdd2a19ea667d920ab
> ceph osd tree: https://gist.github.com/benh57/9fc46051a0f09b6948b7
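>
> For the summary view I've mostly been looking at something along these
> lines (nothing special, just filtering the health output):
>
>   ceph health detail | grep -E 'incomplete|peering'
>   ceph pg dump_stuck inactive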
>
> - The bad OSD (osd-73) was on mtl-024. There were no 'unfound' objects when
> it went down; the pg was 'down + peering'. It was marked lost.
> - After marking 73 lost, the new primary still wants to peer and flips
> between peering and incomplete.
> - Noticed '73' still shows in the pg query output for the bad pgs. (Maybe I
> need to bring back an osd with the same name?)
> - Noticed that the new primary got set to an osd (osd-77) which was on the
> same node as osd-76, which had all the data.  Figuring 77 couldn't peer
> with 36 because it was on the same node, I set 77 out; 36 became primary
> and 76 became one of the replicas. No change.
>
> Startup logs of the primaries of the bad pgs (12.7a1, 10.4f), captured with
> 'debug osd = 20, debug filestore = 30, debug ms = 1' (large files):
>
> osd 36 (12.7a1) startup log:
> https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.36.log
> osd 6 (10.4f) startup log:
> https://raw.githubusercontent.com/benh57/cephdebugging/master/ceph-osd.6.log
>
>
> Some other Notes:
>
> - Searching for OSDs which had data in 12.7a1_head, I found that osd-76 has
> 12G while the primary, osd-36, has only 728M. Another OSD which is out (100)
> also has a copy of the data.  Even running a pg repair does not pick up the
> data from 76; the pg remains stuck peering.
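>
> The search itself was just a du over the filestore pg directories on each
> node, plus a repair attempt; roughly, assuming the default /var/lib/ceph
> layout:
>
>   du -sh /var/lib/ceph/osd/ceph-*/current/12.7a1_head
>   ceph pg repair 12.7a1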
>
> - One of the pgs was part of a pool which was no longer needed (the unused
> radosgw .rgw.control pool, with one 0kb object in it). Per previous steps
> discussed here for a similar failure, I attempted these recovery steps on
> it, to see if they would work for the others:
>
> -- The failed osd's disk only mounts read-only, which causes
> ceph-objectstore-tool to fail to export, so I exported the pg from a
> seemingly good copy on another osd
> -- stopped all osds
> -- exported the pg with ceph-objectstore-tool from an apparently good OSD
> (rough command lines after this list)
> -- removed the pg from all osds which had it, using ceph-objectstore-tool
> -- imported the pg into an out osd, osd-100:
>
>   Importing pgid 4.95
> Write 4/88aa5c95/notify.2/head
> Import successful
>
> -- Force recreated the pg on the cluster:
>            ceph pg force_create_pg 4.95
> -- brought up all osds
> -- new pg 4.95 primary gets set to osd-99 + osd-64, 0 objects
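>
> The ceph-objectstore-tool invocations were along these lines (run with the
> osds stopped; the data/journal paths are the defaults and <id> stands in
> for the relevant osd ids):
>
>   # export from the apparently good copy
>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
>     --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
>     --pgid 4.95 --op export --file /tmp/pg.4.95.export
>   # remove it from every osd that still had it
>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
>     --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
>     --pgid 4.95 --op remove
>   # import into the out osd, osd-100
>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 \
>     --journal-path /var/lib/ceph/osd/ceph-100/journal \
>     --pgid 4.95 --op import --file /tmp/pg.4.95.export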
>
> However, the object doesn't sync to the pg from osd-100; instead, osd-64
> tells osd-100 to remove its copy:
>
> 2016-03-05 15:44:22.858147 7fc004168700 20 osd.100 68025 _dispatch
> 0x7fc020867660 osd pg remove(epoch 68025; pg4.95; ) v2
> 2016-03-05 15:44:22.858174 7fc004168700  7 osd.100 68025 handle_pg_remove
> from osd.64 on 1 pgs
> 2016-03-05 15:44:22.858176 7fc004168700 15 osd.100 68025
> require_same_or_newer_map 68025 (i am 68025) 0x7fc020867660
> 2016-03-05 15:44:22.858188 7fc004168700  5 osd.100 68025
> queue_pg_for_deletion: 4.95
> 2016-03-05 15:44:22.858228 7fc004168700 15 osd.100 68025 project_pg_history
> 4.95 from 68025 to 68025, start ec=76 les/c/f 62655/62611/0
> 66982/67983/66982
>
> Not wanting this to happen to the needed data in the other PGs, I didn't
> try this procedure with those PGs. After the procedure, osd-100 does get
> listed in 'pg query' as 'might_have_unfound', but ceph apparently decides
> not to use it and the active osd sends a remove.
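>
> That entry shows up in the recovery_state section of the query output;
> something like this pulls it out:
>
>   ceph pg 4.95 query | grep -A 6 might_have_unfound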
>
> Output of 'ceph pg 4.95 query' after these recovery steps:
> https://gist.github.com/benh57/fc9a847cd83f4d5e4dcf
>
>
> Quite Possibly Related:
>
> I am occasionally noticing some incorrectness in 'ceph osd tree'. It seems
> my crush map thinks some osds are on the wrong hosts. I wonder if this is
> why peering is failing? For example:
>  -5   9.04999     host cld-mtl-006
>  12   1.81000         osd.12               up  1.00000          1.00000
>  13   1.81000         osd.13               up  1.00000          1.00000
>  14   1.81000         osd.14               up  1.00000          1.00000
>  94   1.81000         osd.94               up  1.00000          1.00000
>  26   1.81000         osd.26               up  0.86775          1.00000
>
> ^^ This host only has 4 osds on it! osd.26 is actually running over on
> cld-mtl-004!    Restarting 26 fixed the map.
> osd.42 (out) was also in the wrong place in 'osd tree': the tree says it's
> on cld-mtl-013, but it's actually on cld-mtl-024.
> - Fixing these issues caused a large re-balance, so 'ceph health detail' is
> a bit dirty right now, but you can still see the stuck pgs in the 'ceph
> health detail' output.
>
> -  I wonder if these incorrect crush placements caused ceph to put some
> data on the wrong osds, resulting in a peering failure later when the map
> repaired itself?
> -  How does ceph determine which node an OSD is on? That process may be
> periodically failing due to some issue (dns?). (my rough understanding,
> plus a way to check, is sketched after this list)
> -  Perhaps if I enable the 'allow peer to same host' setting, the cluster
> could repair itself? Then I could turn it off again.
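>
> On the second question, my rough understanding (please correct me) is that
> each osd reports its own crush location when it starts, driven by the
> hostname it sees and these settings, so a bad hostname/dns lookup at boot
> could explain the misplacement. The hook line is just an example path:
>
>   # ceph.conf, [osd] section
>   osd crush update on start = true
>   # osd crush location hook = /usr/local/bin/my-crush-location
>
>   # check what the cluster has recorded for a given osd:
>   ceph osd metadata 26 | grep hostname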
>
>
> Any assistance is appreciated!
>
> -Ben
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
