Hello,

On 24/10/2017 at 07:49, Brad Hubbard wrote:


On Mon, Oct 23, 2017 at 4:51 PM, pascal.pu...@pci-conseil.net wrote:

    Hello,

    On 23/10/2017 at 02:05, Brad Hubbard wrote:
    2017-10-22 17:32:56.031086 7f3acaff5700 1 osd.14 pg_epoch: 72024
    pg[37.1c( v 71593'41657 (60849'38594,71593'41657] local-les=72023
    n=13 ec=7037 les/c/f 72023/72023/66447 72022/72022/72022)
    [14,1,41] r=0 lpr=72022 crt=71593'41657 lcod 0'0
    mlcod 0'0 active+clean] hit_set_trim
    37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31
    01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head not found
    2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc:
    In function 'void
    ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned
    int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
    osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

    It appears to be looking for (and failing to find) a hitset
    object with a timestamp from August? Does that sound right to
    you? Of course, it appears an object for that timestamp does not
    exist.

    How is this possible? How can I fix it? I am sure that if I run a
    lot of reads, other objects like this will crash other OSDs.
    (The cluster is OK now; I will probably destroy OSD 14 and recreate it.)
    How can I find this object?


You should be able to do a find on the OSDs filestore and grep the output for 'hit_set_37.1c_archive_2017-08-31'. I'd start with the OSDs responsible for pg 37.1c and then move on to the others if it's feasible.
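
The same search can be sketched in Python where combining find and grep is awkward. This is only an illustration: the filestore root `/var/lib/ceph/osd/ceph-1/current` is an assumption based on the default layout, and `find_objects` is a hypothetical helper name. Filestore may escape some characters (underscores, for example) in on-disk file names, so the sketch matches on a few short fragments rather than on the full object name.

```python
import os


def find_objects(root, fragments):
    """Yield paths under root whose file name contains every fragment."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if all(fragment in name for fragment in fragments):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Assumed default filestore location for osd.1; adjust to the local layout.
    root = "/var/lib/ceph/osd/ceph-1/current"
    for path in find_objects(root, ["37.1c", "archive", "2017-08-31"]):
        print(path)
```

Running it against each OSD that holds pg 37.1c first, then the others, follows the order suggested above. An empty result on an OSD means no file name there contains all three fragments.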

So with grep, I found OSD.14 (already destroyed and recreated) and OSD.1.

ceph-osd-01: /var/log/ceph/ceph-osd.1.log-20171019.gz:2017-10-18 05:37:52.793802 7f9754ec5700 -1 osd.1 pg_epoch: 71592 pg[37.1c( v 71591'41652 (60849'38594,71591'41652] local-les=71583 n=17 ec=7037 les/c/f 71583/71554/66447 71561/71578/71578) [43,26,13]/[1,41] r=0 lpr=71578 pi=71553-71577/5 luod=71590'41651 bft=13,26,43 crt=71588'41647 lcod 71589'41650 mlcod 0'0 active+undersized+degraded+remapped+wait_backfill] agent_load_hit_sets: could not load hitset 37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31 01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head
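
A rough sketch of how the OSD logs could be scanned for other hitset objects reported as missing. It only recognises the two message forms quoted in this thread (`could not load hitset` and `hit_set_trim ... not found`); the function name `missing_hitsets` and the glob pattern are assumptions for illustration, and an object that no OSD has tried to load yet will not appear in any log.

```python
import glob
import gzip
import re

# Captures the object name that follows either message quoted in this thread.
PATTERN = re.compile(
    r"(?:could not load hitset|hit_set_trim)\s+(\S+::hit_set_.*?:head)"
)


def missing_hitsets(log_paths):
    """Return the set of hitset object names reported in the given logs."""
    found = set()
    for path in log_paths:
        # Rotated logs are gzip-compressed; current logs are plain text.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                match = PATTERN.search(line)
                if match:
                    found.add(match.group(1))
    return found


if __name__ == "__main__":
    # Assumed log location, taken from the path shown in the grep output.
    for name in sorted(missing_hitsets(glob.glob("/var/log/ceph/ceph-osd.*.log*"))):
        print(name)
```

Each distinct name printed is a candidate to look for on the OSDs of the corresponding placement group.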

Should I destroy OSD 1 and recreate it as well to force the data to move, or just reweight the OSD to force the move?

How can I find other objects with the same issue? (Just restart rsync and see?)

Another question: I usually run a nightly cron job with fstrim on the RBD disk. Could that be the cause of the problem?

Let us know the results.


--
        *Performance Conseil Informatique*
Pascal Pucci
Infrastructure Consultant
pascal.pu...@pci-conseil.net
Mobile: 06 51 47 84 98
Office: 02 85 52 41 81
http://www.performance-conseil-informatique.net

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
