Hi, this might be http://tracker.ceph.com/issues/22464. Can you check the OSD log files to see if the reported checksum is 0x6706be76?
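A quick way to check, assuming the same log location your tail commands below already use (/var/log/ceph/ceph-osd.N.log), would be something along these lines on each node:

  # look for the checksum value from that ticket in all OSD logs on this node
  grep -H '0x6706be76' /var/log/ceph/ceph-osd.*.log

If that value shows up in the checksum error lines around the failed scrubs, it is very likely that bug and not failing disks.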
Paul

> On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata <mbald...@hsamiata.it> wrote:
>
> Hello
>
> I have a little ceph cluster with 3 nodes, each with 3x1TB HDD and 1x240GB
> SSD. I created this cluster after the Luminous release, so all OSDs are
> Bluestore. In my crush map I have two rules, one targeting the SSDs and one
> targeting the HDDs. I have 4 pools, one using the SSD rule and the others
> using the HDD rule; three pools are size=3 min_size=2, one is size=2
> min_size=1 (this one holds content that is OK to lose).
>
> In the last 3 months I have had a strange random problem. I scheduled my OSD
> scrubs during the night (osd scrub begin hour = 20, osd scrub end hour = 7),
> when the office is closed, so there is low impact on the users. Some mornings,
> when I check the cluster health, I find:
>
> HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
> OSD_SCRUB_ERRORS X scrub errors
> PG_DAMAGED Possible data damage: Y pgs inconsistent
>
> X and Y are sometimes 1, sometimes 2.
>
> I issue a ceph health detail, check the damaged PGs, and run a ceph pg repair
> for the damaged PGs; I get
>
> instructing pg PG on osd.N to repair
>
> The PG is different, the OSD that has to repair it is different, even the node
> hosting the OSD is different; I made a list of all PGs and OSDs. This morning
> is the most recent case:
>
> ceph health detail
> HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
> OSD_SCRUB_ERRORS 2 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>     pg 13.65 is active+clean+inconsistent, acting [4,2,6]
>     pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>
> ceph pg repair 13.65
> instructing pg 13.65 on osd.4 to repair
>
> (node-2)> tail /var/log/ceph/ceph-osd.4.log
> 2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair starts
> 2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair ok, 0 fixed
>
> ceph pg repair 14.31
> instructing pg 14.31 on osd.8 to repair
>
> (node-3)> tail /var/log/ceph/ceph-osd.8.log
> 2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair starts
> 2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair ok, 0 fixed
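Before running the repair it can also be worth dumping what scrub actually recorded; the pool and PG id below are just the ones from your example, and the details are only available while the PG is still flagged inconsistent:

  # list the PGs of a pool that are currently inconsistent
  rados list-inconsistent-pg cephnix

  # show which object/shard failed the scrub and why, for one inconsistent PG
  rados list-inconsistent-obj 14.31 --format=json-pretty

If it is a read/checksum error on a single, different shard every time, that again points at the tracker issue above rather than at one particular disk.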
> I made a list of when I got OSD_SCRUB_ERRORS, which PG was affected and which
> OSD had to repair it. Dates are dd/mm/yyyy:
>
> 21/12/2017 -- pg 14.29 is active+clean+inconsistent, acting [6,2,4]
>
> 18/01/2018 -- pg 14.5a is active+clean+inconsistent, acting [6,4,1]
>
> 22/01/2018 -- pg 9.3a is active+clean+inconsistent, acting [2,7]
>
> 29/01/2018 -- pg 13.3e is active+clean+inconsistent, acting [4,6,1]
>               instructing pg 13.3e on osd.4 to repair
>
> 07/02/2018 -- pg 13.7e is active+clean+inconsistent, acting [8,2,5]
>               instructing pg 13.7e on osd.8 to repair
>
> 09/02/2018 -- pg 13.30 is active+clean+inconsistent, acting [7,3,2]
>               instructing pg 13.30 on osd.7 to repair
>
> 15/02/2018 -- pg 9.35 is active+clean+inconsistent, acting [1,8]
>               instructing pg 9.35 on osd.1 to repair
>               pg 13.3e is active+clean+inconsistent, acting [4,6,1]
>               instructing pg 13.3e on osd.4 to repair
>
> 17/02/2018 -- pg 9.2d is active+clean+inconsistent, acting [7,5]
>               instructing pg 9.2d on osd.7 to repair
>
> 22/02/2018 -- pg 9.24 is active+clean+inconsistent, acting [5,8]
>               instructing pg 9.24 on osd.5 to repair
>
> 28/02/2018 -- pg 13.65 is active+clean+inconsistent, acting [4,2,6]
>               instructing pg 13.65 on osd.4 to repair
>               pg 14.31 is active+clean+inconsistent, acting [8,3,1]
>               instructing pg 14.31 on osd.8 to repair
>
> If it can be useful, my ceph.conf is here:
>
> [global]
> auth client required = none
> auth cluster required = none
> auth service required = none
> fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
> cluster network = 10.10.10.0/24
> public network = 10.10.10.0/24
> keyring = /etc/pve/priv/$cluster.$name.keyring
> mon allow pool delete = true
> osd journal size = 5120
> osd pool default min size = 2
> osd pool default size = 3
> bluestore_block_db_size = 64424509440
>
> debug asok = 0/0
> debug auth = 0/0
> debug buffer = 0/0
> debug client = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug filer = 0/0
> debug filestore = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug journal = 0/0
> debug journaler = 0/0
> debug lockdep = 0/0
> debug mds = 0/0
> debug mds balancer = 0/0
> debug mds locker = 0/0
> debug mds log = 0/0
> debug mds log expire = 0/0
> debug mds migrator = 0/0
> debug mon = 0/0
> debug monc = 0/0
> debug ms = 0/0
> debug objclass = 0/0
> debug objectcacher = 0/0
> debug objecter = 0/0
> debug optracker = 0/0
> debug osd = 0/0
> debug paxos = 0/0
> debug perfcounter = 0/0
> debug rados = 0/0
> debug rbd = 0/0
> debug rgw = 0/0
> debug throttle = 0/0
> debug timer = 0/0
> debug tp = 0/0
>
> [osd]
> keyring = /var/lib/ceph/osd/ceph-$id/keyring
> osd max backfills = 1
> osd recovery max active = 1
>
> osd scrub begin hour = 20
> osd scrub end hour = 7
> osd scrub during recovery = false
> osd scrub load threshold = 0.3
>
> [client]
> rbd cache = true
> rbd cache size = 268435456 # 256MB
> rbd cache max dirty = 201326592 # 192MB
> rbd cache max dirty age = 2
> rbd cache target dirty = 33554432 # 32MB
> rbd cache writethrough until flush = true
>
> #[mgr]
> #debug_mgr = 20
>
> [mon.pve-hs-main]
> host = pve-hs-main
> mon addr = 10.10.10.251:6789
>
> [mon.pve-hs-2]
> host = pve-hs-2
> mon addr = 10.10.10.252:6789
>
> [mon.pve-hs-3]
> host = pve-hs-3
> mon addr = 10.10.10.253:6789
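For completeness, the scrub-window settings the OSDs are actually running with can be double-checked through the admin socket on each node (osd.0 below is just an example id):

  # show the effective scrub-related options of a running OSD
  ceph daemon osd.0 config show | grep scrub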
> My ceph versions:
>
> {
>     "mon": {
>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
>     },
>     "mgr": {
>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
>     },
>     "osd": {
>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 12
>     },
>     "mds": {},
>     "overall": {
>         "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 18
>     }
> }
>
> My ceph osd tree:
>
> ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
> -1       8.93686 root default
> -6       2.94696     host pve-hs-2
>  3   hdd 0.90959         osd.3            up  1.00000 1.00000
>  4   hdd 0.90959         osd.4            up  1.00000 1.00000
>  5   hdd 0.90959         osd.5            up  1.00000 1.00000
> 10   ssd 0.21819         osd.10           up  1.00000 1.00000
> -3       2.86716     host pve-hs-3
>  6   hdd 0.85599         osd.6            up  1.00000 1.00000
>  7   hdd 0.85599         osd.7            up  1.00000 1.00000
>  8   hdd 0.93700         osd.8            up  1.00000 1.00000
> 11   ssd 0.21819         osd.11           up  1.00000 1.00000
> -7       3.12274     host pve-hs-main
>  0   hdd 0.96819         osd.0            up  1.00000 1.00000
>  1   hdd 0.96819         osd.1            up  1.00000 1.00000
>  2   hdd 0.96819         osd.2            up  1.00000 1.00000
>  9   ssd 0.21819         osd.9            up  1.00000 1.00000
>
> My pools:
>
> pool 9 'cephbackup' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5665 flags hashpspool stripe_width 0 application rbd
>     removed_snaps [1~3]
> pool 13 'cephwin' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16454 flags hashpspool stripe_width 0 application rbd
>     removed_snaps [1~5]
> pool 14 'cephnix' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16482 flags hashpspool stripe_width 0 application rbd
>     removed_snaps [1~227]
> pool 17 'cephssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 8601 flags hashpspool stripe_width 0 application rbd
>     removed_snaps [1~3]
>
> I can't understand where the problem comes from. I don't think it's hardware: if I had a failed disk, I should always have problems on the same OSD.
>
> Any ideas?
>
> Thanks
>
> --
> Marco Baldini
> H.S. Amiata Srl
> Office: 0577-779396
> Mobile: 335-8765169
> WEB: www.hsamiata.it <https://www.hsamiata.it/>
> EMAIL: mbald...@hsamiata.it <mailto:mbald...@hsamiata.it>

--
Mit freundlichen Grüßen / Best Regards

Paul Emmerich
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

Managing Director: Martin Verges
Commercial register: Amtsgericht München
VAT ID: DE310638492
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com