Hi Nick!
Yes, I did. :(
Do you know how I can fix it?

Nick Fisk <n...@fisk.me.uk> wrote (on Tue, 14 Jun 2016 at 7:52):

> Did you enable the sortbitwise flag as per the upgrade instructions, as
> there is a known bug with it? I don't know why these instructions haven't
> been amended in light of this bug.
>
> http://tracker.ceph.com/issues/16113
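>
> If it is set, unsetting it again may be worth trying until the fix is
> released; roughly something like this (standard ceph CLI, but please
> check the tracker discussion first):
>
>     ceph osd dump | grep flags      # check whether "sortbitwise" is listed
>     ceph osd unset sortbitwise      # clear the flag as a workaround
>     ceph -w                         # watch whether the unfound count drops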
>
>
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Csaba Tóth
> > Sent: 13 June 2016 16:17
> > To: ceph-us...@ceph.com
> > Subject: [ceph-users] strange unfounding of PGs
> >
> > Hi!
> >
> > I have a very strange problem. On Friday night I upgraded my small ceph
> > cluster from hammer to jewel. Everything went well, but chowning the OSD
> > data dirs took a long time, so I skipped two OSDs and used the
> > run-as-root trick. Yesterday evening I wanted to fix this, so I shut
> > down the first OSD and chowned its lib/ceph dir. But when I started it
> > again, a lot of strange "objects unfound" errors appeared (this is just
> > a small excerpt; the chown/run-as-root commands are sketched after the
> > log):
> >
> > 2016-06-12 23:43:05.096078 osd.2 [ERR] 5.3d has 2 objects unfound and apparently lost
> > 2016-06-12 23:43:05.096915 osd.2 [ERR] 5.30 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.097702 osd.2 [ERR] 5.39 has 4 objects unfound and apparently lost
> > 2016-06-12 23:43:05.100449 osd.2 [ERR] 5.2f has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.104519 osd.2 [ERR] 1.8 has 2 objects unfound and apparently lost
> > 2016-06-12 23:43:05.106041 osd.2 [ERR] 5.3f has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.107379 osd.2 [ERR] 1.76 has 2 objects unfound and apparently lost
> > 2016-06-12 23:43:05.107630 osd.2 [ERR] 1.0 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.107661 osd.2 [ERR] 2.14 has 2 objects unfound and apparently lost
> > 2016-06-12 23:43:05.107722 osd.2 [ERR] 2.3 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.108082 osd.2 [ERR] 5.16 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.108417 osd.2 [ERR] 5.38 has 2 objects unfound and apparently lost
> > 2016-06-12 23:43:05.108910 osd.2 [ERR] 1.43 has 3 objects unfound and apparently lost
> > 2016-06-12 23:43:05.109561 osd.2 [ERR] 1.a has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.110299 osd.2 [ERR] 1.10 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.111781 osd.2 [ERR] 1.22 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.111869 osd.2 [ERR] 1.1a has 3 objects unfound and apparently lost
> > 2016-06-12 23:43:05.205688 osd.4 [ERR] 1.29 has 2 objects unfound and apparently lost
> > 2016-06-12 23:43:05.206016 osd.4 [ERR] 1.1c has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.206219 osd.4 [ERR] 5.24 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.209013 osd.4 [ERR] 1.6a has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.209421 osd.4 [ERR] 1.68 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.209597 osd.4 [ERR] 5.d has 3 objects unfound and apparently lost
> > 2016-06-12 23:43:05.209620 osd.4 [ERR] 1.9 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.210191 osd.4 [ERR] 5.62 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.210649 osd.4 [ERR] 2.57 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.212011 osd.4 [ERR] 1.6 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.212106 osd.4 [ERR] 2.b has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.212212 osd.4 [ERR] 5.8 has 1 objects unfound and apparently lost
> > 2016-06-12 23:43:05.215850 osd.4 [ERR] 2.56 has 2 objects unfound and apparently lost
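> >
> > For reference, the chown and the run-as-root workaround mentioned above
> > were roughly the following (a sketch based on the Jewel upgrade notes;
> > the paths are the defaults on my nodes):
> >
> >     # permanent fix: hand the OSD data dir over to the ceph user
> >     chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
> >
> >     # temporary "run-as-root" workaround in ceph.conf for the skipped OSDs
> >     [osd]
> >     setuser match path = /var/lib/ceph/$type/$cluster-$id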
> >
> >
> > After these error messages I see this ceph health:
> > 2016-06-12 23:44:10.498613 7f5941e0f700  0 log_channel(cluster) log [INF] :
> > pgmap v23122505: 820 pgs: 1 peering, 37 active+degraded,
> > 5 active+remapped+wait_backfill, 167 active+recovery_wait+degraded,
> > 1 active+remapped, 1 active+recovering+degraded,
> > 13 active+undersized+degraded+remapped+wait_backfill, 595 active+clean;
> > 795 GB data, 1926 GB used, 5512 GB / 7438 GB avail; 7695 B/s wr, 2 op/s;
> > 24459/3225218 objects degraded (0.758%); 44435/3225218 objects misplaced
> > (1.378%); 346/1231022 unfound (0.028%)
> >
> > A few minutes later it stalled in this state:
> > 2016-06-13 00:07:32.761265 7f5941e0f700  0 log_channel(cluster) log [INF] :
> > pgmap v23123311: 820 pgs:
> > 1 active+recovery_wait+undersized+degraded+remapped,
> > 1 active+recovering+degraded,
> > 11 active+undersized+degraded+remapped+wait_backfill,
> > 5 active+remapped+wait_backfill, 207 active+recovery_wait+degraded,
> > 595 active+clean; 795 GB data, 1878 GB used, 5559 GB / 7438 GB avail;
> > 14164 B/s wr, 3 op/s; 22562/3223912 objects degraded (0.700%);
> > 38738/3223912 objects misplaced (1.202%); 566/1231222 unfound (0.046%)
> >
> > But if I shut that OSD down I see this health (ceph actually stalls in
> > this state and does nothing further):
> > 2016-06-13 16:47:59.033552 mon.0 [INF] pgmap v23153361: 820 pgs:
> > 32 active+recovery_wait+degraded, 1 active+recovering+degraded,
> > 402 active+undersized+degraded+remapped+wait_backfill, 385 active+clean;
> > 796 GB data, 1420 GB used, 4160 GB / 5581 GB avail; 10110 B/s rd,
> > 1098 kB/s wr, 253 op/s; 692323/3215439 objects degraded (21.531%);
> > 684099/3215439 objects misplaced (21.275%); 2/1231399 unfound (0.000%)
> >
> > So I have kept that OSD shut down... that way my cluster has only 2
> > unfound objects...
> >
> > There are many more unfound objects when the OSD is up than when I shut
> > it down. I don't understand this; please help me figure out how to fix
> > it. Every RBD is still reachable (although one virtual host crashed
> > during the night), but some objects in my CephFS are starting to become
> > unavailable.
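> >
> > In case it helps, the standard way to look at the unfound objects per PG
> > seems to be something like this (pg 5.3d is just one example taken from
> > the log above):
> >
> >     ceph health detail | grep unfound    # which PGs report unfound objects
> >     ceph pg 5.3d list_missing            # which objects are missing and who might have them
> >     ceph pg 5.3d query                   # peering state, including "might_have_unfound"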
> >
> > I read about ceph-objectstore-tool and checked whether I could fix
> > anything with it; here is the output of a fix-lost operation, in case it
> > helps:
> > root@c22:/var/lib/ceph# sudo -u ceph ceph-objectstore-tool --op fix-lost --dry-run --data-path /var/lib/ceph/osd/ceph-0
> > Error getting attr on : 1.48_head,#-3:12000000:::scrub_1.48:head#, (61) No data available
> > Error getting attr on : 1.79_head,#-3:9e000000:::scrub_1.79:head#, (61) No data available
> > Error getting attr on : 2.53_head,#-4:ca000000:::scrub_2.53:head#, (61) No data available
> > Error getting attr on : 2.6b_head,#-4:d6000000:::scrub_2.6b:head#, (61) No data available
> > Error getting attr on : 2.73_head,#-4:ce000000:::scrub_2.73:head#, (61) No data available
> > Error getting attr on : 4.16_head,#-6:68000000:::scrub_4.16:head#, (61) No data available
> > Error getting attr on : 4.2d_head,#-6:b4000000:::scrub_4.2d:head#, (61) No data available
> > Error getting attr on : 4.55_head,#-6:aa000000:::scrub_4.55:head#, (61) No data available
> > Error getting attr on : 4.57_head,#-6:ea000000:::scrub_4.57:head#, (61) No data available
> > Error getting attr on : 6.17_head,#-8:e8000000:::scrub_6.17:head#, (61) No data available
> > Error getting attr on : 6.46_head,#-8:62000000:::scrub_6.46:head#, (61) No data available
> > Error getting attr on : 6.53_head,#-8:ca000000:::scrub_6.53:head#, (61) No data available
> > Error getting attr on : 6.62_head,#-8:46000000:::scrub_6.62:head#, (61) No data available
> > dry-run: Nothing changed
> >
> > Originally I wanted to run the "ceph-objectstore-tool --op
> > filestore-repair-orphan-links" that Sam suggested, but the latest 9.2.1
> > ceph didn't contain it.
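> >
> > I know the documented last resort for unfound objects is to mark them
> > lost per PG, but I would rather not do that until I understand what is
> > going on. Just as a sketch (again using pg 5.3d from the log as an
> > example; "revert" rolls back to an older version, "delete" forgets the
> > object entirely):
> >
> >     ceph pg 5.3d mark_unfound_lost revert    # or "delete"; destructive, last resort only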
> >
> > Thanks in advance!
> > Csaba
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
