I have read the post "incomplete pgs, oh my" many times, but I think my
case is different: the broken disk is completely dead.
So how can I simply mark the incomplete PGs as complete?
Should I stop Ceph first?
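
Based on Tomasz's description below, I guess it is something like the
following sketch, where the PG ID 1.28 and osd.5 are placeholders for my
real values; I am just not sure whether the whole cluster or only the
affected OSD has to be stopped:

    # stop the OSD that currently hosts the incomplete PG
    systemctl stop ceph-osd@5      # or: service ceph stop osd.5

    # mark the PG as complete in the offline object store
    ceph-objectstore-tool \
        --data-path /var/lib/ceph/osd/ceph-5 \
        --journal-path /var/lib/ceph/osd/ceph-5/journal \
        --pgid 1.28 --op mark-complete

    # start the OSD again
    systemctl start ceph-osd@5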


On Wed, Jun 29, 2016 at 09:36, Tomasz Kuzemko <tomasz.kuze...@corp.ovh.com> wrote:

> Hi,
> if you need fast access to your remaining data, you can use
> ceph-objectstore-tool to mark those PGs as complete; however, this will
> irreversibly lose the missing data.
>
> If you understand the risks, the procedure is explained quite well here:
> http://ceph.com/community/incomplete-pgs-oh-my/
>
> Since that article was written, ceph-objectstore-tool has gained a
> feature that was not available at the time: "--op mark-complete". I
> think in your case it will be necessary to call --op mark-complete after
> you import the PG into the temporary OSD (between steps 12 and 13).
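>
> As a rough sketch, assuming your build of the tool already includes
> mark-complete (the PG ID 1.28 and the temporary OSD's data/journal
> paths are placeholders you must adjust), the extra step between 12 and
> 13, run while the temporary OSD is still stopped, would look like:
>
>     ceph-objectstore-tool \
>         --data-path /var/lib/ceph/osd/ceph-100 \
>         --journal-path /var/lib/ceph/osd/ceph-100/journal \
>         --pgid 1.28 --op mark-complete
>
> After that, continue with the remaining steps of the article.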
>
> On 29.06.2016 09:09, Mario Giammarco wrote:
> > Now I have also discovered that someone has, by mistake, put production
> > data on a virtual machine in the cluster. I need Ceph to resume I/O so
> > that I can boot that virtual machine.
> > Can I mark the incomplete PGs as valid?
> > If needed, where can I buy paid support?
> > Thanks again,
> > Mario
> >
> > On Wed, Jun 29, 2016 at 08:02, Mario Giammarco
> > <mgiamma...@gmail.com> wrote:
> >
> >     pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0
> >     object_hash rjenkins pg_num 512 pgp_num 512 last_change 9313 flags
> >     hashpspool stripe_width 0
> >            removed_snaps [1~3]
> >     pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0
> >     object_hash rjenkins pg_num 512 pgp_num 512 last_change 9314 flags
> >     hashpspool stripe_width 0
> >            removed_snaps [1~3]
> >     pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0
> >     object_hash rjenkins pg_num 512 pgp_num 512 last_change 10537 flags
> >     hashpspool stripe_width 0
> >            removed_snaps [1~3]
> >
> >
> >     ID WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR
> >     5 1.81000  1.00000  1857G  984G  872G 53.00 0.86
> >     6 1.81000  1.00000  1857G 1202G  655G 64.73 1.05
> >     2 1.81000  1.00000  1857G 1158G  698G 62.38 1.01
> >     3 1.35999  1.00000  1391G  906G  485G 65.12 1.06
> >     4 0.89999  1.00000   926G  702G  223G 75.88 1.23
> >     7 1.81000  1.00000  1857G 1063G  793G 57.27 0.93
> >     8 1.81000  1.00000  1857G 1011G  846G 54.44 0.88
> >     9 0.89999  1.00000   926G  573G  352G 61.91 1.01
> >     0 1.81000  1.00000  1857G 1227G  629G 66.10 1.07
> >     13 0.45000  1.00000   460G  307G  153G 66.74 1.08
> >                  TOTAL 14846G 9136G 5710G 61.54
> >     MIN/MAX VAR: 0.86/1.23  STDDEV: 6.47
> >
> >
> >
> >     ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
> >
> >     http://pastebin.com/SvGfcSHb
> >     http://pastebin.com/gYFatsNS
> >     http://pastebin.com/VZD7j2vN
> >
> >     I do not understand why I/O on the ENTIRE cluster is blocked when
> >     only a few PGs are incomplete.
> >
> >     Many thanks,
> >     Mario
> >
> >
> >     On Tue, Jun 28, 2016 at 19:34, Stefan Priebe - Profihost
> >     AG <s.pri...@profihost.ag> wrote:
> >
> >         And ceph health detail
> >
> >         Stefan
> >
> >         Excuse my typos; sent from my mobile phone.
> >
> >         On 28.06.2016 at 19:28, Oliver Dzombic
> >         <i...@ip-interactive.de> wrote:
> >
> >>         Hi Mario,
> >>
> >>         please give some more details:
> >>
> >>         Please post the output of:
> >>
> >>         ceph osd pool ls detail
> >>         ceph osd df
> >>         ceph --version
> >>
> >>         ceph -w for 10 seconds (please use http://pastebin.com/)
> >>
> >>         ceph osd crush dump (also via pastebin, please)
> >>
> >>         --
> >>         Mit freundlichen Gruessen / Best regards
> >>
> >>         Oliver Dzombic
> >>         IP-Interactive
> >>
> >>         mailto:i...@ip-interactive.de
> >>
> >>         Address:
> >>
> >>         IP Interactive UG (haftungsbeschraenkt)
> >>         Zum Sonnenberg 1-3
> >>         63571 Gelnhausen
> >>
> >>         HRB 93402, registered at Amtsgericht Hanau
> >>         Managing director: Oliver Dzombic
> >>
> >>         Tax no.: 35 236 3622 1
> >>         VAT ID: DE274086107
> >>
> >>
> >>         On 28.06.2016 at 18:59, Mario Giammarco wrote:
> >>>         Hello,
> >>>         this is the second time this has happened to me; I hope
> >>>         someone can explain what I can do.
> >>>         It is a Proxmox Ceph cluster with 8 servers and 11 HDDs,
> >>>         with min_size=1 and size=2.
> >>>
> >>>         One HDD went down due to bad sectors.
> >>>         Ceph recovered, but it ended up with:
> >>>
> >>>         cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
> >>>             health HEALTH_WARN
> >>>                    3 pgs down
> >>>                    19 pgs incomplete
> >>>                    19 pgs stuck inactive
> >>>                    19 pgs stuck unclean
> >>>                    7 requests are blocked > 32 sec
> >>>             monmap e11: 7 mons at
> >>>         {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,
> >>>         2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,
> >>>         4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,
> >>>         6=192.168.0.207:6789/0}
> >>>                    election epoch 722, quorum
> >>>         0,1,2,3,4,5,6 1,4,2,0,3,5,6
> >>>             osdmap e10182: 10 osds: 10 up, 10 in
> >>>              pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143
> >>>         kobjects
> >>>                    9136 GB used, 5710 GB / 14846 GB avail
> >>>                        1005 active+clean
> >>>                          16 incomplete
> >>>                           3 down+incomplete
> >>>
> >>>         Unfortunately, "7 requests are blocked" means that no
> >>>         virtual machine can boot, because Ceph has stopped I/O.
> >>>
> >>>         I can accept losing some data, but not ALL data!
> >>>         Can you help me please?
> >>>         Thanks,
> >>>         Mario
> >>>
> --
> Tomasz Kuzemko
> tomasz.kuze...@corp.ovh.com
>
>
