Now the problem is that Ceph has marked two disks out because scrub has failed on them (I think it is not a disk fault but a consequence of the mark-complete). How can I:

- disable scrub
- put the two disks back in?

(A rough sketch of the commands I have in mind follows.)
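Something like the following, assuming the two OSDs were only marked down/out and not removed from the cluster (the <id> placeholders stand for the two affected OSDs, whose numbers I still have to confirm):

    # stop all scrubbing cluster-wide until things are stable (remember to unset later)
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # check which OSDs are currently down/out
    ceph osd tree

    # restart the affected OSD daemons on their hosts (init-system dependent), e.g.
    #   systemctl start ceph-osd@<id>    or    service ceph start osd.<id>

    # mark them back in
    ceph osd in <id>

    # once the cluster is healthy again
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub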
I will wait for the end of recovery anyway, to be sure it really works again.

On Wed, 29 Jun 2016 at 11:16, Mario Giammarco <mgiamma...@gmail.com> wrote:

> In fact I am worried because:
>
> 1) Ceph runs under Proxmox, and Proxmox may decide to reboot a server if it
> is not responding
> 2) probably a server was rebooted while Ceph was reconstructing
> 3) even using max=3 does not help
>
> Anyway, this is the "unofficial" procedure that I am using, much simpler
> than the blog post:
>
> 1) find the host where the pg is
> 2) stop Ceph on that host
> 3) ceph-objectstore-tool --pgid 1.98 --op mark-complete --data-path
> /var/lib/ceph/osd/ceph-9 --journal-path /var/lib/ceph/osd/ceph-9/journal
> 4) start Ceph
> 5) watch it finally reconstruct
>
> On Wed, 29 Jun 2016 at 11:11, Oliver Dzombic <i...@ip-interactive.de> wrote:
>
>> Hi,
>>
>> removing ONE disk while your replication is 2 is no problem.
>>
>> You don't need to wait a single second to replace or remove it. It is
>> not used anyway and is out/down, so from Ceph's point of view it does
>> not exist.
>>
>> ----------------
>>
>> But as Christian already told you, what we see now fits a scenario
>> where you lost the OSD and either you did something, or something else
>> happened, but the data were never recovered.
>>
>> Either because another OSD was broken, or because you did something.
>>
>> Maybe, because of the "too many PGs per OSD (307 > max 300)", Ceph never
>> recovered.
>>
>> What I can see from http://pastebin.com/VZD7j2vN is that
>> OSDs 5, 13, 9, 0, 6, 2, 3, and maybe others, are the OSDs holding the
>> incomplete data.
>>
>> That is 7 OSDs out of 10. So something happened to those OSDs or to the
>> data on them, and that has nothing to do with a single disk failing.
>>
>> Something else must have happened.
>>
>> And as Christian already wrote: you will have to go back through your
>> logs to the point where things went wrong.
>>
>> Because the failure of a single OSD, no matter what your replication
>> size is, can (normally) not harm the consistency of 7 other OSDs, i.e.
>> 70% of your total cluster.
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Address:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402, Amtsgericht Hanau
>> Managing director: Oliver Dzombic
>>
>> Tax no.: 35 236 3622 1
>> VAT ID: DE274086107
>>
>> On 29.06.2016 at 10:56, Mario Giammarco wrote:
>> > Yes, I removed it from CRUSH because it was broken. I waited 24 hours
>> > to see if Ceph would heal itself. Then I removed the disk completely
>> > (it was broken...) and waited another 24 hours. Then I started getting
>> > worried.
>> > Are you saying that I should not remove a broken disk from the
>> > cluster? Were 24 hours not enough?
>> >
>> > On Wed, 29 Jun 2016 at 10:53, Zoltan Arnold Nagy
>> > <zol...@linux.vnet.ibm.com> wrote:
>> >
>> > Just losing one disk doesn't automagically delete it from CRUSH,
>> > but in the output you had 10 disks listed, so there must be
>> > something else going on - did you delete the disk from the crush
>> > map as well?
>> >
>> > Ceph waits by default 300 seconds (AFAIK) to mark an OSD out, after
>> > which it will start to recover.
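For reference, the automatic out-marking mentioned above is governed by the mon_osd_down_out_interval option (300 seconds by default, as noted) and can be suppressed during planned maintenance with the noout flag; a minimal sketch (mon.0 is a placeholder for one of your monitors, and the config query has to run on that monitor's host):

    # keep down OSDs from being marked out automatically during maintenance
    ceph osd set noout
    # ... reboot / repair ...
    ceph osd unset noout

    # inspect the currently configured interval on a monitor host
    ceph daemon mon.0 config get mon_osd_down_out_interval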
>> >> On 29 Jun 2016, at 10:42, Mario Giammarco <mgiamma...@gmail.com> wrote:
>> >>
>> >> I thank you for your reply, so I can add my experience:
>> >>
>> >> 1) the other time this happened to me I had a cluster with
>> >> min_size=2 and size=3 and the problem was the same. That time I
>> >> set min_size=1 to recover the pool but it did not help. So I do
>> >> not understand what the advantage of keeping three copies is when
>> >> Ceph can decide to discard all three.
>> >> 2) I started with 11 HDDs. One hard disk failed. Ceph waited
>> >> forever for the hard disk to come back, but the disk is really
>> >> completely broken, so I followed the procedure to actually delete
>> >> it from the cluster. Ceph still did not recover.
>> >> 3) I have 307 PGs per OSD, more than the 300 limit, but that is
>> >> because I had 11 HDDs and now only 10. I will add more HDDs after
>> >> I repair the pool.
>> >> 4) I have reduced the monitors to 3.
>> >>
>> >> On Wed, 29 Jun 2016 at 10:25, Christian Balzer <ch...@gol.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> On Wed, 29 Jun 2016 06:02:59 +0000 Mario Giammarco wrote:
>> >>
>> >> > pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> >> That "size 2" above is the root cause of all your woes.
>> >> The default replication size is 3 for a reason, and while I do run
>> >> pools with a replication of 2, they are either HDD RAIDs or
>> >> extremely trustworthy and well-monitored SSDs.
>> >>
>> >> That said, something more than a single HDD failure must have
>> >> happened here; you should check the logs and retrace all the steps
>> >> you took after that OSD failed.
>> >>
>> >> You said there were 11 HDDs, and your first ceph -s output showed:
>> >> ---
>> >> osdmap e10182: 10 osds: 10 up, 10 in
>> >> ---
>> >> And your crush map states the same.
>> >>
>> >> So how and WHEN did you remove that OSD?
>> >> My suspicion would be that it was removed before recovery was
>> >> complete.
>> >>
>> >> Also, as I think was mentioned before, 7 mons are overkill; 3-5
>> >> would be a saner number.
>> >>
>> >> Christian
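For reference, the replication of an existing pool can be checked and raised with the standard pool commands; a sketch, using the pool name rbd from the output quoted above (raising size to 3 triggers a full re-replication and needs enough free space and a healthy cluster):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2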
>> >> > rjenkins pg_num 512 pgp_num 512 last_change 9313 flags hashpspool
>> >> > stripe_width 0
>> >> > removed_snaps [1~3]
>> >> > pool 1 'rbd2' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> >> > rjenkins pg_num 512 pgp_num 512 last_change 9314 flags hashpspool
>> >> > stripe_width 0
>> >> > removed_snaps [1~3]
>> >> > pool 2 'rbd3' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> >> > rjenkins pg_num 512 pgp_num 512 last_change 10537 flags hashpspool
>> >> > stripe_width 0
>> >> > removed_snaps [1~3]
>> >> >
>> >> > ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR
>> >> >  5 1.81000 1.00000   1857G  984G  872G  53.00 0.86
>> >> >  6 1.81000 1.00000   1857G 1202G  655G  64.73 1.05
>> >> >  2 1.81000 1.00000   1857G 1158G  698G  62.38 1.01
>> >> >  3 1.35999 1.00000   1391G  906G  485G  65.12 1.06
>> >> >  4 0.89999 1.00000    926G  702G  223G  75.88 1.23
>> >> >  7 1.81000 1.00000   1857G 1063G  793G  57.27 0.93
>> >> >  8 1.81000 1.00000   1857G 1011G  846G  54.44 0.88
>> >> >  9 0.89999 1.00000    926G  573G  352G  61.91 1.01
>> >> >  0 1.81000 1.00000   1857G 1227G  629G  66.10 1.07
>> >> > 13 0.45000 1.00000    460G  307G  153G  66.74 1.08
>> >> >    TOTAL            14846G 9136G 5710G  61.54
>> >> > MIN/MAX VAR: 0.86/1.23  STDDEV: 6.47
>> >> >
>> >> > ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> >> >
>> >> > http://pastebin.com/SvGfcSHb
>> >> > http://pastebin.com/gYFatsNS
>> >> > http://pastebin.com/VZD7j2vN
>> >> >
>> >> > I do not understand why I/O on the ENTIRE cluster is blocked when
>> >> > only a few pgs are incomplete.
>> >> >
>> >> > Many thanks,
>> >> > Mario
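For reference, a few read-only commands that should show which PGs are stuck and which OSDs the blocked requests are waiting on (pg 1.98 is only the example id used elsewhere in this thread):

    ceph health detail          # lists the incomplete/down pgs and the OSDs with blocked requests
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    ceph pg 1.98 query          # peering state, including which down OSDs it would need to probe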
>> >> > On Tue, 28 Jun 2016 at 19:34, Stefan Priebe - Profihost AG
>> >> > <s.pri...@profihost.ag> wrote:
>> >> >
>> >> > > And ceph health detail
>> >> > >
>> >> > > Stefan
>> >> > >
>> >> > > Excuse my typo, sent from my mobile phone.
>> >> > >
>> >> > > On 28.06.2016 at 19:28, Oliver Dzombic <i...@ip-interactive.de> wrote:
>> >> > >
>> >> > > Hi Mario,
>> >> > >
>> >> > > please give some more details:
>> >> > >
>> >> > > Please post the output of:
>> >> > >
>> >> > > ceph osd pool ls detail
>> >> > > ceph osd df
>> >> > > ceph --version
>> >> > >
>> >> > > ceph -w for 10 seconds ( use http://pastebin.com/ please )
>> >> > >
>> >> > > ceph osd crush dump ( also pastebin please )
>> >> > >
>> >> > > --
>> >> > > Mit freundlichen Gruessen / Best regards
>> >> > >
>> >> > > Oliver Dzombic
>> >> > > IP-Interactive
>> >> > >
>> >> > > mailto:i...@ip-interactive.de
>> >> > >
>> >> > > On 28.06.2016 at 18:59, Mario Giammarco wrote:
>> >> > >
>> >> > > Hello,
>> >> > >
>> >> > > this is the second time this has happened to me; I hope that
>> >> > > someone can explain what I can do.
>> >> > >
>> >> > > Proxmox Ceph cluster with 8 servers, 11 HDDs. min_size=1, size=2.
>> >> > >
>> >> > > One HDD goes down due to bad sectors.
>> >> > >
>> >> > > Ceph recovers, but it ends up with:
>> >> > >
>> >> > > cluster f2a8dd7d-949a-4a29-acab-11d4900249f4
>> >> > >  health HEALTH_WARN
>> >> > >         3 pgs down
>> >> > >         19 pgs incomplete
>> >> > >         19 pgs stuck inactive
>> >> > >         19 pgs stuck unclean
>> >> > >         7 requests are blocked > 32 sec
>> >> > >  monmap e11: 7 mons at
>> >> > >         {0=192.168.0.204:6789/0,1=192.168.0.201:6789/0,
>> >> > >         2=192.168.0.203:6789/0,3=192.168.0.205:6789/0,
>> >> > >         4=192.168.0.202:6789/0,5=192.168.0.206:6789/0,
>> >> > >         6=192.168.0.207:6789/0}
>> >> > >         election epoch 722, quorum 0,1,2,3,4,5,6 1,4,2,0,3,5,6
>> >> > >  osdmap e10182: 10 osds: 10 up, 10 in
>> >> > >  pgmap v3295880: 1024 pgs, 2 pools, 4563 GB data, 1143 kobjects
>> >> > >        9136 GB used, 5710 GB / 14846 GB avail
>> >> > >        1005 active+clean
>> >> > >          16 incomplete
>> >> > >           3 down+incomplete
>> >> > >
>> >> > > Unfortunately "7 requests are blocked" means no virtual machine
>> >> > > can boot, because Ceph has stopped I/O.
>> >> > >
>> >> > > I can accept losing some data, but not ALL data!
>> >> > >
>> >> > > Can you help me please?
>> >> > >
>> >> > > Thanks,
>> >> > > Mario
>> >>
>> >> --
>> >> Christian Balzer        Network/Systems Engineer
>> >> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> >> http://www.gol.com/
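For completeness, the "unofficial" mark-complete procedure described at the top of this thread, written out as a sketch. The pg id and OSD paths are the ones quoted in the thread, the service commands depend on the init system, and mark-complete tells Ceph to accept whatever that OSD holds for the pg, so it can silently lose data and is a last resort:

    # 1) find which OSDs/host currently hold the incomplete pg
    ceph pg map 1.98

    # 2) stop the OSD daemon on that host
    systemctl stop ceph-osd@9        # or: service ceph stop osd.9

    # 3) mark the pg complete in that OSD's store
    ceph-objectstore-tool --pgid 1.98 --op mark-complete \
        --data-path /var/lib/ceph/osd/ceph-9 \
        --journal-path /var/lib/ceph/osd/ceph-9/journal

    # 4) start the OSD again and watch recovery
    systemctl start ceph-osd@9       # or: service ceph start osd.9
    ceph -w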
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com