I've had issues magically fix themselves overnight after waiting and trying things for hours.
On Tue, Oct 21, 2014 at 1:02 PM, Harald Rößler <harald.roess...@btd.de> wrote:

> After more than 10 hours it is still the same situation; I don't think it will fix itself over time. How can I find out what the problem is?
>
> On 21.10.2014 at 17:28, Craig Lewis <cle...@centraldesktop.com> wrote:
>
> That will fix itself over time. Remapped just means that Ceph is moving the data around. It's normal to see PGs in the remapped and/or backfilling state after OSD restarts.
>
> They should go down steadily over time. How long that takes depends on how much data is in the PGs, how fast your hardware is, how many OSDs are affected, and how much you allow recovery to impact cluster performance. Mine currently take about 20 minutes per PG. If all 47 are on the same OSD, it'll be a while. If they're evenly split between multiple OSDs, parallelism will speed that up.
>
> On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler <harald.roess...@btd.de> wrote:
>
>> Hi all,
>>
>> thank you for your support; the file system is not degraded any more. Now I have negative degradation :-)
>>
>> 2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281 active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB / 6178 GB avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded (-0.123%)
>>
>> But ceph still reports HEALTH_WARN 47 pgs stuck unclean; recovery -1638/1329293 degraded (-0.123%).
>>
>> I think this warning is reported because there are 47 active+remapped PGs. Any ideas how to fix that now?
>>
>> Kind Regards
>> Harald Roessler
>>
>> On 21.10.2014 at 01:03, Craig Lewis <cle...@centraldesktop.com> wrote:
>>
>> I've been in a state where reweight-by-utilization was deadlocked (not the daemons, but the remap scheduling). After successive osd reweight commands, two OSDs wanted to swap PGs, but they were both toofull. I ended up temporarily increasing mon_osd_nearfull_ratio to 0.87. That removed the impediment, and everything finished remapping. Everything went smoothly, and I changed it back when all the remapping finished.
>>
>> Just be careful if you need to get close to mon_osd_full_ratio. Ceph does greater-than on these percentages, not greater-than-or-equal. You really don't want the disks to go above mon_osd_full_ratio, because all external IO will stop until you resolve that.
>>
>> On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master <keks...@gmail.com> wrote:
>>
>>> You can set a lower weight on the full OSDs, or try changing the osd_near_full_ratio parameter in your cluster from 85 to, for example, 89. But I don't know what can go wrong when you do that.
>>>
>>> On 2014-10-20 at 17:12 GMT+02:00, Wido den Hollander <w...@42on.com> wrote:
>>>
>>>> On 10/20/2014 05:10 PM, Harald Rößler wrote:
>>>>> Yes, tomorrow I will get the replacement for the failed disk; getting a new node with many disks will take a few days.
>>>>> No other idea?
>>>>
>>>> If the disks are all full, then no.
>>>>
>>>> Sorry to say this, but it came down to poor capacity management. Never let any disk in your cluster fill over 80% to prevent these situations.
>>>>
>>>> Wido
>>>>
>>>>> Harald Rößler
>>>>>
>>>>>> On 20.10.2014 at 16:45, Wido den Hollander <w...@42on.com> wrote:
>>>>>>
>>>>>> On 10/20/2014 04:43 PM, Harald Rößler wrote:
>>>>>>> Yes, I had some OSDs which were near full. After that I tried to fix the problem with "ceph osd reweight-by-utilization", but this did not help.
>>>>>>> After that I set the near-full ratio to 88% with the idea that the remapping would fix the issue. A restart of the OSDs doesn't help either. At the same time I had a hardware failure of one disk. :-( After that failure the recovery process started at degraded ~13% and stops at 7%.
>>>>>>> Honestly, I am scared that I am doing the wrong operation at the moment.
>>>>>>
>>>>>> Any chance of adding a new node with some fresh disks? It seems like you are operating at the storage capacity limit of the nodes and your only remedy would be adding more spindles.
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>>> Regards
>>>>>>> Harald Rößler
>>>>>>>
>>>>>>>> On 20.10.2014 at 14:51, Wido den Hollander <w...@42on.com> wrote:
>>>>>>>>
>>>>>>>> On 10/20/2014 02:45 PM, Harald Rößler wrote:
>>>>>>>>> Dear All,
>>>>>>>>>
>>>>>>>>> at the moment I have an issue with my cluster: the recovery process stops.
>>>>>>>>
>>>>>>>> See this: 2 active+degraded+remapped+backfill_toofull
>>>>>>>>
>>>>>>>> 156 pgs backfill_toofull
>>>>>>>>
>>>>>>>> You have one or more OSDs which are too full, and that causes recovery to stop.
>>>>>>>>
>>>>>>>> If you add more capacity to the cluster, recovery will continue and finish.
>>>>>>>>
>>>>>>>>> ceph -s
>>>>>>>>>   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%)
>>>>>>>>>   monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6
>>>>>>>>>   osdmap e6748: 24 osds: 23 up, 23 in
>>>>>>>>>   pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%)
>>>>>>>>>
>>>>>>>>> I have tried to restart all OSDs in the cluster, but that does not help to finish the recovery of the cluster.
>>>>>>>>>
>>>>>>>>> Does anyone have an idea?
>>>>>>>>>
>>>>>>>>> Kind Regards
>>>>>>>>> Harald Rößler
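
For anyone hitting the same backfill_toofull state, the first step implied by the replies above is to confirm which OSDs are actually near or over the full thresholds before changing anything. A minimal sketch of that check, assuming a 2014-era (Firefly) cluster and the default OSD data paths:

    # List the warnings in full and the PGs that are stuck
    ceph health detail
    ceph pg dump_stuck unclean

    # On each OSD host, check the data partitions directly
    # (/var/lib/ceph/osd/ceph-* is the default mount point; yours may differ)
    df -h /var/lib/ceph/osd/ceph-*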
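The remedies discussed in the thread, reweighting the overly full OSDs and temporarily raising the full/backfill thresholds, look roughly like the commands below. This is only a sketch: OSD id 12 is a placeholder, the ratio values are illustrative, and the option names should be checked against your Ceph release.

    # Reduce the share of data placed on a too-full OSD (override reweight, 0.0-1.0)
    ceph osd reweight 12 0.9

    # Or reweight every OSD whose utilization exceeds the given percentage
    ceph osd reweight-by-utilization 110

    # Temporarily raise the thresholds so backfill can proceed again
    ceph tell mon.* injectargs '--mon-osd-nearfull-ratio 0.87'
    ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.88'

As Craig points out above, stay well below mon_osd_full_ratio and set the ratios back to their defaults once the remapping finishes.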