Hi,

I have set noout, noscrub and nodeep-scrub, and the last time we added osd's we added a few at a time. The main issue here is IOPS: the existing osd's are not able to backfill at a higher rate - not even 1 thread during peak hours, and at most 2 threads during off-peak. We are getting more client i/o, and the documents being ingested add more data than the space freed up by backfilling pg's to the newly added osd's.

Below is our cluster health:

     health HEALTH_WARN
            5221 pgs backfill_wait
            31 pgs backfilling
            1453 pgs degraded
            4 pgs recovering
            1054 pgs recovery_wait
            1453 pgs stuck degraded
            6310 pgs stuck unclean
            384 pgs stuck undersized
            384 pgs undersized
            recovery 130823732/9142530156 objects degraded (1.431%)
            recovery 2446840943/9142530156 objects misplaced (26.763%)
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            mon.mon_1 store is getting too big! 26562 MB >= 15360 MB
            mon.mon_2 store is getting too big! 26828 MB >= 15360 MB
            mon.mon_3 store is getting too big! 26504 MB >= 15360 MB
     monmap e1: 3 mons at {mon_1=x.x.x.x:x.yyyy/0,mon_2=x.x.x.x:yyyy/0,mon_3=x.x.x.x:yyyy/0}
            election epoch 7996, quorum 0,1,2 mon_1,mon_2,mon_3
     osdmap e194833: 105 osds: 105 up, 105 in; 5931 remapped pgs
            flags noout,nobackfill,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
      pgmap v48390703: 10536 pgs, 18 pools, 144 TB data, 2906 Mobjects
            475 TB used, 287 TB / 763 TB avail
            130823732/9142530156 objects degraded (1.431%)
            2446840943/9142530156 objects misplaced (26.763%)
                4851 active+remapped+wait_backfill
                4226 active+clean
                 659 active+recovery_wait+degraded+remapped
                 377 active+recovery_wait+degraded
                 357 active+undersized+degraded+remapped+wait_backfill
                  18 active+recovery_wait+undersized+degraded+remapped
                  16 active+degraded+remapped+backfilling
                  13 active+degraded+remapped+wait_backfill
                   9 active+undersized+degraded+remapped+backfilling
                   6 active+remapped+backfilling
                   2 active+recovering+degraded
                   2 active+recovering+degraded+remapped
  client io 11894 kB/s rd, 105 kB/s wr, 981 op/s rd, 72 op/s wr
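(For reference, a rough sketch of how these flags were set, using the standard ceph CLI; note that nobackfill also shows as set in the status above and pauses all backfill while it is in place:)

  ceph osd set noout          # don't mark osd's out if they flap during recovery
  ceph osd set noscrub        # pause regular scrubs until backfilling is done
  ceph osd set nodeep-scrub   # pause deep scrubs as well
  ceph osd set nobackfill     # pause all backfill, e.g. during peak hours
  ceph osd unset nobackfill   # resume backfill off-peak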
So, is it a good option to add new osd's on a new node with ssd's as journals?

in.linkedin.com/in/nikhilravindra

On Sun, Apr 28, 2019 at 6:05 AM Erik McCormick <emccorm...@cirrusseven.com> wrote:

> On Sat, Apr 27, 2019, 3:49 PM Nikhil R <nikh.ravin...@gmail.com> wrote:
>
>> We have baremetal nodes with 256GB RAM, 36-core CPU.
>> We are on ceph jewel 10.2.9 with leveldb.
>> The osd’s and journals are on the same hdd.
>> We have 1 backfill_max_active, 1 recovery_max_active and 1 recovery_op_priority.
>> The osd crashes and restarts once a pg is backfilled and the next pg tries to backfill. This is when we look at iostat and the disk is utilised up to 100%.
>
> I would set noout to prevent excess movement in the event of OSD flapping, and disable scrubbing and deep scrubbing until your backfilling has completed. I would also bring the new OSDs online a few at a time rather than all 25 at once if you add more servers.
>
>> Appreciate your help David
>>
>> On Sun, 28 Apr 2019 at 00:46, David C <dcsysengin...@gmail.com> wrote:
>>
>>> On Sat, 27 Apr 2019, 18:50 Nikhil R, <nikh.ravin...@gmail.com> wrote:
>>>
>>>> Guys,
>>>> We now have a total of 105 osd’s on 5 baremetal nodes, each hosting 21 osd’s on HDDs which are 7TB, with journals on HDD too. Each journal is about 5GB.
>>>
>>> This would imply you've got a separate hdd partition for journals. I don't think there's any value in that and it would probably be detrimental to performance.
>>>
>>>> We expanded our cluster last week and added 1 more node with 21 HDDs and journals on the same disks.
>>>> Our client i/o is too heavy and we are not able to backfill even 1 thread during peak hours - if we backfill during peak hours, osd's crash, causing undersized pg's, and if we have another osd crash we won't be able to use our cluster due to undersized and recovery pg's. During non-peak hours we can backfill just 8-10 pgs.
>>>> Due to this our MAX AVAIL is draining out very fast.
>>>
>>> How much ram have you got in your nodes? In my experience that's a common reason for crashing OSDs during recovery ops.
>>>
>>> What does your recovery and backfill tuning look like?
>>>
>>>> We are thinking of adding 2 more baremetal nodes with 21 * 7TB osd’s on HDD and adding 50GB SSD journals for these.
>>>> We aim to backfill from the 105 osd’s a bit faster and expect the backfill writes coming to these new osd’s to be faster.
>>>
>>> SSD journals would certainly help, just be sure it's a model that performs well with Ceph.
>>>
>>>> Is this a good viable idea?
>>>> Thoughts please?
>>>>
>>>> -Nikhil
>>>
>> --
>> Sent from my iPhone
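For context, these are the backfill/recovery throttles we run with and how we adjust them at runtime; a rough sketch (osd.0 is just an example, and the admin-socket query has to be run on the host where that osd lives):

  # check the current throttle values on one osd (run on its host)
  ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'

  # change them at runtime across all osd's without restarting them;
  # injectargs changes do not survive an osd restart, so mirror the
  # values in ceph.conf if they should stick
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'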
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com