Hi,
I have set noout, noscrub and nodeep-scrub, and the last time we added
OSDs we added a few at a time.
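
For reference, the flags were set with the usual commands - nothing
cluster-specific here, just the standard CLI:

    ceph osd set noout
    ceph osd set noscrub
    ceph osd set nodeep-scrub
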
The main issue here is IOPS: the existing OSDs cannot backfill at a
higher rate - not even 1 backfill thread during peak hours, and at most
2 threads during off-peak. Client I/O keeps growing, and documents are
being ingested faster than space is freed up by backfilling PGs to the
newly added OSDs.
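
Our throttles are already at the minimum (osd_max_backfills,
osd_recovery_max_active and osd_recovery_op_priority all at 1). As a
sketch of what could be raised temporarily during off-peak - the values
below are purely illustrative, injected at runtime:

    # raise the backfill/recovery throttles during off-peak (illustrative values)
    ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 2'
    # and drop them back down before peak hours
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
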
Below is our cluster health
 health HEALTH_WARN
            5221 pgs backfill_wait
            31 pgs backfilling
            1453 pgs degraded
            4 pgs recovering
            1054 pgs recovery_wait
            1453 pgs stuck degraded
            6310 pgs stuck unclean
            384 pgs stuck undersized
            384 pgs undersized
            recovery 130823732/9142530156 objects degraded (1.431%)
            recovery 2446840943/9142530156 objects misplaced (26.763%)
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            mon.mon_1 store is getting too big! 26562 MB >= 15360 MB
            mon.mon_2 store is getting too big! 26828 MB >= 15360 MB
            mon.mon_3 store is getting too big! 26504 MB >= 15360 MB
     monmap e1: 3 mons at
{mon_1=x.x.x.x:yyyy/0,mon_2=x.x.x.x:yyyy/0,mon_3=x.x.x.x:yyyy/0}
            election epoch 7996, quorum 0,1,2 mon_1,mon_2,mon_3
     osdmap e194833: 105 osds: 105 up, 105 in; 5931 remapped pgs
            flags
noout,nobackfill,noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
      pgmap v48390703: 10536 pgs, 18 pools, 144 TB data, 2906 Mobjects
            475 TB used, 287 TB / 763 TB avail
            130823732/9142530156 objects degraded (1.431%)
            2446840943/9142530156 objects misplaced (26.763%)
                4851 active+remapped+wait_backfill
                4226 active+clean
                 659 active+recovery_wait+degraded+remapped
                 377 active+recovery_wait+degraded
                 357 active+undersized+degraded+remapped+wait_backfill
                  18 active+recovery_wait+undersized+degraded+remapped
                  16 active+degraded+remapped+backfilling
                  13 active+degraded+remapped+wait_backfill
                   9 active+undersized+degraded+remapped+backfilling
                   6 active+remapped+backfilling
                   2 active+recovering+degraded
                   2 active+recovering+degraded+remapped
  client io 11894 kB/s rd, 105 kB/s wr, 981 op/s rd, 72 op/s wr
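
On the "store is getting too big" warnings above: the mon leveldb stores
tend to grow during long recovery/backfill and can usually be compacted
once things settle down. A sketch of the two usual options:

    # trigger a leveldb compaction on one monitor at a time
    ceph tell mon.mon_1 compact
    # or compact automatically whenever a mon daemon starts (ceph.conf, [mon] section)
    mon compact on start = true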

So, is it a good option to add the new OSDs on a new node with SSDs as
journals?
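
One way to vet journal SSD candidates before buying is to test
single-job synchronous 4k writes, which roughly matches the journal
write pattern. A sketch with fio - the device path is a placeholder and
the test overwrites the device:

    # sync 4k write test of a journal SSD candidate (/dev/sdX is a placeholder; destroys data!)
    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
        --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
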
in.linkedin.com/in/nikhilravindra



On Sun, Apr 28, 2019 at 6:05 AM Erik McCormick <emccorm...@cirrusseven.com>
wrote:

> On Sat, Apr 27, 2019, 3:49 PM Nikhil R <nikh.ravin...@gmail.com> wrote:
>
>> We have baremetal nodes with 256GB RAM and 36-core CPUs.
>> We are on Ceph Jewel 10.2.9 with leveldb.
>> The OSDs and journals are on the same HDDs.
>> We have 1 backfill_max_active, 1 recovery_max_active and 1
>> recovery_op_priority
>> The OSD crashes and restarts once a PG is backfilled and the next PG tries
>> to backfill. At that point iostat shows the disk utilised up to 100%.
>>
>
> I would set noout to prevent excess movement in the event of OSD flapping,
> and disable scrubbing and deep scrubbing until your backfilling has
> completed. I would also bring the new OSDs online a few at a time rather
> than all 25 at once if you add more servers.
>
>
>> Appreciate your help, David
>>
>> On Sun, 28 Apr 2019 at 00:46, David C <dcsysengin...@gmail.com> wrote:
>>
>>>
>>>
>>> On Sat, 27 Apr 2019, 18:50 Nikhil R, <nikh.ravin...@gmail.com> wrote:
>>>
>>>> Guys,
>>>> We now have a total of 105 OSDs across 5 baremetal nodes, each hosting 21
>>>> OSDs on 7TB HDDs with the journals on the same HDDs. Each journal is about
>>>> 5GB.
>>>>
>>>
>>> This would imply you've got a separate HDD partition for the journals. I
>>> don't think there's any value in that, and it would probably be detrimental
>>> to performance.
>>>
>>>>
>>>> We expanded our cluster last week and added 1 more node with 21 HDDs and
>>>> journals on the same disks.
>>>> Our client I/O is too heavy and we are not able to run even 1 backfill
>>>> thread during peak hours - if we backfill during peak hours, OSDs crash,
>>>> causing undersized PGs, and if another OSD crashes we won't be able to use
>>>> our cluster due to undersized and recovering PGs. During non-peak hours we
>>>> can backfill just 8-10 PGs.
>>>> Due to this, our MAX AVAIL is draining very fast.
>>>>
>>>
>>> How much RAM have you got in your nodes? In my experience, too little RAM
>>> is a common reason for OSDs crashing during recovery ops.
>>>
>>> What does your recovery and backfill tuning look like?
>>>
>>>
>>>
>>>> We are thinking of adding 2 more baremetal nodes, each with 21 x 7TB OSDs
>>>> on HDD and 50GB SSD journals for these.
>>>> We aim to backfill from the existing 105 OSDs a bit faster and expect the
>>>> backfill writes to land on these new OSDs faster.
>>>>
>>>
>>> SSD journals would certainly help; just be sure it's a model that performs
>>> well with Ceph.
>>>
>>>>
>>>> Is this a viable idea?
>>>> Thoughts, please?
>>>>
>>>
>>> I'd recommend sharing more detail, e.g. the full spec of the nodes, Ceph
>>> version, etc.
>>>
>>>>
>>>> -Nikhil
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
