Two of these appear to be hung task timeouts and the other is an invalid
opcode.

There is no evidence here of memory exhaustion (it remains to be seen whether
that is a factor, but if it were I'd expect to see shrinker activity in the
stacks), and I would speculate that the increased memory utilisation is a
consequence of the issues with the OSD tasks.

I would suggest that the next step here is to work out specifically why the
invalid opcode happened and/or why kernel tasks are hanging for > 120
seconds.

To do that, you may need to capture a vmcore and analyse it, and/or engage
your kernel support team to investigate further.
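
If it helps, a minimal sketch of what that capture might look like on Ubuntu
16.04 (the package names, the hung_task_panic toggle and the dump path here
are assumptions to adapt to your environment, not a definitive procedure):

# install kexec/kdump tooling so a vmcore is written on the next crash
sudo apt-get install linux-crashdump crash

# optionally make a hung task trigger a panic so kdump captures the state
echo 1 | sudo tee /proc/sys/kernel/hung_task_panic

# after the next incident, open the dump with the matching debug vmlinux
# (needs the corresponding -dbgsym kernel package installed)
crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/<timestamp>/dump.<n>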


On Fri, Nov 25, 2016 at 8:26 AM, Nick Fisk <n...@fisk.me.uk> wrote:

> There are a couple of things you can do to reduce memory usage by limiting
> the number of OSD maps each OSD stores, but you will still be pushing up
> against the limits of the RAM you have available. There is a CERN 30PB test
> (should be on Google) which gives some details on some of the settings, but
> quite a few are no longer relevant in Jewel.
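>
> For illustration only, a rough sketch of the sort of [osd] settings that
> discussion revolves around (option names from the Jewel-era set; the values
> here are placeholders, not recommendations):
>
> [osd]
> osd map cache size = 50
> osd map max advance = 25
> osd map share max epochs = 25
> osd pg epoch persisted max stale = 25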
>
>
>
> One other thing: I saw you have nobarrier set in your mount options. Please
> please please understand the consequences of this option!!!!
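>
> (For reference, if you do decide to drop it, the equivalent of the mount
> options further down without nobarrier would simply be:
> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k
> and write barriers then stay enabled by default on XFS.)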
>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Craig Chi
> Sent: 24 November 2016 10:37
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph OSDs cause kernel unresponsive
>
>
>
> Hi Nick,
>
>
>
> Thank you for your helpful information.
>
>
>
> I know that Ceph recommends 1 GB of RAM per 1 TB of storage, but we are not
> going to change the hardware architecture now.
>
> Are there any methods to limit the resources a single OSD can consume?
>
>
>
> As for your question, our current system configuration is:
>
>
>
> vm.swappiness=10
> kernel.pid_max=4194303
> fs.file-max=26234859
> vm.zone_reclaim_mode=0
> vm.vfs_cache_pressure=50
> vm.min_free_kbytes=4194303
>
>
>
> I will try configuring a larger vm.min_free_kbytes and test.
>
> I would be grateful if anyone can share experience with tuning these
> values for Ceph.
>
>
>
> Sincerely,
> Craig Chi
>
>
>
> On 2016-11-24 17:48, Nick Fisk <n...@fisk.me.uk> wrote:
>
> Hi Craig,
>
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Craig Chi
> Sent: 24 November 2016 08:34
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph OSDs cause kernel unresponsive
>
>
>
> Hi Cephers,
>
> We have encountered a kernel hanging issue on our Ceph cluster, as shown in
> http://imgur.com/a/U2Flz , http://imgur.com/a/lyEko and
> http://imgur.com/a/IGXdu .
>
> We believe it is caused by running out of memory, because we observed that
> when the OSDs went crazy, the available memory on each node decreased
> rapidly (from 50% available to lower than 10%). The nodes running Ceph OSDs
> then became unresponsive, with the console showing hung_task_timeout,
> slab_out_of_memory, etc. The only thing we could do at that point was hard
> reset the unit.
>
> It is hard to predict when the kernel hanging issue will happen. In my
> past experience, it usually happened after a long-term benchmark run,
> followed by a manual trigger such as 1) rebooting a node, 2) restarting all
> OSDs, or 3) modifying the CRUSH map.
>
> Currently the cluster is back to normal, but we want to figure out the
> root cause to avoid it happening again. We think the high values in our
> ceph.conf are pretty suspicious, but without tracing the code it is hard
> for us to gauge the impact of these values on memory consumption.
>
> Many thanks if you have any suggestions.
>
>
>
> I think you are probably running out of memory. 90 x 8TB disks is 720TB of
> storage per node, which will need a lot of RAM to run, and the fact that
> the problems occur when PGs start moving around after a node failure also
> suggests this.
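>
> As a rough sanity check against the often-quoted ~1GB of RAM per 1TB of OSD
> storage: 90 x 8TB = 720TB per node, which would suggest on the order of
> 700GB of RAM, versus the 256GB each node actually has, and that is before
> counting recovery/backfill overhead.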
>
>
>
> Have you adjusted your vm.vfs_cache_pressure?
>
>
>
> You might also want to try setting vm.min_free_kbytes to 8-16GB to try and
> keep some memory free and avoid fragmentation.
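>
> A minimal sketch of what that could look like (the file name is just an
> example; 8388608 kB = 8GB):
>
> # /etc/sysctl.d/99-ceph-memory.conf
> vm.min_free_kbytes = 8388608
>
> # apply without rebooting
> sysctl -p /etc/sysctl.d/99-ceph-memory.conf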
>
>
>
>
> ======================================================================
>
>
> The following is our Ceph cluster architecture:
>
> OS: Ubuntu 16.04.1 LTS (4.4.0-31-generic #50-Ubuntu x86_64 GNU/Linux)
> Ceph: Jewel 10.2.3
>
> 3 Ceph Monitors running on 3 dedicated machines
> 630 Ceph OSDs running on 7 storage machines (each machine has 256GB RAM
> and 90 units of 8TB hard drives)
>
> There are 4 pools with following settings:
> vms     512  pg x 3 replica
> images  512  pg x 3 replica
> volumes 8192 pg x 3 replica
> objects 4096 pg x (17,3) erasure code profile
>
> ==> average 173.92 pgs per OSD
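>
> (That average comes from the pool sizes above:
> (512*3 + 512*3 + 8192*3 + 4096*(17+3)) / 630 = 109568 / 630 ≈ 173.92 PG
> copies per OSD.)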
>
> We tuned our ceph.conf by referencing many performance tuning resources
> online (mainly from slide 38 of https://goo.gl/Idkh41 )
>
> [global]
> osd pool default pg num = 4096
> osd pool default pgp num = 4096
> err to syslog = true
> log to syslog = true
> osd pool default size = 3
> max open files = 131072
> fsid = 1c33bf75-e080-4a70-9fd8-860ff216f595
> osd crush chooseleaf type = 1
>
> [mon.mon1]
> host = mon1
> mon addr = 172.20.1.2
>
> [mon.mon2]
> host = mon2
> mon addr = 172.20.1.3
>
> [mon.mon3]
> host = mon3
> mon addr = 172.20.1.4
>
> [mon]
> mon osd full ratio = 0.85
> mon osd nearfull ratio = 0.7
> mon osd down out interval = 600
> mon osd down out subtree limit = host
> mon allow pool delete = true
> mon compact on start = true
>
> [osd]
> public_network = 172.20.3.1/21
> cluster_network = 172.24.0.1/24
> osd disk threads = 4
> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier,inode64,logbsize=256k
> osd crush update on start = false
> osd op threads = 20
> osd mkfs options xfs = -f -i size=2048
> osd max write size = 512
> osd mkfs type = xfs
> osd journal size = 5120
> filestore max inline xattrs = 6
> filestore queue committing max bytes = 1048576000
> filestore queue committing max ops = 5000
> filestore queue max bytes = 1048576000
> filestore op threads = 32
> filestore max inline xattr size = 254
> filestore max sync interval = 15
> filestore min sync interval = 10
> journal max write bytes = 1048576000
> journal max write entries = 1000
> journal queue max ops = 3000
> journal queue max bytes = 1048576000
> ms dispatch throttle bytes = 1048576000
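>
> (A very rough back-of-the-envelope on those values: the filestore queue,
> filestore committing queue, journal write/queue and ms dispatch throttles
> above each permit on the order of 1GB of in-flight data per OSD. These
> limits overlap and are ceilings rather than steady-state usage, but with 90
> OSDs per node even a fraction of that headroom being consumed under heavy
> load could exceed 256GB of RAM.)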
>
>
>
> Sincerely,
> Craig Chi
>
>
>
>
>
> Sent from Synology MailPlus
>
>
>
>
>
>
>
>
>
>
>


-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
