Hi Brad,

We fully understand that the hardware we currently use is below Ceph's 
recommendation, so we are looking for a way to lower or restrict the 
resources the OSDs need. Losing some performance is definitely acceptable 
to us.

The reason we ran these experiments and are discussing the causes is that we 
want to identify the factors that actually drive memory usage. I think this is 
beneficial for the Ceph community, and it will help us convince our customers 
and other Ceph users of the feasibility and stability of running Ceph on 
different hardware infrastructure in production.

With your comments, we now have more confidence in our understanding of Ceph 
OSD memory consumption.

We hope there are still some methods or workarounds to bound the memory 
consumption (tuning configs?); otherwise we will simply follow the recommendations 
on the website. (Also, should we treat 1GB of RAM per 1TB of storage as the 
maximum requirement, or just as enough under normal circumstances?)
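
For reference, the knobs we are considering experimenting with first look roughly 
like this (only a sketch: the values are placeholders we still have to validate, 
not recommendations, and the osdmap-related setting is the kind of thing the Cern 
test Nick mentioned covers, if we read it right):

[osd]
# cache fewer old osdmaps per daemon (jewel default is 200)
osd map cache size = 50
# throttle concurrent recovery/backfill work per OSD
osd max backfills = 1
osd recovery max active = 1

or, for a quick runtime experiment, something like
"ceph tell osd.* injectargs '--osd_map_cache_size 50'" (some of these may only 
take full effect after an OSD restart).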

Thank you very much.

Sincerely,
Craig Chi

On 2016-11-29 10:27, Brad Hubbard <bhubb...@redhat.com> wrote:
>   
>   
> On Tue, Nov 29, 2016 at 3:12 AM, Craig Chi <craig...@synology.com> wrote:
> > Hi guys,
> >   
> > Thanks to both of your suggestions, we had some progression on this issue.
> >   
> > I tuned vm.min_free_kbytes to 16GB and raised vm.vfs_cache_pressure to 200, 
> > and I did observe that the OS kept releasing cache while the OSDs wanted more 
> > and more memory.
>   
> vfs_cache_pressure is a percentage so values > 100 have always seemed odd to me.
> >   
> > OK. Now we are going to reproduce the hanging issue.
> >   
> > 1. set the cluster with noup flag
> > 2. restart all ceph-osd process (then we can see all OSDs are down from 
> > ceph monitor)
> > 3. unset noup flag
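> >   
> > (Roughly the commands we used for these steps; ceph-osd is managed by systemd 
> > on our Ubuntu 16.04 nodes, so the unit name may differ on other setups:
> >   
> > ceph osd set noup
> > systemctl restart ceph-osd.target     # on every storage node
> > ceph osd unset noup
> > )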
> >   
> > As expected the OSDs started to consume memory, and eventually the kernel 
> > still hung without response.
> >   
> > Therefore I learned to gather the vmcore and tried to investigate further 
> > as Brad advised.
> >   
> > The vmcore dump file was unbelievably huge -- about 6 GB per dump. However, 
> > it helped us quickly find the following abnormal things:
> >   
> > 1. The memory was exhausted as expected.
> >   
> > crash> kmem -i
> >                  PAGES        TOTAL      PERCENTAGE
> >     TOTAL MEM  63322527     241.6 GB         ----
> >          FREE    676446       2.6 GB    1% of TOTAL MEM
> >          USED  62646081       239 GB   98% of TOTAL MEM
> >        SHARED    621336       2.4 GB    0% of TOTAL MEM
> >       BUFFERS     47307     184.8 MB    0% of TOTAL MEM
> >        CACHED    376205       1.4 GB    0% of TOTAL MEM
> >          SLAB    455400       1.7 GB    0% of TOTAL MEM
> >   
> >    TOTAL SWAP   4887039      18.6 GB         ----
> >     SWAP USED   3855938      14.7 GB   78% of TOTAL SWAP
> >     SWAP FREE   1031101       3.9 GB   21% of TOTAL SWAP
> >   
> >  COMMIT LIMIT  36548302     139.4 GB         ----
> >     COMMITTED  92434847     352.6 GB  252% of TOTAL LIMIT
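> >   
> > (For reference, we loaded the dump with the crash utility, roughly like the 
> > following; the vmcore path is just a placeholder for whatever kdump wrote on 
> > our Ubuntu nodes:
> >   
> > crash /usr/lib/debug/boot/vmlinux-4.4.0-31-generic /var/crash/<timestamp>/dump.<timestamp>
> > )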
>   
> As Nick already mentioned, 90 x 8TB disks is 720 TB of storage and, according 
> to http://docs.ceph.com/docs/jewel/start/hardware-recommendations/#ram, during 
> recovery you may require ~1GB per 1TB of storage per daemon.
> >   
> >   
> > 2. Each OSD used a lot of memory. (We have only 256 GB of RAM in total, but 
> > there are 90 OSDs in a node)
> >   
> > # Find the 10 processes with the largest memory consumption
> > crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print}' | tail -10
> > 100864 1 12 ffff883a43e1b700 IN 1.1 7484884 2973.33 MB ceph-osd
> > 87400 1 27 ffff8838538ae040 IN 1.1 7557500 3036.92 MB ceph-osd
> > 108126 1 22 ffff882bcca91b80 IN 1.2 7273068 3045.8 MB ceph-osd
> > 39787 1 28 ffff883f468ab700 IN 1.2 7300756 3067.88 MB ceph-osd
> > 44861 1 20 ffff883cf9250000 IN 1.2 7327496 3067.89 MB ceph-osd
> > 30486 1 23 ffff883f59e1c4c0 IN 1.2 7332828 3083.58 MB ceph-osd
> > 125239 1 15 ffff882687018000 IN 1.2 6965560 3103.36 MB ceph-osd
> > 123807 1 19 ffff88275d90ee00 IN 1.2 7314484 3173.48 MB ceph-osd
> > 116445 1 1 ffff882863926e00 IN 1.2 7279040 3269.09 MB ceph-osd
> > 94442 1 0 ffff882ed2d01b80 IN 1.3 7566148 3418.69 MB ceph-osd
>   
> Based on the information above this is not excessive memory usage AFAICS.
> >   
> >   
> > 3. An excessive number of messenger threads.
> >   
> > crash> ps | grep ms_pipe_read | wc -l
> > 144112
> > crash> ps | grep ms_pipe_write | wc -l
> > 146692
> >   
> > In total that is nearly 290k threads in ms_pipe_*.
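> >   
> > (Sanity check of the numbers: 144112 + 146692 = 290,804 threads, i.e. roughly 
> > 145k simple messenger pipes on this node, which is the same order of magnitude 
> > as the connection counts we saw from netstat.)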
> >   
> >   
> > 4. After several tries, we luckily got some memory profiles before the OOM 
> > killer started to work.
> >   
> > # Parse the smaps of a ceph-osd process with parse_smaps.py 
> > (https://github.com/craig08/parse_smaps)
> >   
> > root@ceph2:~# ./parse_smaps.py /proc/198557/smaps
> > ===============================================================================
> >  Private     Private      Shared     Shared
> >    Clean   +    Dirty   +    Clean  +   Dirty   =     Total : library
> > ===============================================================================
> > 16660 kB + 5804548 kB +0 kB +0 kB = 5821208 kB : [heap]
> > 40 kB +92 kB +7640 kB +0 kB =7772 kB : ceph-osd
> > 56 kB +2472 kB +0 kB +0 kB =2528 kB : [anonymous]
> > 2084 kB +0 kB +0 kB +0 kB =2084 kB : 007656.ldb
> > 2080 kB +0 kB +0 kB +0 kB =2080 kB : 007657.ldb
> > 2080 kB +0 kB +0 kB +0 kB =2080 kB : 007653.ldb
> > 2080 kB +0 kB +0 kB +0 kB =2080 kB : 007658.ldb
> > 2080 kB +0 kB +0 kB +0 kB =2080 kB : 011125.ldb
> > 2076 kB +0 kB +0 kB +0 kB =2076 kB : 009607.ldb
> > 2072 kB +0 kB +0 kB +0 kB =2072 kB : 011127.ldb
> > 2072 kB +0 kB +0 kB +0 kB =2072 kB : 011126.ldb
> > 0 kB +24 kB +1636 kB +0 kB =1660 kB : libc-2.23.so
> > 0 kB +0 kB +1060 kB +0 kB =1060 kB : libec_lrc.so
> > 4 kB +28 kB +1024 kB +0 kB =1056 kB : libstdc++.so.6.0.21
> > 996 kB +0 kB +0 kB +0 kB =996 kB : 011168.ldb
> > 908 kB +0 kB +0 kB +0 kB =908 kB : 009608.ldb
> > 840 kB +0 kB +0 kB +0 kB =840 kB : 007648.ldb
> > 0 kB +0 kB +812 kB +0 kB =812 kB : libcls_rgw.so
> > 0 kB +0 kB +716 kB +0 kB =716 kB : libcls_refcount.so
> > 684 kB +0 kB +0 kB +0 kB =684 kB : 011128.ldb
> > 0 kB +0 kB +552 kB +0 kB =552 kB : libm-2.23.so
> > 0 kB +0 kB +472 kB +0 kB =472 kB : libec_jerasure_sse4.so
> > 4 kB +0 kB +372 kB +0 kB =376 kB : libfreebl3.so
> > 8 kB +4 kB +356 kB +0 kB =368 kB : libnss3.so
> > 0 kB +12 kB +352 kB +0 kB =364 kB : libleveldb.so.1.18
> > 0 kB +0 kB +296 kB +0 kB =296 kB : libec_jerasure.so
> > 4 kB +4 kB +224 kB +0 kB =232 kB : libsoftokn3.so
> > 8 kB +0 kB +208 kB +0 kB =216 kB : libnspr4.so
> > 8 kB +0 kB +196 kB +0 kB =204 kB : libec_isa.so
> > 0 kB +8 kB +196 kB +0 kB =204 kB : libtcmalloc.so.4.2.6
> > 0 kB +8 kB +152 kB +0 kB =160 kB : ld-2.23.so
> > 4 kB +4 kB +136 kB +0 kB =144 kB : libboost_thread.so.1.58.0
> > 4 kB +0 kB +132 kB +0 kB =136 kB : libnssutil3.so
> > ......
> > ===============================================================================
> > 37736 kB + 5807224 kB + 18128 kB +0 kB = 5863088 kB : Total
> >   
> >   
> > 5. Heap profiler by Ceph.
> >   
> > root@ceph2:~# ceph tell osd.163 heap stats
> > osd.163 tcmalloc heap stats:------------------------------------------------
> > MALLOC:     5861094560 ( 5589.6 MiB) Bytes in use by application
> > MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
> > MALLOC: +     38945176 (   37.1 MiB) Bytes in central cache freelist
> > MALLOC: +     13279168 (   12.7 MiB) Bytes in transfer cache freelist
> > MALLOC: +     96438792 (   92.0 MiB) Bytes in thread cache freelists
> > MALLOC: +     25817248 (   24.6 MiB) Bytes in malloc metadata
> > MALLOC:   ------------
> > MALLOC: =   6035574944 ( 5756.0 MiB) Actual memory used (physical + swap)
> > MALLOC: +     35741696 (   34.1 MiB) Bytes released to OS (aka unmapped)
> > MALLOC:   ------------
> > MALLOC: =   6071316640 ( 5790.1 MiB) Virtual address space used
> > MALLOC:
> > MALLOC:         357627 Spans in use
> > MALLOC:             89 Thread heaps in use
> > MALLOC:           8192 Tcmalloc page size
> > ------------------------------------------------
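> >   
> > (Side note: tcmalloc can also be asked to hand its freed pages back to the OS 
> > with "ceph tell osd.163 heap release"; that does not bound the usage, but it 
> > is cheap to try.)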
> >   
> >   
> > 6. google-pprof the heap dump
> > Total: 1916.6 MB
> >  1036.9  54.1%  54.1%   1036.9  54.1% ceph::buffer::create_aligned
> >   313.9  16.4%  70.5%    313.9  16.4% ceph::buffer::list::append@a78c00
> >   220.0  11.5%  82.0%    220.0  11.5% std::_Rb_tree::_M_emplace_hint_unique
> >   130.0   6.8%  88.7%    130.0   6.8% leveldb::ReadBlock
> >   129.8   6.8%  95.5%    129.8   6.8% std::vector::_M_default_append
> >    22.1   1.2%  96.7%     53.4   2.8% PG::add_log_entry
> >     7.4   0.4%  97.1%      7.4   0.4% ceph::buffer::list::crc32c
> >     7.0   0.4%  97.4%      7.0   0.4% ceph::log::Log::create_entry
> >     5.1   0.3%  97.7%      5.1   0.3% OSD::get_tracked_conf_keys
> >     4.7   0.2%  97.9%      4.8   0.2% Pipe::Pipe
> >     3.3   0.2%  98.1%      9.0   0.5% decode_message
> >     3.2   0.2%  98.3%      6.7   0.3% SimpleMessenger::add_accept_pipe
> >     3.2   0.2%  98.4%      4.5   0.2% OSD::_make_pg
> > ...
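> >   
> > (The dump itself came from the built-in tcmalloc profiler, roughly as follows; 
> > the profile filename is the default dump location under /var/log/ceph and may 
> > differ:
> >   
> > ceph tell osd.163 heap start_profiler
> > ceph tell osd.163 heap dump
> > ceph tell osd.163 heap stop_profiler
> > google-pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.163.profile.0001.heap
> > )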
> >   
> >   
> > We have some hypotheses after discussion:
> >   
> > 1. We observed that the number of connections (counted by `netstat -ant | 
> > grep ESTABLISHED | wc -l`) rises rapidly along with the average memory used 
> > by ceph-osd, especially the [heap] section.
> > >Is there any relation between memory usage and the number of network 
> > >connections?
>   
> Yes.
> >   
> > 2. After unsetting the noup flag, the number of connections bursts to over 
> > 200k in a few seconds.
> > >We have an EC pool created with k=17, m=3. Is the large combination of 
> > >(k,m) responsible for these connections?
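> > (A rough estimate on our side, not a measurement: with k=17, m=3 every PG spans 
> > 20 OSDs, so a single OSD's ~300 PGs can easily bring it into contact with most 
> > of the 630 OSDs in the cluster; each peer costs at least one connection, plus 
> > the separate heartbeat connections, so a few thousand connections per OSD, i.e. 
> > a couple of hundred thousand per 90-OSD node, does not look impossible to us.)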
>   
> Possibly a contributing factor. Someone who knows more on the EC side may be 
> able to comment.
> > >We have an average of 300 PGs per OSD in the crash experiment. Is the high 
> > >PG count per OSD responsible for these connections?
>   
> Almost certainly a contributing factor.
> >   
> > 3. With simple messenger, two messenger threads are forked for each network 
> > connection.
>   
> Correct, one reader, one writer.
> > >We think 290k messenger threads at the same time can hardly work normally 
> > >and efficiently.
> > >Will it be better with async messenger? We tried async messenger and saw the 
> > >thread count decrease, but the number of network connections stayed high and 
> > >the kernel hang issue continued.
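> > (For completeness, we enabled it with roughly the following in the [global] 
> > section and an OSD restart -- please correct us if that is not the right knob 
> > on jewel:
> >   
> > ms type = async
> > )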
>   
> You should see an improvement with async messenger in the number of threads 
> used.
> >   
> >   
> > Now we are still struggling with this problem.
> >   
> > Please kindly instruct us if you have any directions.
> >   
> > Sincerely,
> > Craig Chi
> >   
> > On 2016-11-25 21:26, Nick Fisk <n...@fisk.me.uk> wrote:
> > >   
> > > Hi,
> > >   
> > > I didn’t do the maths, so maybe 7GB isn’t worth tuning for, although 
> > > every little helps ;-)
> > >   
> > > I don’t believe peering or recovery should affect this value, but other 
> > > things will consume memory during recovery, and I’m not aware whether this 
> > > can be limited or tuned.
> > >   
> > > Yes, the write and read caches will consume memory and may limit Linux’s 
> > > ability to react quickly enough in tight memory conditions. I believe you 
> > > can be in a state where it looks like you have more memory potentially 
> > > available than is actually usable at that point in time. 
> > > vm.min_free_kbytes can help here.
> > >   
> > > From: Craig Chi [mailto:craig...@synology.com]
> > > Sent: 25 November 2016 01:46
> > > To: Brad Hubbard <bhubb...@redhat.com>
> > > Cc: Nick Fisk <n...@fisk.me.uk>; Ceph Users <ceph-users@lists.ceph.com>
> > > Subject: Re: [ceph-users] Ceph OSDs cause kernel unresponsive
> > >   
> > > Hi Nick,
> > >   
> > > I have seen that report before. If I understand correctly, 
> > > osd_map_cache_size generally introduces a fixed amount of memory usage. 
> > > We are using the default value of 200, and a single osd map I got from 
> > > our cluster is 404KB.
> > >   
> > > That is totally 404KB * 200 * 90 (osds) = about 7GB on each node.
> > >   
> > > Will the memory consumed by this factor grow larger during unstable peering 
> > > or recovery? If not, we still need to find the root cause of why free 
> > > memory drops uncontrollably.
> > >   
> > > Does anyone know what the relation is between the filestore or journal 
> > > configurations and an OSD's memory consumption? Is it possible that the 
> > > filestore queue or journal queue occupies a huge number of memory pages and 
> > > makes the filesystem cache hard to release (resulting in OOM)?
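> > > (Rough arithmetic on our side, not a measurement: several of these queue 
> > > limits are set to 1048576000 bytes, i.e. about 1 GB each, so if even one 
> > > such throttle filled up on each of the 90 OSDs in a node, that alone would 
> > > be on the order of 90 GB.)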
> > >   
> > > Lastly, about nobarrier: I fully understand the consequences and am testing 
> > > this option carefully. I sincerely appreciate your kindness and useful 
> > > suggestions.
> > >   
> > > Sincerely,
> > > Craig Chi
> > >   
> > > On 2016-11-25 07:23, Brad Hubbard <bhubb...@redhat.com> wrote:
> > >   
> > > > Two of these appear to be hung task timeouts and the other is an 
> > > > invalid opcode.
> > > >   
> > > >   
> > > >   
> > > > There is no evidence here of memory exhaustion (although it remains to 
> > > > be seen whether this is a factor but I'd expect to see evidence of 
> > > > shrinker activity in the stacks) and I would speculate the increased 
> > > > memory utilisation is due to the issues with the OSD tasks.
> > > >   
> > > >   
> > > >   
> > > > I would suggest that the next step here is to work out specifically why 
> > > > the invalid opcode happened and/or why kernel tasks are hanging for >120 
> > > > seconds.
> > > >   
> > > >   
> > > >   
> > > > To do that you may need to capture a vmcore and analyse it and/or 
> > > > engage your kernel support team to investigate further.
> > > >   
> > > > On Fri, Nov 25, 2016 at 8:26 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> > > >   
> > > >   
> > > >   
> > > > >   
> > > > > There’s a couple of things you can do to reduce memory usage by 
> > > > > limiting the number of OSD maps each OSD stores, but you will still 
> > > > > be pushing up against the limits of the ram you have available. There 
> > > > > is a Cern 30PB test (should be on google) which gives some details on 
> > > > > some of the settings, but quite a few are no longer relevant in jewel.
> > > > >   
> > > > > One other thing: I saw you have nobarrier set in your mount options. 
> > > > > Please please please understand the consequences of this option!!!!
> > > > >   
> > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Craig Chi
> > > > > Sent: 24 November 2016 10:37
> > > > > To: Nick Fisk <n...@fisk.me.uk>
> > > > > Cc: ceph-users@lists.ceph.com
> > > > > Subject: Re: [ceph-users] Ceph OSDs cause kernel unresponsive
> > > > >   
> > > > > Hi Nick,
> > > > >   
> > > > > Thank you for your helpful information.
> > > > >   
> > > > > I know that Ceph recommends 1GB of RAM per 1TB of storage, but we are 
> > > > > not going to change the hardware architecture now.
> > > > >   
> > > > >   
> > > > >   
> > > > > Are there any methods to set the resource limit one OSD can consume?
> > > > >   
> > > > > And for your question, we currently set system configuration as:
> > > > >   
> > > > > vm.swappiness=10
> > > > > kernel.pid_max=4194303
> > > > > fs.file-max=26234859
> > > > > vm.zone_reclaim_mode=0
> > > > > vm.vfs_cache_pressure=50
> > > > > vm.min_free_kbytes=4194303
> > > > >   
> > > > > I will try configuring a larger vm.min_free_kbytes and test.
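> > > > > For example something like the following in a sysctl.d drop-in (a 16 GB 
> > > > > reserve; just a sketch we have not validated, and the file name is 
> > > > > arbitrary):
> > > > >   
> > > > > # /etc/sysctl.d/90-ceph-memory.conf
> > > > > vm.min_free_kbytes = 16777216
> > > > >   
> > > > > applied with "sysctl --system".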
> > > > >   
> > > > >   
> > > > >   
> > > > > I will be grateful if anyone has experience with tuning these values 
> > > > > for Ceph.
> > > > >   
> > > > > Sincerely,
> > > > > Craig Chi
> > > > >   
> > > > > On 2016-11-24 17:48, Nick Fisk <n...@fisk.me.uk> wrote:
> > > > >   
> > > > > > Hi Craig,
> > > > > >   
> > > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Craig Chi
> > > > > > Sent: 24 November 2016 08:34
> > > > > > To: ceph-users@lists.ceph.com
> > > > > > Subject: [ceph-users] Ceph OSDs cause kernel unresponsive
> > > > > >   
> > > > > > Hi Cephers,
> > > > > >   
> > > > > > We have encountered a kernel hanging issue on our Ceph cluster, just 
> > > > > > like http://imgur.com/a/U2Flz, http://imgur.com/a/lyEko or http://imgur.com/a/IGXdu.
> > > > > >   
> > > > > > We believe it is caused by running out of memory, because we observed 
> > > > > > that when the OSDs went crazy, the available memory of each node was 
> > > > > > decreasing rapidly (from 50% available to lower than 10%). Then the 
> > > > > > node running the Ceph OSDs became unresponsive, with the console 
> > > > > > showing hung_task_timeout or slab_out_of_memory, etc. The only thing 
> > > > > > we could do then was hard reset the unit.
> > > > > >   
> > > > > > It is hard to predict when the kernel hang will happen. In my past 
> > > > > > experience, it usually happened after a long-term benchmark 
> > > > > > procedure, followed by a manual trigger like 1) rebooting a node, 
> > > > > > 2) restarting all OSDs, or 3) modifying the CRUSH map.
> > > > > >   
> > > > > > Currently the cluster is back to normal, but we want to figure out 
> > > > > > the root cause to avoid it happening again. We think the high values 
> > > > > > in our ceph.conf are pretty suspicious, but without tracing the code 
> > > > > > it is hard for us to gauge the impact of these values on memory 
> > > > > > consumption.
> > > > > >   
> > > > > > Many thanks if you have any suggestions.
> > > > > >   
> > > > > > I think you are probably running out of memory. 90 x 8TB disks is 
> > > > > > 720 TB of storage, which will need a lot of RAM to run, and the fact 
> > > > > > that the problems occur when PGs start moving around after a node 
> > > > > > failure also suggests this.
> > > > > >   
> > > > > > Have you adjusted your vm.vfs_cache_pressure?
> > > > > >   
> > > > > > You might also want to try setting vm.min_free_kbytes to 8-16GB to 
> > > > > > try and keep some memory free and avoid fragmentation.
> > > > > >   
> > > > > > =================================================================================
> > > > > >   
> > > > > >   
> > > > > >   
> > > > > >   
> > > > > > Following is our ceph cluster architecture:
> > > > > >   
> > > > > > OS: Ubuntu 16.04.1 LTS (4.4.0-31-generic #50-Ubuntu x86_64 
> > > > > > GNU/Linux)
> > > > > > Ceph: Jewel 10.2.3
> > > > > >   
> > > > > > 3 Ceph Monitors running on 3 dedicated machines
> > > > > > 630 Ceph OSDs running on 7 storage machines (each machine has 256GB 
> > > > > > RAM and 90 units of 8TB hard drives)
> > > > > >   
> > > > > > There are 4 pools with the following settings:
> > > > > > vms      512 pg x 3 replica
> > > > > > images   512 pg x 3 replica
> > > > > > volumes 8192 pg x 3 replica
> > > > > > objects 4096 pg x (17,3) erasure code profile
> > > > > >   
> > > > > > ==> average 173.92 pgs per OSD
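> > > > > > (i.e. (512*3 + 512*3 + 8192*3 + 4096*(17+3)) / 630 OSDs = 109568 / 630 
> > > > > > = ~173.9 PG instances per OSD)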
> > > > > >   
> > > > > > We tuned our ceph.conf by referencing many performance tuning 
> > > > > > resources online (mainly from slide 38 of https://goo.gl/Idkh41).
> > > > > >   
> > > > > > [global]
> > > > > > osd pool default pg num = 4096
> > > > > > osd pool default pgp num = 4096
> > > > > > err to syslog = true
> > > > > > log to syslog = true
> > > > > > osd pool default size = 3
> > > > > > max open files = 131072
> > > > > > fsid = 1c33bf75-e080-4a70-9fd8-860ff216f595
> > > > > > osd crush chooseleaf type = 1
> > > > > >   
> > > > > > [mon.mon1]
> > > > > > host = mon1
> > > > > > mon addr = 172.20.1.2
> > > > > >   
> > > > > > [mon.mon2]
> > > > > > host = mon2
> > > > > > mon addr = 172.20.1.3
> > > > > >   
> > > > > > [mon.mon3]
> > > > > > host = mon3
> > > > > > mon addr = 172.20.1.4
> > > > > >   
> > > > > > [mon]
> > > > > > mon osd full ratio = 0.85
> > > > > > mon osd nearfull ratio = 0.7
> > > > > > mon osd down out interval = 600
> > > > > > mon osd down out subtree limit = host
> > > > > > mon allow pool delete = true
> > > > > > mon compact on start = true
> > > > > >   
> > > > > > [osd]
> > > > > > public_network = 172.20.3.1/21
> > > > > > cluster_network = 172.24.0.1/24
> > > > > > osd disk threads = 4
> > > > > > osd mount options xfs = 
> > > > > > rw,noexec,nodev,noatime,nodiratime,nobarrier,inode64,logbsize=256k
> > > > > > osd crush update on start = false
> > > > > > osd op threads = 20
> > > > > > osd mkfs options xfs = -f -i size=2048
> > > > > > osd max write size = 512
> > > > > > osd mkfs type = xfs
> > > > > > osd journal size = 5120
> > > > > > filestore max inline xattrs = 6
> > > > > > filestore queue committing max bytes = 1048576000
> > > > > > filestore queue committing max ops = 5000
> > > > > > filestore queue max bytes = 1048576000
> > > > > > filestore op threads = 32
> > > > > > filestore max inline xattr size = 254
> > > > > > filestore max sync interval = 15
> > > > > > filestore min sync interval = 10
> > > > > > journal max write bytes = 1048576000
> > > > > > journal max write entries = 1000
> > > > > > journal queue max ops = 3000
> > > > > > journal queue max bytes = 1048576000
> > > > > > ms dispatch throttle bytes = 1048576000
> > > > > >   
> > > > > > Sincerely,
> > > > > > Craig Chi
> > > > > >   
> > > > >   
> > > >   
> > > > --
> > > >   
> > > >   
> > > > Cheers,
> > > > Brad
>   
> --
> Cheers,
> Brad

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
