I'd recommend running through these steps and posting the output as well:
http://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
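
For reference, the core of it is roughly the following (assuming the OSDs
were built against tcmalloc, the default, and that logs land in
/var/log/ceph; substitute an affected OSD id for osd.0):

  $ ceph tell osd.0 heap start_profiler
  # ...let it run while memory climbs, then:
  $ ceph tell osd.0 heap dump
  $ ceph tell osd.0 heap stats
  $ google-pprof --text /usr/bin/ceph-osd \
        /var/log/ceph/osd.0.profile.0001.heap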

Bob

On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <
peter.malo...@brockmann-consult.de> wrote:

> How many PGs do you have? And did you change any config, like mds cache
> size? Show your ceph.conf.
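> 
> For the PG count and cache settings, something along these lines should
> capture both (mds.a here is a placeholder for your MDS id; run the daemon
> command on the host where that MDS runs):
> 
>   $ ceph osd pool ls detail                        # pg_num per pool
>   $ ceph daemon mds.a config show | grep mds_cache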
>
>
> On 04/15/17 07:34, Aaron Ten Clay wrote:
>
> Hi all,
>
> Our cluster is experiencing a very odd issue, and I'm hoping for guidance
> on troubleshooting steps and/or ways to mitigate it.
> tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are
> eventually nuked by oom_killer.
>
> I'll try to explain the situation in detail:
>
> We have twenty-four 4TB BlueStore HDD OSDs and four 600GB SSD OSDs. The SSD OSDs are
> in a different CRUSH "root", used as a cache tier for the main storage
> pools, which are erasure coded and used for cephfs. The OSDs are spread
> across two identical machines with 128GiB of RAM each, and there are three
> monitor nodes on different hardware.
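> 
> (That works out to 14 OSDs per machine, so each OSD gets at most roughly
> 9GiB of RAM, i.e. 128GiB / 14.)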
>
> Several times we've encountered crippling bugs with previous Ceph releases
> when we were on RC or betas, or using non-recommended configurations, so in
> January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
> and went with stable Kraken 11.2.0 with the configuration mentioned above.
> Everything was fine until the end of March, when one day we found all but a
> couple of OSDs "down", inexplicably. Investigation revealed that oom_killer
> had come along and nuked almost all the ceph-osd processes.
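> 
> (We confirmed this from the kernel log with something along the lines of
> 
>   $ dmesg -T | egrep -i 'out of memory|killed process'
> 
> which named ceph-osd over and over.)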
>
> We've gone through a bunch of iterations of restarting the OSDs: bringing
> them up gradually one at a time, all at once, and with various configuration
> settings to reduce cache size as suggested in this ticket:
> http://tracker.ceph.com/issues/18924
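> 
> One round, for example, looked roughly like this in ceph.conf (treat the
> option name as my assumption rather than gospel; "ceph daemon osd.N config
> show" lists the names that actually exist in 11.2.0):
> 
>   [osd]
>   bluestore_cache_size = 536870912    # bytes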
>
> I don't know whether that ticket really pertains to our situation or not;
> I have no experience with memory allocation debugging. I'd be willing to try
> if someone can point me to a guide or walk me through the process.
>
> Just to see whether the situation was transitory, I even tried adding over
> 300GiB of swap to both OSD machines. The OSD processes managed to allocate
> more than 300GiB of memory in a matter of 5-10 minutes and became oom_killer
> victims once again.
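> 
> (If anyone wants to reproduce that step: a plain swapfile per node is
> enough, e.g.
> 
>   $ fallocate -l 300G /swapfile
>   $ chmod 600 /swapfile
>   $ mkswap /swapfile && swapon /swapfile
> 
> though the exact mechanism shouldn't matter.)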
>
> No software or hardware changes took place around the time this problem
> started, and no significant data changes occurred either. We added about
> 40GiB of ~1GiB files a week or so before the problem started and that's the
> last time data was written.
>
> I can only assume we've found another crippling bug of some kind; this
> level of memory usage is entirely unprecedented. What can we do?
>
> Thanks in advance for any suggestions.
> -Aaron
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
