[ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-04-15 Thread Aaron Ten Clay
Hi all,

Our cluster is experiencing a very odd issue and I'm hoping for some
guidance on troubleshooting steps and/or suggestions to mitigate the issue.
tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are
eventually nuked by oom_killer.

I'll try to explain the situation in detail:

We have 24 x 4TB bluestore HDD OSDs and 4 x 600GB SSD OSDs. The SSD OSDs are
in a different CRUSH "root", used as a cache tier for the main storage
pools, which are erasure coded and used for cephfs. The OSDs are spread
across two identical machines with 128GiB of RAM each, and there are three
monitor nodes on different hardware.

Several times we've encountered crippling bugs with previous Ceph releases
when we were on RC or betas, or using non-recommended configurations, so in
January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
and went with stable Kraken 11.2.0 with the configuration mentioned above.
Everything was fine until the end of March, when one day we found all but a
couple of OSDs "down", inexplicably. Investigation revealed that oom_killer had
come along and nuked almost all of the ceph-osd processes.
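
For anyone retracing this, the kernel log is where the evidence showed up. A
rough sketch of the checks (timestamps are illustrative; commands as found on
Ubuntu 16.04):

dmesg -T | grep -i 'killed process'
journalctl -k --since '2017-03-28' | grep -i oom
# watch resident memory per ceph-osd while bringing daemons back up
ps aux --sort=-rss | grep '[c]eph-osd' | head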

We've gone through a number of iterations of restarting the OSDs -- bringing
them up gradually one at a time, bringing them all up at once, and trying
various configuration settings to reduce cache size as suggested in this
ticket: http://tracker.ceph.com/issues/18924...
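
For reference, this is roughly the kind of tuning we tried; the option name is
the one from our later attempts and the value is purely illustrative, not a
recommendation:

# ceph.conf, [osd] section:
#   bluestore cache size = 1073741824    # value in bytes
# or injected into running daemons:
ceph tell osd.* injectargs '--bluestore_cache_size 1073741824'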

I don't know whether that ticket really pertains to our situation or not; I
have no experience with memory-allocation debugging. I'd be willing to try if
someone can point me to a guide or walk me through the process.

I've even tried, just to see if the situation was transitory, adding over
300GiB of swap to both OSD machines. The OSD processes managed to build up, in
a matter of 5-10 minutes, more than 300GiB of memory pressure and became
oom_killer victims once again.
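
For completeness, the swap was added along these lines (sizes illustrative):

dd if=/dev/zero of=/swapfile bs=1M count=307200   # ~300 GiB
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile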

No software or hardware changes took place around the time this problem
started, and no significant data changes occurred either. We added about
40GiB of ~1GiB files a week or so before the problem started and that's the
last time data was written.

I can only assume we've found another crippling bug of some kind; this level
of memory usage is entirely unprecedented. What can we do?

Thanks in advance for any suggestions.
-Aaron
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS failover

2017-04-15 Thread Gandalf Corvotempesta
Hi all,
Sorry if this question has already been asked, but I didn't find anything
related.


AFAIK the MDS is a fundamental component of CephFS.
What happens if the active MDS crashes before its changes have been replicated
to the slaves?
Are the changes made since the last replication lost?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-04-15 Thread Peter Maloney
How many PGs do you have? And did you change any config, like mds cache
size? Show your ceph.conf.
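
Something along these lines would show it (a sketch -- adjust the OSD id to
taste):

ceph -s
ceph osd pool ls detail                          # pg_num and size per pool
ceph daemon osd.0 config show | grep -E 'cache|bluestore'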

On 04/15/17 07:34, Aaron Ten Clay wrote:
> Hi all,
>
> Our cluster is experiencing a very odd issue and I'm hoping for some
> guidance on troubleshooting steps and/or suggestions to mitigate the
> issue. tl;dr: Individual ceph-osd processes try to allocate > 90GiB of
> RAM and are eventually nuked by oom_killer.
>
> I'll try to explain the situation in detail:
>
> We have 24-4TB bluestore HDD OSDs, and 4-600GB SSD OSDs. The SSD OSDs
> are in a different CRUSH "root", used as a cache tier for the main
> storage pools, which are erasure coded and used for cephfs. The OSDs
> are spread across two identical machines with 128GiB of RAM each, and
> there are three monitor nodes on different hardware.
>
> Several times we've encountered crippling bugs with previous Ceph
> releases when we were on RC or betas, or using non-recommended
> configurations, so in January we abandoned all previous Ceph usage,
> deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 with the
> configuration mentioned above. Everything was fine until the end of
> March, when one day we find all but a couple of OSDs are "down"
> inexplicably. Investigation reveals oom_killer came along and nuked
> almost all the ceph-osd processes.
>
> We've gone through a bunch of iterations of restarting the OSDs,
> trying to bring them up one at a time gradually, all at once, various
> configuration settings to reduce cache size as suggested in this
> ticket: http://tracker.ceph.com/issues/18924...
>
> I don't know if that ticket really pertains to our situation or not, I
> have no experience with memory allocation debugging. I'd be willing to
> try if someone can point me to a guide or walk me through the process.
>
> I've even tried, just to see if the situation was  transitory, adding
> over 300GiB of swap to both OSD machines. The OSD procs managed to
> allocate, in a matter of 5-10 minutes, more than 300GiB of RAM
> pressure and became oom_killer victims once again.
>
> No software or hardware changes took place around the time this
> problem started, and no significant data changes occurred either. We
> added about 40GiB of ~1GiB files a week or so before the problem
> started and that's the last time data was written.
>
> I can only assume we've found another crippling bug of some kind, this
> level of memory usage is entirely unprecedented. What can we do?
>
> Thanks in advance for any suggestions.
> -Aaron
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS failover

2017-04-15 Thread John Spray
On Sat, Apr 15, 2017 at 8:49 AM, Gandalf Corvotempesta
 wrote:
> Hi to all
> Sorry if this question was already asked but I didn't find anything related
>
>
> AFAIK MDS are a foundametal component for CephFS.
> What happens in case of MDS crash between replication from the active MDS to
> the slaves?

MDSs do not replicate to one another.  They write all metadata to a
RADOS pool (i.e. to the OSDs), and when a failover happens, the new
active MDS reads the metadata in.

> Changes made between the crash and the missing replication are lost?

Nothing is lost -- the clients will be briefly unable to do any
metadata operations while they wait for the replacement MDS to become
active.  They can continue to do data operations on files they have
open in the meantime.
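
If you want to watch this happen, a rough sketch (daemon names, rank and
standby options are illustrative -- check the docs for your release):

# ceph.conf on a second MDS host, so it tails the active MDS's journal
[mds.b]
mds standby replay = true
mds standby for rank = 0

# then, from an admin node:
ceph mds stat       # note which daemon is active
ceph mds fail 0     # fail the active rank; a standby takes over
ceph -w             # clients stall briefly on metadata ops, then resume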

John

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-04-15 Thread Bob R
I'd recommend running through these steps and posting the output as well
http://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
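
The short version of that page, using osd.0 as an example (file paths are
illustrative, and the heap profiler needs the tcmalloc build of Ceph):

ceph tell osd.0 heap start_profiler
ceph tell osd.0 heap dump
ceph tell osd.0 heap stats
ceph tell osd.0 heap stop_profiler
# analyze the dump on the OSD host (pprof is google-pprof on Ubuntu):
pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap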

Bob

On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <
peter.malo...@brockmann-consult.de> wrote:

> How many PGs do you have? And did you change any config, like mds cache
> size? Show your ceph.conf.
>
>
> On 04/15/17 07:34, Aaron Ten Clay wrote:
>
> Hi all,
>
> Our cluster is experiencing a very odd issue and I'm hoping for some
> guidance on troubleshooting steps and/or suggestions to mitigate the issue.
> tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are
> eventually nuked by oom_killer.
>
> I'll try to explain the situation in detail:
>
> We have 24-4TB bluestore HDD OSDs, and 4-600GB SSD OSDs. The SSD OSDs are
> in a different CRUSH "root", used as a cache tier for the main storage
> pools, which are erasure coded and used for cephfs. The OSDs are spread
> across two identical machines with 128GiB of RAM each, and there are three
> monitor nodes on different hardware.
>
> Several times we've encountered crippling bugs with previous Ceph releases
> when we were on RC or betas, or using non-recommended configurations, so in
> January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
> and went with stable Kraken 11.2.0 with the configuration mentioned above.
> Everything was fine until the end of March, when one day we find all but a
> couple of OSDs are "down" inexplicably. Investigation reveals oom_killer
> came along and nuked almost all the ceph-osd processes.
>
> We've gone through a bunch of iterations of restarting the OSDs, trying to
> bring them up one at a time gradually, all at once, various configuration
> settings to reduce cache size as suggested in this ticket:
> http://tracker.ceph.com/issues/18924...
>
> I don't know if that ticket really pertains to our situation or not, I
> have no experience with memory allocation debugging. I'd be willing to try
> if someone can point me to a guide or walk me through the process.
>
> I've even tried, just to see if the situation was  transitory, adding over
> 300GiB of swap to both OSD machines. The OSD procs managed to allocate, in
> a matter of 5-10 minutes, more than 300GiB of RAM pressure and became
> oom_killer victims once again.
>
> No software or hardware changes took place around the time this problem
> started, and no significant data changes occurred either. We added about
> 40GiB of ~1GiB files a week or so before the problem started and that's the
> last time data was written.
>
> I can only assume we've found another crippling bug of some kind, this
> level of memory usage is entirely unprecedented. What can we do?
>
> Thanks in advance for any suggestions.
> -Aaron
>
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-04-15 Thread Aaron Ten Clay
Peter,

There are 624 PGs across 4 pools:

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 2505 flags hashpspool stripe_width 0
removed_snaps [1~3]
pool 3 'fsdata' erasure size 14 min_size 11 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 154 lfor 153 flags hashpspool crash_replay_interval 45 tiers 5 read_tier 5 write_tier 5 stripe_width 4160
pool 4 'fsmeta' replicated size 4 min_size 3 crush_ruleset 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 144 flags hashpspool stripe_width 0
pool 5 'fscache' replicated size 3 min_size 2 crush_ruleset 4 object_hash rjenkins pg_num 32 pgp_num 32 last_change 1016 flags hashpspool,incomplete_clones tier_of 3 cache_mode writeback target_bytes 1000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 86400s x4 decay_rate 0 search_last_n 0 stripe_width 0
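
For context -- and assuming the rbd, fsdata and fsmeta pools map only to the 24
HDD OSDs while the cache tier maps to the 4 SSDs -- that works out to roughly
(64x3 + 512x14 + 16x4) / 24 ~ 309 PG shards per HDD OSD, and 32x3 / 4 = 24 per
SSD OSD.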


Here's the ceph.conf. We're back to no extra configuration for bluestore
caching, but previously we had attempted setting the directive
bluestore_cache_size as low as 1073741.
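
To double-check what a running daemon actually picked up (osd.0 as an example):

ceph daemon osd.0 config get bluestore_cache_size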

[global]
fsid                      = c4b3b4ec-fbc2-4861-913f-295ff64f70ad
auth client required      = cephx
auth cluster required     = cephx
auth service required     = cephx

cephx require signatures  = true

public network            = 10.42.0.0/16
cluster network           = 10.43.100.0/24

mon_initial_members       = benjamin, jake, jennifer
mon_host                  = 10.42.5.38,10.42.5.37,10.42.5.36

[osd]
osd crush update on start = false


Thanks,
-Aaron

On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <
peter.malo...@brockmann-consult.de> wrote:

> How many PGs do you have? And did you change any config, like mds cache
> size? Show your ceph.conf.
>
>
> On 04/15/17 07:34, Aaron Ten Clay wrote:
>
> Hi all,
>
> Our cluster is experiencing a very odd issue and I'm hoping for some
> guidance on troubleshooting steps and/or suggestions to mitigate the issue.
> tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are
> eventually nuked by oom_killer.
>
> I'll try to explain the situation in detail:
>
> We have 24-4TB bluestore HDD OSDs, and 4-600GB SSD OSDs. The SSD OSDs are
> in a different CRUSH "root", used as a cache tier for the main storage
> pools, which are erasure coded and used for cephfs. The OSDs are spread
> across two identical machines with 128GiB of RAM each, and there are three
> monitor nodes on different hardware.
>
> Several times we've encountered crippling bugs with previous Ceph releases
> when we were on RC or betas, or using non-recommended configurations, so in
> January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04,
> and went with stable Kraken 11.2.0 with the configuration mentioned above.
> Everything was fine until the end of March, when one day we find all but a
> couple of OSDs are "down" inexplicably. Investigation reveals oom_killer
> came along and nuked almost all the ceph-osd processes.
>
> We've gone through a bunch of iterations of restarting the OSDs, trying to
> bring them up one at a time gradually, all at once, various configuration
> settings to reduce cache size as suggested in this ticket:
> http://tracker.ceph.com/issues/18924...
>
> I don't know if that ticket really pertains to our situation or not, I
> have no experience with memory allocation debugging. I'd be willing to try
> if someone can point me to a guide or walk me through the process.
>
> I've even tried, just to see if the situation was  transitory, adding over
> 300GiB of swap to both OSD machines. The OSD procs managed to allocate, in
> a matter of 5-10 minutes, more than 300GiB of RAM pressure and became
> oom_killer victims once again.
>
> No software or hardware changes took place around the time this problem
> started, and no significant data changes occurred either. We added about
> 40GiB of ~1GiB files a week or so before the problem started and that's the
> last time data was written.
>
> I can only assume we've found another crippling bug of some kind, this
> level of memory usage is entirely unprecedented. What can we do?
>
> Thanks in advance for any suggestions.
> -Aaron
>
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>


-- 
Aaron Ten Clay
https://aarontc.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

2017-04-15 Thread Aaron Ten Clay
Thanks for the recommendation, Bob! I'll try to get this data later today
and reply with it.

-Aaron

On Sat, Apr 15, 2017 at 9:46 AM, Bob R  wrote:

> I'd recommend running through these steps and posting the output as well
> http://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
>
> Bob
>
> On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <peter.malo...@brockmann-consult.de> wrote:
>
>> How many PGs do you have? And did you change any config, like mds cache
>> size? Show your ceph.conf.
>>
>>
>> On 04/15/17 07:34, Aaron Ten Clay wrote:
>>
>> Hi all,
>>
>> Our cluster is experiencing a very odd issue and I'm hoping for some
>> guidance on troubleshooting steps and/or suggestions to mitigate the issue.
>> tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are
>> eventually nuked by oom_killer.
>>
>> I'll try to explain the situation in detail:
>>
>> We have 24-4TB bluestore HDD OSDs, and 4-600GB SSD OSDs. The SSD OSDs are
>> in a different CRUSH "root", used as a cache tier for the main storage
>> pools, which are erasure coded and used for cephfs. The OSDs are spread
>> across two identical machines with 128GiB of RAM each, and there are three
>> monitor nodes on different hardware.
>>
>> Several times we've encountered crippling bugs with previous Ceph
>> releases when we were on RC or betas, or using non-recommended
>> configurations, so in January we abandoned all previous Ceph usage,
>> deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 with the
>> configuration mentioned above. Everything was fine until the end of March,
>> when one day we find all but a couple of OSDs are "down" inexplicably.
>> Investigation reveals oom_killer came along and nuked almost all the
>> ceph-osd processes.
>>
>> We've gone through a bunch of iterations of restarting the OSDs, trying
>> to bring them up one at a time gradually, all at once, various
>> configuration settings to reduce cache size as suggested in this ticket:
>> http://tracker.ceph.com/issues/18924...
>>
>> I don't know if that ticket really pertains to our situation or not, I
>> have no experience with memory allocation debugging. I'd be willing to try
>> if someone can point me to a guide or walk me through the process.
>>
>> I've even tried, just to see if the situation was  transitory, adding
>> over 300GiB of swap to both OSD machines. The OSD procs managed to
>> allocate, in a matter of 5-10 minutes, more than 300GiB of RAM pressure and
>> became oom_killer victims once again.
>>
>> No software or hardware changes took place around the time this problem
>> started, and no significant data changes occurred either. We added about
>> 40GiB of ~1GiB files a week or so before the problem started and that's the
>> last time data was written.
>>
>> I can only assume we've found another crippling bug of some kind, this
>> level of memory usage is entirely unprecedented. What can we do?
>>
>> Thanks in advance for any suggestions.
>> -Aaron
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


-- 
Aaron Ten Clay
https://aarontc.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS failover

2017-04-15 Thread Gandalf Corvotempesta
On 15 Apr 2017 5:48 PM, "John Spray" wrote:

MDSs do not replicate to one another.  They write all metadata to a
RADOS pool (i.e. to the OSDs), and when a failover happens, the new
active MDS reads the metadata in.


Are MDS writes atomic? Is a successful ack sent only after the metadata has
been properly written to the RADOS pool, or is it written in the background?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS failover

2017-04-15 Thread John Spray
On Sat, Apr 15, 2017 at 7:19 PM, Gandalf Corvotempesta
 wrote:
> Il 15 apr 2017 5:48 PM, "John Spray"  ha scritto:
>
> MDSs do not replicate to one another.  They write all metadata to a
> RADOS pool (i.e. to the OSDs), and when a failover happens, the new
> active MDS reads the metadata in.
>
>
> Is MDS atomic? A successful ack is sent only after data is properly wrote on
> radios pool or is wrote in background

Your client mount gives you POSIX semantics -- it obeys the normal
rules around fsync, etc.
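
In practice that means anything a client has successfully fsync'd is durable
across a failover. A minimal illustration on a CephFS client (path is just an
example):

dd if=/dev/zero of=/mnt/cephfs/testfile bs=4M count=8 conv=fsync
# once dd returns, the file and its data survive an MDS failover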

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com