Re: [ceph-users] Uneven CPU usage on OSD nodes

Somnath Roy Wed, 25 Mar 2015 09:35:10 -0700

Hi Fredrick,
See my response inline.

Thanks & Regards
Somnath

From: f...@univ-lr.fr [mailto:f...@univ-lr.fr]
Sent: Wednesday, March 25, 2015 8:07 AM
To: Somnath Roy
Cc: Ceph Users
Subject: Re: [ceph-users] Uneven CPU usage on OSD nodes

Hi Somnath,

Thanks, the tcmalloc env variable trick definitely had an impact on 
FetchFromSpans calls.
    export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=1310851072; /etc/init.d/ceph stop; /etc/init.d/ceph start
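
The variable only helps if it actually reaches the ceph-osd processes after
the restart, so one quick check is to read it back from /proc. A minimal Python
sketch, assuming a Linux host and enough privileges to read another process's
environment; the helper names are illustrative and not part of Ceph or
tcmalloc:

```python
from pathlib import Path

VAR = "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE records of /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        if b"=" in record:
            key, _, value = record.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def tcmalloc_cache_bytes(pid: int):
    """Return the variable's value for a running process, or None if unset."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return parse_environ(raw).get(VAR)
```

Calling tcmalloc_cache_bytes(pid) for each ceph-osd PID should return
'1310851072' if the daemons inherited the export. This only shows that the
variable was passed on, not that the installed tcmalloc honors it.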

Nevertheless, although the FetchFromSpans call activity is now even across all
hosts, the CPU usage of the ceph-osd processes remains twice as high on 2
hosts:
http://www.4shared.com/photo/3IP8jGPWba/UnevenLoad4-perf.html
http://www.4shared.com/photo/XX4C9NHTba/UnevenLoad4-top.html

and this can be observed both under benchmark load and when idle:
http://www.4shared.com/photo/x2Fl_in-ce/UnevenLoad4-top-idle.html

[Somnath] I hope you are using the latest tcmalloc; as I said, there is a bug
in the tcmalloc shipped with Ubuntu 14.04. Not sure about RHEL though.
Nevertheless, the tcmalloc overhead seems to have gone away. Now it is all
about crc. As you can see from perf top, the crc calculation is consuming more
CPU on the two nodes. I guess that’s the difference now. Please turn off crc
calculation using the following config options.

        # Giant and prior:
        #ms_nocrc = true
        # Latest master/hammer (the following two):
        ms_crc_data = false
        ms_crc_header = false

The idle-time CPU difference is not that bad. We would need ‘perf top’ output
to see what is going on at idle.

I'm now almost doubting the values reported by 'top', as 'perf top' doesn't
reveal major differences in calls...

Could you elaborate on your sentence "saw the node consuming more cpus has more
memory pressure as well"? Do you mean on your site?
I can't see memory pressure on my hosts (~28GB available mem), but perhaps I'm
not looking at the right thing. And there is no swap on the hosts.

[Somnath] In your previous screenshots, the node with higher CPU usage was
also using more memory. The mem% reported by top was higher for the ceph-osds
there. That’s what I was pointing out. But now it is similar in both cases.

Here is the osd tree behind the linear distribution I mentioned:

ceph osd tree
# id    weight    type name    up/down    reweight
-1    217.8    root default
-2    54.45        host siggy
0    3.63            osd.0    up    1
1    3.63            osd.1    up    1
2    3.63            osd.2    up    1
3    3.63            osd.3    up    1
4    3.63            osd.4    up    1
5    3.63            osd.5    up    1
6    3.63            osd.6    up    1
7    3.63            osd.7    up    1
8    3.63            osd.8    up    1
9    3.63            osd.9    up    1
10    3.63            osd.10    up    1
11    3.63            osd.11    up    1
12    3.63            osd.12    up    1
13    3.63            osd.13    up    1
14    3.63            osd.14    up    1
-3    54.45        host horik
15    3.63            osd.15    up    1
16    3.63            osd.16    up    1
17    3.63            osd.17    up    1
18    3.63            osd.18    up    1
19    3.63            osd.19    up    1
20    3.63            osd.20    up    1
21    3.63            osd.21    up    1
22    3.63            osd.22    up    1
23    3.63            osd.23    up    1
24    3.63            osd.24    up    1
25    3.63            osd.25    up    1
26    3.63            osd.26    up    1
27    3.63            osd.27    up    1
28    3.63            osd.28    up    1
29    3.63            osd.29    up    1
-4    54.45        host floki
30    3.63            osd.30    up    1
31    3.63            osd.31    up    1
32    3.63            osd.32    up    1
33    3.63            osd.33    up    1
34    3.63            osd.34    up    1
35    3.63            osd.35    up    1
36    3.63            osd.36    up    1
37    3.63            osd.37    up    1
38    3.63            osd.38    up    1
39    3.63            osd.39    up    1
40    3.63            osd.40    up    1
41    3.63            osd.41    up    1
42    3.63            osd.42    up    1
43    3.63            osd.43    up    1
44    3.63            osd.44    up    1
-5    54.45        host borg
45    3.63            osd.45    up    1
46    3.63            osd.46    up    1
47    3.63            osd.47    up    1
48    3.63            osd.48    up    1
49    3.63            osd.49    up    1
50    3.63            osd.50    up    1
51    3.63            osd.51    up    1
52    3.63            osd.52    up    1
53    3.63            osd.53    up    1
54    3.63            osd.54    up    1
55    3.63            osd.55    up    1
56    3.63            osd.56    up    1
57    3.63            osd.57    up    1
58    3.63            osd.58    up    1
59    3.63            osd.59    up    1
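
To confirm from the listing that every host carries the same number of OSDs
and the same total weight, the tree text can be summed per host. A minimal
sketch, assuming the column layout shown above (id, weight, type/name, ...);
other Ceph releases print the tree differently, and host_weights is an
illustrative helper, not a Ceph tool:

```python
from collections import defaultdict


def host_weights(tree_text: str) -> dict:
    """Map each host to (number of OSDs, summed OSD weight)."""
    totals = defaultdict(float)
    osd_counts = defaultdict(int)
    host = None
    for line in tree_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "host":
            # e.g. "-2  54.45  host siggy": start of a new host bucket
            host = fields[3]
        elif len(fields) >= 3 and fields[2].startswith("osd.") and host:
            # e.g. "0  3.63  osd.0  up  1": an OSD under the current host
            totals[host] += float(fields[1])
            osd_counts[host] += 1
    return {h: (osd_counts[h], round(totals[h], 2)) for h in totals}
```

Fed the tree above, each of the four hosts should come out with 15 OSDs and a
summed weight equal to the 54.45 listed on its host line.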

Regards,
Frederic

Somnath Roy <somnath....@sandisk.com> wrote on 23/03/15 17:33:
Yes, we are also facing a similar issue under load (after running for some
time). This is tcmalloc behavior.
You can try setting the following env variable to a bigger value, say 128MB or
so.

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES

This env variable is supposed to alleviate the issue, but we found that in the
Ubuntu 14.04 version of tcmalloc it is a no-op. This was a bug in tcmalloc
which has been fixed in the latest tcmalloc code base.
Not sure about RHEL though. In that case, you may want to try the latest
tcmalloc. Just pointing LD_LIBRARY_PATH to the new tcmalloc location should
work.

Latest Ceph master has support for jemalloc and you may want to try with that 
if this is your test cluster.

Another point, I saw the node consuming more cpus has more memory pressure as
well (and that’s why tcmalloc is also having that issue). Can you give us the
output of ‘ceph osd tree’ to check whether the load distribution is even? Also,
check whether those systems are swapping.

Hope this helps.

Thanks & Regards
Somnath

From: f...@univ-lr.fr [mailto:f...@univ-lr.fr]
Sent: Monday, March 23, 2015 4:31 AM
To: Somnath Roy
Cc: Ceph Users
Subject: Re: [ceph-users] Uneven CPU usage on OSD nodes

Hi Somnath,

Thank you, please find my answers below

Somnath Roy <somnath....@sandisk.com> wrote on 22/03/15 18:16:
Hi Frederick,
Need some information here.

1. Just to clarify, you are saying it is happening in 0.87.1 and not in
Firefly?
That's a possibility; others running similar hardware (and possibly the same
OS, I can ask) confirm they don't see such behavior on Firefly.
I'd need to install Firefly on our hosts to be sure.
We run on RHEL.

2. Is it happening after some hours of running or right away?
It happens on freshly installed hosts and persists.

3. Please provide ‘perf top’ output of all the OSD nodes.
Here they are :
http://www.4shared.com/photo/S9tvbNKEce/UnevenLoad3-perf.html
http://www.4shared.com/photo/OHfiAtXKba/UnevenLoad3-top.html

The left-hand 'high-cpu' nodes show tcmalloc calls that could explain the CPU
difference. We don't see them on the 'low-cpu' nodes:

12,15%  libtcmalloc.so.4.1.2      [.] tcmalloc::CentralFreeList::FetchFromSpans

4. Provide the ceph.conf file from your OSD node as well.
It's a basic configuration. FSID and IP are removed

[global]
fsid = 589xxxxxxxxxxxxxxxxxxxxxa9
mon_initial_members = helga
mon_host = X.Y.Z.64
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = X.Y.0.0/16

Regards,
Frederic

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
f...@univ-lr.fr
Sent: Sunday, March 22, 2015 2:15 AM
To: Craig Lewis
Cc: Ceph Users
Subject: Re: [ceph-users] Uneven CPU usage on OSD nodes

Hi Craig,

An uneven primaries distribution was indeed my first thought.
I should have been more explicit about the percentages in the histograms I
gave; let's look at them in more detail.

For the 27938 bench objects seen by osdmap, the hosts are distributed as
follows:
20904 host1
21210 host2
20835 host3
20709 host4
That's the number of times each host appears (as primary, secondary or
tertiary).
The distribution is pretty linear, as we don't have more than 0.5% of total
objects difference between the most and the least used host.

If we now consider the primary host distribution, here is what we have:
7207 host1
6960 host2
6814 host3
6957 host4
That's the number of times each host appears as primary.
Once again, the distribution is correct, with less than 1.5% of the total
entries between the most and the least used host as primary.
I must add that a similar distribution is of course observed for the secondary
and the tertiary copies.

I think we have enough samples to confirm the correct distribution of the
crush function.
With each host having a 25% chance of being primary, this shouldn't be the
reason why we observe a higher CPU load. There must be something else...
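
The "less than 1.5%" figure for primaries can be rechecked from the four
counts listed above. A minimal sketch; the host labels are placeholders and
spread_percent is an illustrative helper, not part of any Ceph tool:

```python
def spread_percent(counts: dict) -> float:
    """Gap between the most and least used host, as a % of all entries."""
    total = sum(counts.values())
    return 100.0 * (max(counts.values()) - min(counts.values())) / total


# Number of times each host appears as primary, as listed above.
primaries = {"host1": 7207, "host2": 6960, "host3": 6814, "host4": 6957}

total = sum(primaries.values())
# Share of primaries per host, in percent.
shares = {host: 100.0 * n / total for host, n in primaries.items()}

print(total, round(spread_percent(primaries), 2))
print({host: round(share, 2) for host, share in shares.items()})
```

The four counts add up to the 27938 bench objects, and every host's share of
primaries stays within about one point of 25%.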

I must add that we run 0.87.1 Giant.
Going to a Firefly release is an option, as the phenomenon is not currently
observed on comparable hardware platforms running 0.80.x.
As for the memory on the hosts, 32GB is just a starting point for the tests.
We'll add more later.

Frederic

Craig Lewis <cle...@centraldesktop.com> wrote on 20/03/15 23:19:
I would say you're a little light on RAM.  With 4TB disks 70% full, I've seen 
some ceph-osd processes using 3.5GB of RAM during recovery.  You'll be fine 
during normal operation, but you might run into issues at the worst possible 
time.

I have 8 OSDs per node, and 32G of RAM.  I've had ceph-osd processes start 
swapping, and that's a great way to get them kicked out for being unresponsive.

I'm not a dev, but I can make some wild and uninformed guesses :-) .  The 
primary OSD uses more CPU than the replicas, and I suspect that you have more 
primaries on the hot nodes.

Since you're testing, try repeating the test on 3 OSD nodes instead of 4.  If 
you don't want to run that test, you can generate a histogram from ceph pg dump 
data, and see if there are more primary osds (the first one in the acting 
array) on the hot nodes.
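
A histogram like that could be sketched in Python as follows. The acting sets
are assumed to have been extracted from the ceph pg dump output already, as
text such as "[0,17,45]"; the OSD-to-host mapping (15 consecutive OSD ids per
host) is also an assumption for illustration, as are the helper names:

```python
from collections import Counter

OSDS_PER_HOST = 15  # assumption: osd.0-14 on host 0, osd.15-29 on host 1, ...


def parse_acting(text: str) -> list:
    """Turn an acting-set string such as '[0,17,45]' into [0, 17, 45]."""
    return [int(x) for x in text.strip().strip("[]").split(",") if x.strip()]


def host_of(osd_id: int) -> int:
    """Map an OSD id to a host index under the assumed layout."""
    return osd_id // OSDS_PER_HOST


def primary_histogram(acting_sets) -> Counter:
    """Count, per host, how many PGs have their primary OSD on that host."""
    histogram = Counter()
    for acting in acting_sets:
        if acting:  # the primary is the first OSD in the acting array
            histogram[host_of(acting[0])] += 1
    return histogram
```

For example, primary_histogram(parse_acting(s) for s in column), where column
is a hypothetical list holding the acting column of the dump, gives one count
per host. Note that this counts PGs rather than benchmark objects.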

On Wed, Mar 18, 2015 at 7:18 AM, f...@univ-lr.fr <f...@univ-lr.fr> wrote:
Hi to the ceph-users list!

We're setting up a new Ceph infrastructure:
- 1 MDS admin node
- 4 OSD storage nodes (60 OSDs)
  each of them running a monitor
- 1 client

Each 32GB RAM/16-core OSD node supports 15 x 4TB SAS OSDs (XFS) and 1 SSD with
5GB journal partitions, all attached as JBOD.
Every node has a 2x10Gb LACP attachment.
The OSD nodes are freshly installed with puppet, then from the admin node
Default OSD weight in the OSD tree
1 test pool with 4096 PGs

During the setup phase, we're trying to qualify the performance
characteristics of our setup.
Rados benchmarks are run from a client with these commands:
rados -p pool -b 4194304 bench 60 write -t 32 --no-cleanup
rados -p pool -b 4194304 bench 60 seq -t 32 --no-cleanup

Each time we observe a recurring phenomenon: 2 of the 4 OSD nodes have twice
the CPU load:
http://www.4shared.com/photo/Ua0umPVbba/UnevenLoad.html
(What to look at is the real-time %CPU and the cumulative CPU time per ceph-osd
process)

And after a fresh complete reinstall to be sure, this twice-as-high CPU load is
still observed, but not on the same 2 nodes:
http://www.4shared.com/photo/2AJfd1B_ba/UnevenLoad-v2.html

Nothing obvious about the installation seems able to explain that.

The crush distribution function shows no more than 4.5% inequality between
the 4 OSD nodes for the primary OSDs of the objects, and less than 3% between
the hosts if we consider the whole acting sets for the objects used during the
benchmark. These differences are nowhere near proportional to the differences
in CPU load. So the cause has to be elsewhere.

I cannot be sure it has no impact on performance. Even if we have enough CPU
core headroom, logic would say it has to have some consequences on latency and
also on performance.

Would anyone have an idea, or be able to reproduce the test on their own setup
to see whether this is common behavior?

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com