Never mind, you helped me a lot by showing this OSD startup procedure, Michael. Big thanks!
I seem to have made some progress now by setting the cache mode to forward. The OSD processes on the SATA hosts stopped failing immediately. I'm now waiting for the cache tier to flush. Then I'll try to enable recovery and backfill again to see if the cluster recovers.

Best greetings,
Lukas

On Thu, Oct 30, 2014 at 6:33 PM, Michael J. Kidd <michael.k...@inktank.com> wrote:

> Hello Lukas,
>   Unfortunately, I'm all out of ideas at the moment. There are some memory
> profiling techniques which can help identify what is causing the memory
> utilization, but it's a bit beyond what I typically work on. Others on the
> list may have experience with this (or otherwise have ideas) and may chip
> in...
>
> Wish I could be more help..
>
> Michael J. Kidd
> Sr. Storage Consultant
> Inktank Professional Services
> - by Red Hat
>
> On Thu, Oct 30, 2014 at 11:00 AM, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>
>> Thanks Michael, still no luck.
>>
>> Taking the problematic osd.10 down has no effect. Within minutes more
>> OSDs fail on the same issue after consuming ~50 GB of memory. Also, I can
>> see two of those cache-tier OSDs on separate hosts which remain at almost
>> 200% CPU utilization all the time.
>>
>> I've performed an upgrade of the whole cluster to 0.80.7. It did not help.
>>
>> I have also tried to unset the norecover and nobackfill flags to let the
>> recovery complete. No luck, several OSDs fail with the same issue,
>> preventing the recovery from completing. I've performed your fix steps
>> from the start again and currently I'm past the "unset noin" step.
>>
>> I could get some of the pools to a state with no degraded objects
>> temporarily. Then (within minutes) some OSD fails and they become
>> degraded again.
>>
>> I have also tried to let the OSD processes get restarted automatically to
>> keep them up as much as possible.
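For reference, the forward-and-flush sequence described above can be sketched as follows. The pool name volumes-cache comes from this thread; the script only prints the commands unless RUN=1 is set, and note that newer Ceph releases may additionally require a --yes-i-really-mean-it confirmation for the forward mode.

```shell
#!/bin/sh
# Sketch of switching a cache tier to forward mode and draining it.
# Dry-run by default: set RUN=1 to execute against a live cluster.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi; }

# Stop absorbing new writes: forward client IO to the backing pool.
run ceph osd tier cache-mode volumes-cache forward

# Flush dirty objects and evict clean ones from the cache pool.
run rados -p volumes-cache cache-flush-evict-all

# Watch the cache pool drain before unsetting norecover/nobackfill.
run ceph df
```

In dry-run mode the script just lists the three commands in order, which is also a convenient way to review the plan before touching the cluster.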
>>
>> I'm considering disabling the tiering pool "volumes-cache" as that's
>> something I can do without:
>>
>> pool name        category      KB           objects     clones   degraded
>> backups          -             0            0           0        0
>> data             -             0            0           0        0
>> images           -             777989590    95027       0        8883
>> metadata         -             0            0           0        0
>> rbd              -             0            0           0        0
>> volumes          -             115608693    25965       179      3307
>> volumes-cache    -             649577103    16708730    9894     1144650
>>
>> Can I just switch it into forward mode and let it empty
>> (cache-flush-evict-all) to see if that changes anything?
>>
>> Could you or any of your colleagues suggest anything else to try?
>>
>> Thank you,
>>
>> Lukas
>>
>> On Thu, Oct 30, 2014 at 3:05 PM, Michael J. Kidd <michael.k...@inktank.com> wrote:
>>
>>> Hello Lukas,
>>>   The 'slow request' logs are expected while the cluster is in such a
>>> state.. the OSD processes simply aren't able to respond quickly to client
>>> IO requests.
>>>
>>> I would recommend trying to recover without the most problematic disk
>>> (seems to be osd.10?).. Simply shut it down and see if the other OSDs
>>> settle down. You should also take a look at the kernel logs for any
>>> indication of a problem with the disks themselves, or possibly do an FIO
>>> test against the drive with the OSD shut down (to a file on the OSD
>>> filesystem, not the raw drive; that would be destructive).
>>>
>>> Also, you could upgrade to 0.80.7. There are some bug fixes, but I'm not
>>> sure if any would specifically help this situation.. not likely to hurt,
>>> though.
>>>
>>> The desired state is for the cluster to be steady-state before the next
>>> move (unsetting the next flag). Hopefully this can be achieved without
>>> needing to take down OSDs on multiple hosts.
>>>
>>> I'm also unsure about the cache tiering and how it could relate to the
>>> load being seen.
>>>
>>> Hope this helps...
>>>
>>> Michael J. Kidd
>>> Sr. Storage Consultant
>>> Inktank Professional Services
>>> - by Red Hat
>>>
>>> On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I've noticed the following messages always accumulate in the OSD log
>>>> before it exhausts all memory:
>>>>
>>>> 2014-10-30 08:48:42.994190 7f80a2019700  0 log [WRN] : slow request
>>>> 38.901192 seconds old, received at 2014-10-30 08:48:04.092889:
>>>> osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.000000000000363b@17
>>>> [copy-get max 8388608] 7.af87e887
>>>> ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently
>>>> reached pg
>>>>
>>>> Note this is always from the most frequently failing osd.10 (SATA tier),
>>>> referring to osd.29 (SSD cache tier). That osd.29 is consuming huge CPU
>>>> and memory resources, but keeps running without failures.
>>>>
>>>> Can this be a bug? Or some erroneous I/O request which initiated this
>>>> behaviour? Can I attempt to upgrade Ceph to a more recent release in the
>>>> current unhealthy state of the cluster? Can I try disabling the caching
>>>> tier? Or just somehow evacuate the problematic OSD?
>>>>
>>>> I'll welcome any ideas. Currently, I'm keeping osd.10 in an automatic
>>>> restart loop with a 60-second pause before starting again.
>>>>
>>>> Thanks and greetings,
>>>>
>>>> Lukas
>>>>
>>>> On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>>>>
>>>>> I should have figured that out myself since I did that recently.
>>>>> Thanks.
>>>>>
>>>>> Unfortunately, I'm still at the step "ceph osd unset noin". After
>>>>> setting all the OSDs in, the original issue reappears, preventing me
>>>>> from proceeding with recovery.
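The "automatic restart loop" mentioned above could look something like the sketch below; the sysvinit invocation `/etc/init.d/ceph start osd.N` matches a 0.80-era EL6 install and is an assumption, so adjust it for your distro. The block only defines the function; running it is an explicit choice.

```shell
#!/bin/sh
# restart_loop OSD_ID [PAUSE]: keep one OSD daemon running by restarting
# it whenever its process disappears, sleeping PAUSE seconds (default 60)
# between checks. The init-script path below is distro-specific.
restart_loop() {
    osd_id=$1
    pause=${2:-60}
    while :; do
        # Look for a ceph-osd process started with "-i <id>".
        if ! pgrep -f "ceph-osd.*-i ${osd_id}( |\$)" >/dev/null 2>&1; then
            echo "osd.${osd_id} is down, restarting"
            /etc/init.d/ceph start "osd.${osd_id}"
        fi
        sleep "$pause"
    done
}

# Usage (runs forever, so background it):
#   restart_loop 10 60 &
```

A systemd host would instead use `systemctl restart ceph-osd@10`, and a proper deployment would rely on the init system's own respawn handling rather than a shell loop; this is only a stop-gap while debugging.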
>>>>> It now appears mostly at a single OSD, osd.10, which consumes ~200%
>>>>> CPU and all memory within 45 seconds, then gets killed by Linux:
>>>>>
>>>>> Oct 29 18:24:38 q09 kernel: Out of memory: Kill process 17202
>>>>> (ceph-osd) score 912 or sacrifice child
>>>>> Oct 29 18:24:38 q09 kernel: Killed process 17202, UID 0, (ceph-osd)
>>>>> total-vm:62713176kB, anon-rss:62009772kB, file-rss:328kB
>>>>>
>>>>> I've tried to restart it several times with the same result. Similar
>>>>> situation with OSDs 0 and 13.
>>>>>
>>>>> Also, I've noticed one of the SSD cache tier's OSDs, osd.29, generating
>>>>> high CPU utilization, around 180%.
>>>>>
>>>>> The problematic OSDs have been the same ones all the time (OSDs 0, 8,
>>>>> 10, 13 and 29); they are the ones which I found to be down this
>>>>> morning.
>>>>>
>>>>> There is some minor load coming from clients (OpenStack instances)
>>>>> which I preferred not to kill:
>>>>>
>>>>> [root@q04 ceph-recovery]# ceph -s
>>>>>     cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99
>>>>>      health HEALTH_ERR 31 pgs backfill; 241 pgs degraded; 62 pgs down;
>>>>> 193 pgs incomplete; 13 pgs inconsistent; 62 pgs peering; 12 pgs
>>>>> recovering; 205 pgs recovery_wait; 93 pgs stuck inactive; 608 pgs stuck
>>>>> unclean; 381138 requests are blocked > 32 sec; recovery
>>>>> 1162468/35207488 objects degraded (3.302%); 466/17112963 unfound
>>>>> (0.003%); 13 scrub errors; 1/34 in osds are down;
>>>>> nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>>>>      monmap e2: 3 mons at {q03=10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0},
>>>>> election epoch 92, quorum 0,1,2 q03,q04,q05
>>>>>      osdmap e2782: 34 osds: 33 up, 34 in
>>>>>             flags nobackfill,norecover,noscrub,nodeep-scrub
>>>>>       pgmap v7440374: 5632 pgs, 7 pools, 1449 GB data, 16711 kobjects
>>>>>             3148 GB used, 15010 GB / 18158 GB avail
>>>>>             1162468/35207488 objects degraded (3.302%);
>>>>> 466/17112963 unfound (0.003%)
>>>>>                   13 active
>>>>>                   22 active+recovery_wait+remapped
>>>>>                    1 active+recovery_wait+inconsistent
>>>>>                 4794 active+clean
>>>>>                  193 incomplete
>>>>>                   62 down+peering
>>>>>                    9 active+degraded+remapped+wait_backfill
>>>>>                  182 active+recovery_wait
>>>>>                   74 active+remapped
>>>>>                   12 active+recovering
>>>>>                   12 active+clean+inconsistent
>>>>>                   22 active+remapped+wait_backfill
>>>>>                    4 active+clean+replay
>>>>>                  232 active+degraded
>>>>>   client io 0 B/s rd, 1048 kB/s wr, 184 op/s
>>>>>
>>>>> Below I'm sending the requested output.
>>>>>
>>>>> Do you have any other ideas how to recover from this?
>>>>>
>>>>> Thanks a lot.
>>>>>
>>>>> Lukas
>>>>>
>>>>> [root@q04 ceph-recovery]# ceph osd crush rule dump
>>>>> [
>>>>>     { "rule_id": 0,
>>>>>       "rule_name": "replicated_ruleset",
>>>>>       "ruleset": 0,
>>>>>       "type": 1,
>>>>>       "min_size": 1,
>>>>>       "max_size": 10,
>>>>>       "steps": [
>>>>>             { "op": "take",
>>>>>               "item": -1,
>>>>>               "item_name": "default"},
>>>>>             { "op": "chooseleaf_firstn",
>>>>>               "num": 0,
>>>>>               "type": "host"},
>>>>>             { "op": "emit"}]},
>>>>>     { "rule_id": 1,
>>>>>       "rule_name": "ssd",
>>>>>       "ruleset": 1,
>>>>>       "type": 1,
>>>>>       "min_size": 1,
>>>>>       "max_size": 10,
>>>>>       "steps": [
>>>>>             { "op": "take",
>>>>>               "item": -5,
>>>>>               "item_name": "ssd"},
>>>>>             { "op": "chooseleaf_firstn",
>>>>>               "num": 0,
>>>>>               "type": "host"},
>>>>>             { "op": "emit"}]},
>>>>>     { "rule_id": 2,
>>>>>       "rule_name": "sata",
>>>>>       "ruleset": 2,
>>>>>       "type": 1,
>>>>>       "min_size": 1,
>>>>>       "max_size": 10,
>>>>>       "steps": [
>>>>>             { "op": "take",
>>>>>               "item": -4,
>>>>>               "item_name": "sata"},
>>>>>             { "op": "chooseleaf_firstn",
>>>>>               "num": 0,
>>>>>               "type": "host"},
>>>>>             { "op": "emit"}]}]
>>>>>
>>>>> [root@q04 ceph-recovery]# ceph osd dump | grep pool
>>>>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 2 object_hash
>>>>> rjenkins pg_num 512 pgp_num 512 last_change 630 flags hashpspool
>>>>> crash_replay_interval 45 stripe_width 0
>>>>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 2
>>>>> object_hash rjenkins pg_num 512 pgp_num 512 last_change 632 flags
>>>>> hashpspool stripe_width 0
>>>>> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash
>>>>> rjenkins pg_num 512 pgp_num 512 last_change 634 flags hashpspool
>>>>> stripe_width 0
>>>>> pool 7 'volumes' replicated size 2 min_size 2 crush_ruleset 0
>>>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1517 flags
>>>>> hashpspool tiers 14 read_tier 14 write_tier 14 stripe_width 0
>>>>> pool 8 'images' replicated size 2 min_size 2 crush_ruleset 0
>>>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1519 flags
>>>>> hashpspool stripe_width 0
>>>>> pool 12 'backups' replicated size 2 min_size 1 crush_ruleset 0
>>>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 862 flags
>>>>> hashpspool stripe_width 0
>>>>> pool 14 'volumes-cache' replicated size 2 min_size 1 crush_ruleset 1
>>>>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1517 flags
>>>>> hashpspool tier_of 7 cache_mode writeback target_bytes 1000000000000
>>>>> hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0}
>>>>> 3600s x1 stripe_width 0
>>>>>
>>>>> On Wed, Oct 29, 2014 at 6:43 PM, Michael J. Kidd <michael.k...@inktank.com> wrote:
>>>>>
>>>>>> Ah, sorry... since they were set out manually, they'll need to be set
>>>>>> in manually..
>>>>>>
>>>>>> for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd
>>>>>> in $i; done
>>>>>>
>>>>>> Michael J. Kidd
>>>>>> Sr. Storage Consultant
>>>>>> Inktank Professional Services
>>>>>> - by Red Hat
>>>>>>
>>>>>> On Wed, Oct 29, 2014 at 12:33 PM, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>>>>>>
>>>>>>> I've ended up at the step "ceph osd unset noin".
>>>>>>> My OSDs are up, but not in, even after an hour:
>>>>>>>
>>>>>>> [root@q04 ceph-recovery]# ceph osd stat
>>>>>>>      osdmap e2602: 34 osds: 34 up, 0 in
>>>>>>>             flags nobackfill,norecover,noscrub,nodeep-scrub
>>>>>>>
>>>>>>> There seems to be no activity generated by the OSD processes;
>>>>>>> occasionally they show 0.3%, which I believe is just some basic
>>>>>>> communication processing. No load on the network interfaces.
>>>>>>>
>>>>>>> Is there some other step needed to bring the OSDs in?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Lukas
>>>>>>>
>>>>>>> On Wed, Oct 29, 2014 at 3:58 PM, Michael J. Kidd <michael.k...@inktank.com> wrote:
>>>>>>>
>>>>>>>> Hello Lukas,
>>>>>>>>   Please try the following process for getting all your OSDs up and
>>>>>>>> operational...
>>>>>>>>
>>>>>>>> * Set the following flags: noup, noin, noscrub, nodeep-scrub,
>>>>>>>> norecover, nobackfill
>>>>>>>>     for i in noup noin noscrub nodeep-scrub norecover nobackfill; do
>>>>>>>> ceph osd set $i; done
>>>>>>>>
>>>>>>>> * Stop all OSDs (I know, this seems counterproductive)
>>>>>>>> * Set all OSDs down / out
>>>>>>>>     for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph
>>>>>>>> osd down $i; ceph osd out $i; done
>>>>>>>> * Set recovery / backfill throttles as well as heartbeat and OSD map
>>>>>>>> processing tweaks in the /etc/ceph/ceph.conf file under the [osd]
>>>>>>>> section:
>>>>>>>> [osd]
>>>>>>>>     osd_max_backfills = 1
>>>>>>>>     osd_recovery_max_active = 1
>>>>>>>>     osd_recovery_max_single_start = 1
>>>>>>>>     osd_backfill_scan_min = 8
>>>>>>>>     osd_heartbeat_interval = 36
>>>>>>>>     osd_heartbeat_grace = 240
>>>>>>>>     osd_map_message_max = 1000
>>>>>>>>     osd_map_cache_size = 3136
>>>>>>>>
>>>>>>>> * Start all OSDs
>>>>>>>> * Monitor 'top' for 0% CPU on all OSD processes.. it may take a
>>>>>>>> while.. I usually issue 'top', then the keys M c
>>>>>>>>   - M = Sort by memory usage
>>>>>>>>   - c = Show command arguments
>>>>>>>>   - This makes it easy to monitor the OSD processes and know which
>>>>>>>> OSDs have settled, etc..
>>>>>>>> * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag
>>>>>>>>   - ceph osd unset noup
>>>>>>>> * Again, wait for 0% CPU utilization (may be immediate, may take a
>>>>>>>> while.. just gotta wait)
>>>>>>>> * Once all OSDs have hit 0% CPU again, remove the 'noin' flag
>>>>>>>>   - ceph osd unset noin
>>>>>>>>   - All OSDs should now appear up/in, and will go through peering..
>>>>>>>> * Once ceph -s shows no further activity, and OSDs are back at 0%
>>>>>>>> CPU again, unset 'nobackfill'
>>>>>>>>   - ceph osd unset nobackfill
>>>>>>>> * Once ceph -s shows no further activity, and OSDs are back at 0%
>>>>>>>> CPU again, unset 'norecover'
>>>>>>>>   - ceph osd unset norecover
>>>>>>>> * Monitor OSD memory usage... some OSDs may get killed off again,
>>>>>>>> but their subsequent restart should consume less memory and allow
>>>>>>>> more recovery to occur between each step above.. and ultimately,
>>>>>>>> hopefully... your entire cluster will come back online and be usable.
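Condensed, the sequence above amounts to something like the following sketch. The `ceph` calls here are routed through a print-only stub so the ordering can be inspected without a cluster; delete the stub line to run the commands for real. The manual "wait for 0% CPU" steps between flag changes cannot be scripted and are noted as comments.

```shell
#!/bin/sh
# Print-only stub: echoes each ceph command instead of executing it.
# Remove this line to run the sequence against a live cluster.
ceph() { echo "ceph $*"; }

freeze_cluster() {
    # Step 1: set all the protective flags before touching any OSD.
    for f in noup noin noscrub nodeep-scrub norecover nobackfill; do
        ceph osd set "$f"
    done
}

mark_all_down_out() {
    # Step 2: after stopping the daemons, mark every OSD down and out.
    # Column 3 of 'ceph osd tree' holds the osd name in 0.80.x output.
    for i in $(ceph osd tree | awk '/osd\./ {print $3}'); do
        ceph osd down "$i"
        ceph osd out "$i"
    done
}

release_flags() {
    # Later steps: unset one flag at a time; between each, wait until
    # 'top' shows every ceph-osd process idle at 0% CPU (a manual step).
    for f in noup noin nobackfill norecover; do
        ceph osd unset "$f"
    done
}

freeze_cluster
release_flags
```

The point of the ordering is that each unset flag releases one kind of work (map catch-up, peering, backfill, recovery) at a time, so memory pressure stays bounded instead of everything hitting the OSDs at once.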
>>>>>>>>
>>>>>>>> ## Clean-up:
>>>>>>>> * Remove all of the above set options from ceph.conf
>>>>>>>> * Reset the running OSDs to their defaults:
>>>>>>>>     ceph tell osd.\* injectargs '--osd_max_backfills 10
>>>>>>>> --osd_recovery_max_active 15 --osd_recovery_max_single_start 5
>>>>>>>> --osd_backfill_scan_min 64 --osd_heartbeat_interval 6
>>>>>>>> --osd_heartbeat_grace 36 --osd_map_message_max 100
>>>>>>>> --osd_map_cache_size 500'
>>>>>>>> * Unset the noscrub and nodeep-scrub flags:
>>>>>>>>   - ceph osd unset noscrub
>>>>>>>>   - ceph osd unset nodeep-scrub
>>>>>>>>
>>>>>>>> ## For help identifying why memory usage was so high, please provide:
>>>>>>>> * ceph osd dump | grep pool
>>>>>>>> * ceph osd crush rule dump
>>>>>>>>
>>>>>>>> Let us know if this helps... I know it looks extreme, but it's
>>>>>>>> worked for me in the past..
>>>>>>>>
>>>>>>>> Michael J. Kidd
>>>>>>>> Sr. Storage Consultant
>>>>>>>> Inktank Professional Services
>>>>>>>> - by Red Hat
>>>>>>>>
>>>>>>>> On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín <lukas.ku...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>> after months of running without change, I've found my Ceph 0.80.3
>>>>>>>>> cluster in a state with 5 of 34 OSDs down through the night. From
>>>>>>>>> the Linux logs I found out the OSD processes were killed because
>>>>>>>>> they consumed all available memory.
>>>>>>>>>
>>>>>>>>> Those 5 failed OSDs were on different hosts of my 4-node cluster
>>>>>>>>> (see below). Two hosts act as an SSD cache tier for some of my
>>>>>>>>> pools. The other two hosts are the default rotational-drive storage.
>>>>>>>>>
>>>>>>>>> After checking that Linux was not out of memory I attempted to
>>>>>>>>> restart those failed OSDs.
>>>>>>>>> Most of those OSD daemons exhaust all memory within seconds and get
>>>>>>>>> killed by Linux again:
>>>>>>>>>
>>>>>>>>> Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207
>>>>>>>>> (ceph-osd) score 867 or sacrifice child
>>>>>>>>> Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd)
>>>>>>>>> total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB
>>>>>>>>>
>>>>>>>>> On the host I've found lots of similar "slow request" messages
>>>>>>>>> preceding the crash:
>>>>>>>>>
>>>>>>>>> 2014-10-28 22:11:20.885527 7f25f84d1700  0 log [WRN] : slow request
>>>>>>>>> 31.117125 seconds old, received at 2014-10-28 22:10:49.768291:
>>>>>>>>> osd_sub_op(client.168752.0:2197931 14.2c7
>>>>>>>>> 888596c7/rbd_data.293272f8695e4.000000000000006f/head//14 [] v
>>>>>>>>> 1551'377417 snapset=0=[]:[] snapc=0=[]) v10 currently no flag points
>>>>>>>>> reached
>>>>>>>>> 2014-10-28 22:11:21.885668 7f25f84d1700  0 log [WRN] : 67 slow
>>>>>>>>> requests, 1 included below; oldest blocked for > 9879.304770 secs
>>>>>>>>>
>>>>>>>>> Apparently I can't get the cluster fixed by restarting the OSDs over
>>>>>>>>> and over again. Is there any other option, then?
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>> Lukas Kubin
>>>>>>>>>
>>>>>>>>> [root@q04 ~]# ceph -s
>>>>>>>>>     cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99
>>>>>>>>>      health HEALTH_ERR 9 pgs backfill; 1 pgs backfilling; 521 pgs
>>>>>>>>> degraded; 425 pgs incomplete; 13 pgs inconsistent; 20 pgs
>>>>>>>>> recovering; 50 pgs recovery_wait; 151 pgs stale; 425 pgs stuck
>>>>>>>>> inactive; 151 pgs stuck stale; 1164 pgs stuck unclean; 12070270
>>>>>>>>> requests are blocked > 32 sec; recovery 887322/35206223 objects
>>>>>>>>> degraded (2.520%); 119/17131232 unfound (0.001%); 13 scrub errors
>>>>>>>>>      monmap e2: 3 mons at {q03=10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0},
>>>>>>>>> election epoch 90, quorum 0,1,2 q03,q04,q05
>>>>>>>>>      osdmap e2194: 34 osds: 31 up, 31 in
>>>>>>>>>       pgmap v7429812: 5632 pgs, 7 pools, 1446 GB data, 16729 kobjects
>>>>>>>>>             2915 GB used, 12449 GB / 15365 GB avail
>>>>>>>>>             887322/35206223 objects degraded (2.520%);
>>>>>>>>> 119/17131232 unfound (0.001%)
>>>>>>>>>                   38 active+recovery_wait+remapped
>>>>>>>>>                 4455 active+clean
>>>>>>>>>                   65 stale+incomplete
>>>>>>>>>                    3 active+recovering+remapped
>>>>>>>>>                  359 incomplete
>>>>>>>>>                   12 active+recovery_wait
>>>>>>>>>                  139 active+remapped
>>>>>>>>>                   86 stale+active+degraded
>>>>>>>>>                   16 active+recovering
>>>>>>>>>                    1 active+remapped+backfilling
>>>>>>>>>                   13 active+clean+inconsistent
>>>>>>>>>                    9 active+remapped+wait_backfill
>>>>>>>>>                  434 active+degraded
>>>>>>>>>                    1 remapped+incomplete
>>>>>>>>>                    1 active+recovering+degraded+remapped
>>>>>>>>>   client io 0 B/s rd, 469 kB/s wr, 48 op/s
>>>>>>>>>
>>>>>>>>> [root@q04 ~]# ceph osd tree
>>>>>>>>> # id    weight  type name       up/down reweight
>>>>>>>>> -5      3.24    root ssd
>>>>>>>>> -6      1.62            host q06
>>>>>>>>> 16      0.18                    osd.16  up      1
>>>>>>>>> 17      0.18                    osd.17  up      1
>>>>>>>>> 18      0.18                    osd.18  up      1
>>>>>>>>> 19      0.18                    osd.19  up      1
>>>>>>>>> 20      0.18                    osd.20  up      1
>>>>>>>>> 21      0.18                    osd.21  up      1
>>>>>>>>> 22      0.18                    osd.22  up      1
>>>>>>>>> 23      0.18                    osd.23  up      1
>>>>>>>>> 24      0.18                    osd.24  up      1
>>>>>>>>> -7      1.62            host q07
>>>>>>>>> 25      0.18                    osd.25  up      1
>>>>>>>>> 26      0.18                    osd.26  up      1
>>>>>>>>> 27      0.18                    osd.27  up      1
>>>>>>>>> 28      0.18                    osd.28  up      1
>>>>>>>>> 29      0.18                    osd.29  up      1
>>>>>>>>> 30      0.18                    osd.30  up      1
>>>>>>>>> 31      0.18                    osd.31  up      1
>>>>>>>>> 32      0.18                    osd.32  up      1
>>>>>>>>> 33      0.18                    osd.33  up      1
>>>>>>>>> -1      14.56   root default
>>>>>>>>> -4      14.56   root sata
>>>>>>>>> -2      7.28            host q08
>>>>>>>>> 0       0.91                    osd.0   up      1
>>>>>>>>> 1       0.91                    osd.1   up      1
>>>>>>>>> 2       0.91                    osd.2   up      1
>>>>>>>>> 3       0.91                    osd.3   up      1
>>>>>>>>> 11      0.91                    osd.11  up      1
>>>>>>>>> 12      0.91                    osd.12  up      1
>>>>>>>>> 13      0.91                    osd.13  down    0
>>>>>>>>> 14      0.91                    osd.14  up      1
>>>>>>>>> -3      7.28            host q09
>>>>>>>>> 4       0.91                    osd.4   up      1
>>>>>>>>> 5       0.91                    osd.5   up      1
>>>>>>>>> 6       0.91                    osd.6   up      1
>>>>>>>>> 7       0.91                    osd.7   up      1
>>>>>>>>> 8       0.91                    osd.8   down    0
>>>>>>>>> 9       0.91                    osd.9   up      1
>>>>>>>>> 10      0.91                    osd.10  down    0
>>>>>>>>> 15      0.91                    osd.15  up      1
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list
>>>>>>>>> ceph-users@lists.ceph.com
>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
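As an aside to the OOM-killer reports quoted in this thread: a quick way to spot runaway ceph-osd memory before the kernel steps in is to sort processes by resident set size. This sketch assumes GNU procps `ps` (RSS is reported in kB) and simply prints each ceph-osd's RSS in GB, largest first.

```shell
#!/bin/sh
# List ceph-osd processes sorted by resident memory, largest first.
# Assumes GNU procps 'ps'; RSS column is in kB.
ps -eo pid,rss,comm --sort=-rss | awk '
    NR == 1 { print; next }                      # keep the header row
    $3 == "ceph-osd" {
        printf "%s  %.1f GB  ceph-osd\n", $1, $2 / 1048576
    }
'
```

On a healthy 0.80-era OSD the resident size should sit well under a few GB; values climbing toward the 59-62 GB seen in the logs above are the signal to intervene (or to capture a heap profile) before the OOM killer does.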
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com