This is our cluster state right now. I can run "rbd list" again and that's good! Thanks a lot Sage!!! ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/
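For anyone hitting the same problem: the recovery from Sage's mails quoted below boils down to replacing the creating_pgs_t record in the mon store. A rough sketch of the commands follows (our mon store path; the file names, the example epoch 72883, and the "set ... in" form for the inject step are written from memory rather than verified, so check them against ceph-kvstore-tool --help on your version). The current OSDMap epoch can be read from "ceph osd dump | head -1" before stopping the mons.

# 1-2. stop all mons and back up every store.db first
# 3. extract the current creating_pgs structure from one mon
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db get osd_pg_creating creating out creating.cur
# 4. dump it to see what is really in there
ceph-dencoder type creating_pgs_t import creating.cur decode dump_json
# 5. build the 30-byte replacement from Sage's hexdump below, with a recent
#    OSDMap epoch in bytes 6-9, little endian. Example for epoch 72883 = 0x00011cb3:
echo "02 01 18 00 00 00 b3 1c 01 00 00 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 00 00 00 00" | xxd -r -p > creating.new
# 6. verify it decodes with the right last_scan_epoch and created_pools [66]
ceph-dencoder type creating_pgs_t import creating.new decode dump_json
# 7-8. inject it into all three mons, then start the mons again
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db set osd_pg_creating creating in creating.new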
As you can see we have 2 unfound pg since some of our OSDs can not start. 58 OSD gives different errors. How can I fix these OSD's? If I remember correctly it should not be so much trouble. These are OSDs' failed logs. https://paste.ubuntu.com/p/ZfRD5ZtvpS/ https://paste.ubuntu.com/p/pkRdVjCH4D/ https://paste.ubuntu.com/p/zJTf2fzSj9/ https://paste.ubuntu.com/p/xpJRK6YhRX/ https://paste.ubuntu.com/p/SY3576dNbJ/ https://paste.ubuntu.com/p/smyT6Y976b/ > On 3 Oct 2018, at 21:37, Sage Weil <s...@newdream.net> wrote: > > On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote: >> I'm so sorry about that I missed "out" parameter. My bad.. >> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/ > > Excellent, thanks. That looks like it confirms the problem is that teh > recovery tool didn't repopulate the creating pgs properly. > > If you take that 30 byte file I sent earlier (as hex) and update the > osdmap epoch to the latest on the mon, confirm it decodes and dumps > properly, and then inject it on the 3 mons, that should get you past this > hump (and hopefully back up!). > > sage > > >> >> Sage Weil <s...@newdream.net> şunları yazdı (3 Eki 2018 21:13): >> >>> I bet the kvstore output it in a hexdump format? There is another option >>> to get the raw data iirc >>> >>> >>> >>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM >>>> <goktug.yildi...@gmail.com> wrote: >>>> I changed the file name to make it clear. >>>> When I use your command with "+decode" I'm getting an error like this: >>>> >>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json >>>> error: buffer::malformed_input: void >>>> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand >>>> old encoding version 2 < 111 >>>> >>>> My ceph version: 13.2.2 >>>> >>>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <s...@newdream.net> şunu >>>> yazdı: >>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote: >>>>>> If I didn't do it wrong, I got the output as below. >>>>>> >>>>>> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ >>>>>> get osd_pg_creating creating > dump >>>>>> 2018-10-03 20:08:52.070 7f07f5659b80 1 rocksdb: do_open column >>>>>> families: [default] >>>>>> >>>>>> ceph-dencoder type creating_pgs_t import dump dump_json >>>>> >>>>> Sorry, should be >>>>> >>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json >>>>> >>>>> s >>>>> >>>>>> { >>>>>> "last_scan_epoch": 0, >>>>>> "creating_pgs": [], >>>>>> "queue": [], >>>>>> "created_pools": [] >>>>>> } >>>>>> >>>>>> You can find the "dump" link below. >>>>>> >>>>>> dump: >>>>>> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing >>>>>> >>>>>> >>>>>> Sage Weil <s...@newdream.net> şunları yazdı (3 Eki 2018 18:45): >>>>>> >>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote: >>>>>>>> We are starting to work on it. First step is getting the structure out >>>>>>>> and dumping the current value as you say. >>>>>>>> >>>>>>>> And you were correct we did not run force_create_pg. >>>>>>> >>>>>>> Great. >>>>>>> >>>>>>> So, eager to see what the current structure is... please attach once >>>>>>> you >>>>>>> have it. 
>>>>>>> >>>>>>> The new replacement one should look like this (when hexdump -C'd): >>>>>>> >>>>>>> 00000000 02 01 18 00 00 00 10 00 00 00 00 00 00 00 01 00 >>>>>>> |................| >>>>>>> 00000010 00 00 42 00 00 00 00 00 00 00 00 00 00 00 >>>>>>> |..B...........| >>>>>>> 0000001e >>>>>>> >>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, in >>>>>>> hex, little endian (least significant byte first), in place of the 0x10 >>>>>>> that is there now. It should dump like this: >>>>>>> >>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json >>>>>>> { >>>>>>> "last_scan_epoch": 16, <--- but with a recent epoch here >>>>>>> "creating_pgs": [], >>>>>>> "queue": [], >>>>>>> "created_pools": [ >>>>>>> 66 >>>>>>> ] >>>>>>> } >>>>>>> >>>>>>> sage >>>>>>> >>>>>>> >>>>>>>> >>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <s...@newdream.net> wrote: >>>>>>>>> >>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote: >>>>>>>>>> Sage, >>>>>>>>>> >>>>>>>>>> Pool 66 is the only pool it shows right now. This a pool created >>>>>>>>>> months ago. >>>>>>>>>> ceph osd lspools >>>>>>>>>> 66 mypool >>>>>>>>>> >>>>>>>>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So >>>>>>>>>> we deleted them. >>>>>>>>>> After we create another cephfs fs and pools we started MDS and it >>>>>>>>>> stucked on creation. So we stopped MDS and removed fs and fs pools. >>>>>>>>>> Right now we do not have MDS running nor we have cephfs related >>>>>>>>>> things. >>>>>>>>>> >>>>>>>>>> ceph fs dump >>>>>>>>>> dumped fsmap epoch 1 e1 >>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0 >>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client >>>>>>>>>> writeable ranges,3=default file layouts on dirs,4=dir inode in >>>>>>>>>> separate object,5=mds uses versioned encoding,6=dirfrag is stored in >>>>>>>>>> omap,8=no anchor table,9=file layout v2,10=snaprealm v2} >>>>>>>>>> legacy client fscid: -1 >>>>>>>>>> >>>>>>>>>> No filesystems configured >>>>>>>>>> >>>>>>>>>> ceph fs ls >>>>>>>>>> No filesystems enabled >>>>>>>>>> >>>>>>>>>> Now pool 66 seems to only pool we have and it has been created >>>>>>>>>> months ago. Then I guess there is something hidden out there. >>>>>>>>>> >>>>>>>>>> Is there any way to find and delete it? >>>>>>>>> >>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an old >>>>>>>>> pool... did you perhaps run force_create_pg at some point? Assuming >>>>>>>>> you >>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon >>>>>>>>> store.. one that doesn't normally come up because the impact is this >>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not cheap >>>>>>>>> for >>>>>>>>> your aged cluster. >>>>>>>>> >>>>>>>>> In any case, there is a way to clear those out of the mon, but it's a >>>>>>>>> bit >>>>>>>>> dicey. >>>>>>>>> >>>>>>>>> 1. stop all mons >>>>>>>>> 2. make a backup of all mons >>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating >>>>>>>>> key=creating key on one of the mons >>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE >>>>>>>>> dump_json >>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and adds >>>>>>>>> pool >>>>>>>>> 66 to the created_pgs member. >>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct... >>>>>>>>> 7. inject the updated structure into all of the mons >>>>>>>>> 8. start all mons >>>>>>>>> >>>>>>>>> 4-6 will probably be an iterative process... 
let's start by getting >>>>>>>>> the >>>>>>>>> structure out and dumping the current value? >>>>>>>>> >>>>>>>>> The code to refer to to understand the structure is >>>>>>>>> src/mon/CreatingPGs.h >>>>>>>>> encode/decode methods. >>>>>>>>> >>>>>>>>> sage >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <s...@newdream.net> wrote: >>>>>>>>>>> >>>>>>>>>>> Oh... I think this is the problem: >>>>>>>>>>> >>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op >>>>>>>>>>> osd_pg_create(e72883 >>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 >>>>>>>>>>> 66.124:60196 >>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 >>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916 >>>>>>>>>>> >>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg >>>>>>>>>>> create is >>>>>>>>>>> one of the last remaining places where the OSDs need to look at a >>>>>>>>>>> full >>>>>>>>>>> history of map changes between then and the current map epoch. In >>>>>>>>>>> this >>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k >>>>>>>>>>> epochs >>>>>>>>>>> later. >>>>>>>>>>> >>>>>>>>>>> What is this new pool for? Is it still empty, and if so, can we >>>>>>>>>>> delete >>>>>>>>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon >>>>>>>>>>> end >>>>>>>>>>> and restarting the OSDs will make these pg_creates go away. >>>>>>>>>>> >>>>>>>>>>> s >>>>>>>>>>> >>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hello, >>>>>>>>>>>> >>>>>>>>>>>> It seems nothing has changed. >>>>>>>>>>>> >>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ >>>>>>>>>>>> <https://paste.ubuntu.com/p/MtvTr5HYW4/> >>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ >>>>>>>>>>>> <https://paste.ubuntu.com/p/7Sx64xGzkR/> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daz...@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hello, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your >>>>>>>>>>>>> ceph >>>>>>>>>>>>> config file. "osd crush update on start = false". This should >>>>>>>>>>>>> remove >>>>>>>>>>>>> and update that is generated when osd starts. 
>>>>>>>>>>>>> >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader) >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush >>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1 >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 log_channel(audit) log >>>>>>>>>>>>> [INF] : >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' >>>>>>>>>>>>> cmd=[{"prefix": >>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]: >>>>>>>>>>>>> dispatch >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader) >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush >>>>>>>>>>>>> create-or-move", >>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8", >>>>>>>>>>>>> "root=default"]} v 0) v1 >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 log_channel(audit) log >>>>>>>>>>>>> [INF] : >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' >>>>>>>>>>>>> cmd=[{"prefix": >>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args": >>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item >>>>>>>>>>>>> name >>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location >>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default} >>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700 1 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune >>>>>>>>>>>>> enabled >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim >>>>>>>>>>>>> <goktug.yildi...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Sage, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is >>>>>>>>>>>>>> going to be resolved. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The problem started when 7 server crashed same time and they >>>>>>>>>>>>>> came back after ~5 minutes. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since >>>>>>>>>>>>>> mon services are enabled they should be started nearly at the >>>>>>>>>>>>>> same time. I dont know if this makes any difference but some of >>>>>>>>>>>>>> the guys on IRC told it is required that they start in order not >>>>>>>>>>>>>> at the same time. Otherwise it could break things badly. >>>>>>>>>>>>>> >>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the >>>>>>>>>>>>>> end we have so many STUCK request and our cluster can not heal >>>>>>>>>>>>>> itself. >>>>>>>>>>>>>> >>>>>>>>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when >>>>>>>>>>>>>> we unset the flag we see so many STUCKS and SLOW OPS in 1 hour. >>>>>>>>>>>>>> I/O load on all of my OSD disks are at around %95 utilization >>>>>>>>>>>>>> and never ends. CPU and RAM usage are OK. >>>>>>>>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our >>>>>>>>>>>>>> goal was the eliminate older PG's and leaving cluster with good >>>>>>>>>>>>>> ones. >>>>>>>>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it >>>>>>>>>>>>>> didn’t solve our problem. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Of course we have to save %100 of data. But we feel like even >>>>>>>>>>>>>> saving %50 of our data will be make us very happy right now. 
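(As a side note, a quick way to tell whether the OSDs have really caught up before unsetting noup, assuming the admin sockets still answer, is to compare epochs, e.g.:

ceph osd dump | head -1      # current cluster epoch, printed as "epoch NNNNN"
ceph daemon osd.150 status   # run on that OSD's host; compare its "newest_map"

If "newest_map" is far behind the cluster epoch, the OSD is still chewing through old maps.)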
>>>>>>>>>>>>>> >>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it >>>>>>>>>>>>>> explains the whole story very nicely. >>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is our ceph.conf: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph >>>>>>>>>>>>>> health”: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is pg dump: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is iostat & perf top: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This strace output of ceph-osd: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is OSD log (default debug): >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is leader MON log (default debug): >>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> These are OSDs failed to start. Total number is 58. >>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and >>>>>>>>>>>>>> debug_filestore = 20. >>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and >>>>>>>>>>>>>> debug_filestore = 20. >>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing >>>>>>>>>>>>>> >>>>>>>>>>>>>> As far as I understand OSD catchs up with the mon epoch and >>>>>>>>>>>>>> exceeds mon epoch somehow?? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 >>>>>>>>>>>>>> e60196@2018-09-28 23:57:08.251119 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 >>>>>>>>>>>>>> build_initial_pg_history 66.f8 created 60196 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>> 60196 - loading and decoding 0x19da8400 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process 66.d8 to_process <> waiting <> waiting_peering {} >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 >>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 >>>>>>>>>>>>>> e72642) queued >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process 66.d8 to_process <OpQueueItem(66.d8 >>>>>>>>>>>>>> PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt >>>>>>>>>>>>>> +create_info) prio 255 cost 10 e72642)> waiting <> >>>>>>>>>>>>>> waiting_peering {} >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 >>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 >>>>>>>>>>>>>> e72642) pg 0xb579400 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 >>>>>>>>>>>>>> pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] >>>>>>>>>>>>>> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c >>>>>>>>>>>>>> 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 >>>>>>>>>>>>>> lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod >>>>>>>>>>>>>> 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: >>>>>>>>>>>>>> 72642 epoch_requested: 72642 NullEvt +create_info >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 >>>>>>>>>>>>>> queue_want_up_thru want 72642 <= queued 72642, currently 72206 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process empty q, waiting >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl >>>>>>>>>>>>>> 60196 50012 bytes >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>> 60197 - loading and decoding 0x19da8880 >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl >>>>>>>>>>>>>> 60197 50012 bytes >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>> 60198 - loading and decoding 0x19da9180 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <s...@newdream.net> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and >>>>>>>>>>>>>> you should >>>>>>>>>>>>>> only use it in very specific circumstances when directed by a >>>>>>>>>>>>>> developer. >>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer. But you're not >>>>>>>>>>>>>> getting to >>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all >>>>>>>>>>>>>> need to >>>>>>>>>>>>>> catch up. 
The way to tell if they are behind is by looking at >>>>>>>>>>>>>> the 'ceph >>>>>>>>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap >>>>>>>>>>>>>> epoch tha >>>>>>>>>>>>>> t the mons have. Were they really caught up when you unset noup? >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I >>>>>>>>>>>>>> haven't seen >>>>>>>>>>>>>> anything that explains why teh OSDs are dong lots of disk IO. >>>>>>>>>>>>>> Catching up >>>>>>>>>>>>>> on maps could explain it but not why they wouldn't peer once >>>>>>>>>>>>>> they were all >>>>>>>>>>>>>> marked up... >>>>>>>>>>>>>> >>>>>>>>>>>>>> sage >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ? >>>>>>>>>>>>>> Is that be usefull here? There is such a less information about >>>>>>>>>>>>>> it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com> şunları yazdı (2 Eki >>>>>>>>>>>>>> 2018 22:11): >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when >>>>>>>>>>>>>> I read somewhere that that will do the proper sizing. >>>>>>>>>>>>>> For the blustore cache size I have plenty of RAM. I will >>>>>>>>>>>>>> increase 8GB for each and decide a more calculated number >>>>>>>>>>>>>> after cluster settles. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For the osd map loading I’ve also figured it out. And it is in >>>>>>>>>>>>>> loop. For that reason I started cluster with noup flag and >>>>>>>>>>>>>> waited OSDs to reach the uptodate epoch number. After that I >>>>>>>>>>>>>> unset noup. But I did not pay attention to manager logs. Let me >>>>>>>>>>>>>> check it, thank you! >>>>>>>>>>>>>> >>>>>>>>>>>>>> I am not forcing jmellac or anything else really. I have a very >>>>>>>>>>>>>> standard installation and no tweaks or tunings. All we ask for >>>>>>>>>>>>>> the stability versus speed from the begining. And here we are :/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daz...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I can see some issues from the osd log file. You have an >>>>>>>>>>>>>> extremely low >>>>>>>>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I >>>>>>>>>>>>>> would >>>>>>>>>>>>>> recommend cranking up rocksdb cache size as much as possible. If >>>>>>>>>>>>>> you >>>>>>>>>>>>>> have RAM you can also increase bluestores cache size for hdd. >>>>>>>>>>>>>> Default >>>>>>>>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You >>>>>>>>>>>>>> also >>>>>>>>>>>>>> have lots of osd map loading and decoding in the log. Are you >>>>>>>>>>>>>> sure all >>>>>>>>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't >>>>>>>>>>>>>> forcing jemalloc loading. I had a funny interaction after >>>>>>>>>>>>>> upgrading to >>>>>>>>>>>>>> mimic. >>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim >>>>>>>>>>>>>> <goktug.yildi...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello Darius, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for reply! >>>>>>>>>>>>>> >>>>>>>>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” >>>>>>>>>>>>>> does stucks and wait forever since OSD is unresponsive. >>>>>>>>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. 
And >>>>>>>>>>>>>> we are certain that OSD responds again after its disk >>>>>>>>>>>>>> utilization stops. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So we have a small test like that: >>>>>>>>>>>>>> * Stop all OSDs (168 of them) >>>>>>>>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It >>>>>>>>>>>>>> takes 8 mins to finish. Only after that “ceph pg 67.54f query” >>>>>>>>>>>>>> works! >>>>>>>>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & >>>>>>>>>>>>>> OSD2 starts %95 disk utilization. This takes 17 minutes to >>>>>>>>>>>>>> finish. >>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and >>>>>>>>>>>>>> it takes 25 mins to settle. >>>>>>>>>>>>>> * If you happen to start 5 of them at the same all of the OSDs >>>>>>>>>>>>>> start high I/O again. And it takes 1 hour to finish. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So in the light of these findings we flagged noup, started all >>>>>>>>>>>>>> OSDs. At first there was no I/O. After 10 minutes we unset noup. >>>>>>>>>>>>>> All of 168 OSD started to make high I/O. And we thought that if >>>>>>>>>>>>>> we wait long enough it will finish & OSDs will be responsive >>>>>>>>>>>>>> again. After 24hours they did not because I/O did not finish or >>>>>>>>>>>>>> even slowed down. >>>>>>>>>>>>>> One can think that is a lot of data there to scan. But it is >>>>>>>>>>>>>> just 33TB. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So at short we dont know which PG is stuck so we can remove it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> However we met an weird thing half an hour ago. We exported the >>>>>>>>>>>>>> same PG from two different OSDs. One was 4.2GB and the other is >>>>>>>>>>>>>> 500KB! So we decided to export all OSDs for backup. Then we will >>>>>>>>>>>>>> delete strange sized ones and start the cluster all over. Maybe >>>>>>>>>>>>>> then we could solve the stucked or unfound PGs as you advise. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Any thought would be greatly appreciated. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daz...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend finding >>>>>>>>>>>>>> them >>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are >>>>>>>>>>>>>> failing to >>>>>>>>>>>>>> start and concentrate on bringing online those that have missing >>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster >>>>>>>>>>>>>> and see >>>>>>>>>>>>>> if it stabilises. If it stabilises leave these setting if not >>>>>>>>>>>>>> turn >>>>>>>>>>>>>> them back on. >>>>>>>>>>>>>> Now get some of the pg's that are blocked and querry the pgs to >>>>>>>>>>>>>> check >>>>>>>>>>>>>> why they are blocked. Try removing as much blocks as possible >>>>>>>>>>>>>> and then >>>>>>>>>>>>>> remove the norebalance/norecovery flags and see if it starts to >>>>>>>>>>>>>> fix >>>>>>>>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin >>>>>>>>>>>>>> <morphinwith...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> One of ceph experts indicated that bluestore is somewhat preview >>>>>>>>>>>>>> tech >>>>>>>>>>>>>> (as for Redhat). >>>>>>>>>>>>>> So it could be best to checkout bluestore and rocksdb. There are >>>>>>>>>>>>>> some >>>>>>>>>>>>>> tools to check health and also repair. But there are limited >>>>>>>>>>>>>> documentation. >>>>>>>>>>>>>> Anyone who has experince with it? >>>>>>>>>>>>>> Anyone lead/help to a proper check would be great. 
>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com>, 1 Eki 2018 Pzt, >>>>>>>>>>>>>> 22:55 >>>>>>>>>>>>>> tarihinde şunu yazdı: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>> >>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 >>>>>>>>>>>>>> days since this cluster is offline. The long short story is >>>>>>>>>>>>>> here: >>>>>>>>>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html >>>>>>>>>>>>>> >>>>>>>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If >>>>>>>>>>>>>> this is not to correct way I apology and please let me know. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For the 6 days lots of thing happened and there were some >>>>>>>>>>>>>> outcomes about the problem. Some of them was misjudged and some >>>>>>>>>>>>>> of them are not looked deeper. >>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes very >>>>>>>>>>>>>> high disk I/O to its bluestore disk (WAL and DB are fine). After >>>>>>>>>>>>>> that OSDs become unresponsive or very very less responsive. For >>>>>>>>>>>>>> example "ceph tell osd.x version” stucks like for ever. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our >>>>>>>>>>>>>> problem! >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of the >>>>>>>>>>>>>> reason. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the latest ceph status: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours ago. >>>>>>>>>>>>>> Some of the OSDs are not started. However it didnt make any >>>>>>>>>>>>>> difference when all of them was online. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the debug=20 log of an OSD which is same for all others: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/ >>>>>>>>>>>>>> As we figure out there is a loop pattern. I am sure it wont >>>>>>>>>>>>>> caught from eye. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This the full log the same OSD. >>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the strace of the same OSD process: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as >>>>>>>>>>>>>> we do. I am sure we have done lots of mistakes to let this >>>>>>>>>>>>>> happening. And this situation may be a example for other user >>>>>>>>>>>>>> and could be a potential bug for ceph developer. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Any help to figure out what is going on would be great. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best Regards, >>>>>>>>>>>>>> Goktug Yildirim >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>>> ceph-users@lists.ceph.com >>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>> >>>>>> >>>>>> _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com