On Thu, 4 Oct 2018, Goktug Yildirim wrote:
> This is our cluster state right now. I can run rbd list and that's good!
> Thanks a lot Sage!!!
> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/

Progress!  Not out of the woods yet, though...

> As you can see we have 2 unfound PGs since some of our OSDs cannot start.
> 58 OSDs give different errors.
> How can I fix these OSDs? If I remember correctly it should not be so much
> trouble.
> 
> These are the failed OSDs' logs.
> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> https://paste.ubuntu.com/p/pkRdVjCH4D/

These are both failing in rocksdb code, with something like

Can't access /032949.sst: NotFound:

Can you check whether that .sst file actually exists?  Might be a 
weird path issue.
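
Since these are bluestore OSDs the rocksdb files live inside BlueFS, so a
plain ls won't show them.  One way to look (a sketch; adjust the OSD id and
output dir, make sure the OSD is not running, and note that bluefs-export
needs enough free space for a copy of the DB):

$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-NNN \
      --out-dir /tmp/osd-NNN-bluefs
$ ls -l /tmp/osd-NNN-bluefs/db/ | grep 032949

If the .sst really isn't there, the export at least gives you a copy of
what rocksdb can see to work from.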

> https://paste.ubuntu.com/p/zJTf2fzSj9/
> https://paste.ubuntu.com/p/xpJRK6YhRX/

These are failing in the rocksdb CheckConsistency code.  Not sure what to 
make of that.

> https://paste.ubuntu.com/p/SY3576dNbJ/
> https://paste.ubuntu.com/p/smyT6Y976b/

These are failing in BlueStore code.  The ceph-bluestore-tool fsck may help 
here, can you give it a shot?
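
Something like this, with the OSD stopped (a sketch; --deep also reads all
object data and takes much longer, so maybe try without it first):

$ ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-NNN
$ ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-NNN

If fsck finds problems there is also a 'repair' command, but let's look at
the fsck output before going there.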

sage


> 
> > On 3 Oct 2018, at 21:37, Sage Weil <s...@newdream.net> wrote:
> > 
> > On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >> I'm so sorry about that, I missed the "out" parameter. My bad..
> >> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> > 
> > Excellent, thanks.  That looks like it confirms the problem is that the 
> > recovery tool didn't repopulate the creating pgs properly.
> > 
> > If you take that 30 byte file I sent earlier (as hex) and update the 
> > osdmap epoch to the latest on the mon, confirm it decodes and dumps 
> > properly, and then inject it on the 3 mons, that should get you past this 
> > hump (and hopefully back up!).
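> > 
> > Roughly, and only a sketch (NEWFILE and the mon name are placeholders;
> > take fresh backups of the mon stores first):
> > 
> > $ ceph-dencoder type creating_pgs_t import NEWFILE decode dump_json   # sanity check
> > # stop all 3 mons, then on each mon:
> > $ ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-NAME/store.db \
> >       set osd_pg_creating creating in NEWFILE
> > # then start the mons again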
> > 
> > sage
> > 
> > 
> >> 
> >> Sage Weil <s...@newdream.net> wrote (3 Oct 2018 21:13):
> >> 
> >>> I bet the kvstore output is in a hexdump format?  There is another
> >>> option to get the raw data iirc
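> >>> 
> >>> (iirc it is the "out" argument on the get, something like
> >>> 
> >>> $ ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-NAME/store.db \
> >>>       get osd_pg_creating creating out DUMPFILE
> >>> 
> >>> which writes the raw value instead of the hexdump.)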
> >>> 
> >>> 
> >>> 
> >>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
> >>>> <goktug.yildi...@gmail.com> wrote:
> >>>> I changed the file name to make it clear.
> >>>> When I use your command with "+decode"  I'm getting an error like this:
> >>>> 
> >>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
> >>>> error: buffer::malformed_input: void 
> >>>> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer 
> >>>> understand old encoding version 2 < 111
> >>>> 
> >>>> My ceph version: 13.2.2
> >>>> 
> >>>> On Wed, 3 Oct 2018 at 20:46, Sage Weil <s...@newdream.net> wrote:
> >>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>> If I didn't do it wrong, I got the output as below.
> >>>>>> 
> >>>>>> ceph-kvstore-tool rocksdb 
> >>>>>> /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating 
> >>>>>> creating > dump
> >>>>>> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column 
> >>>>>> families: [default]
> >>>>>> 
> >>>>>> ceph-dencoder type creating_pgs_t import dump dump_json
> >>>>> 
> >>>>> Sorry, should be
> >>>>> 
> >>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
> >>>>> 
> >>>>> s
> >>>>> 
> >>>>>> {
> >>>>>>    "last_scan_epoch": 0,
> >>>>>>    "creating_pgs": [],
> >>>>>>    "queue": [],
> >>>>>>    "created_pools": []
> >>>>>> }
> >>>>>> 
> >>>>>> You can find the "dump" link below.
> >>>>>> 
> >>>>>> dump: 
> >>>>>> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >>>>>> 
> >>>>>> 
> >>>>>> Sage Weil <s...@newdream.net> wrote (3 Oct 2018 18:45):
> >>>>>> 
> >>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>> We are starting to work on it. First step is getting the structure 
> >>>>>>>> out and dumping the current value as you say.
> >>>>>>>> 
> >>>>>>>> And you were correct we did not run force_create_pg.
> >>>>>>> 
> >>>>>>> Great.
> >>>>>>> 
> >>>>>>> So, eager to see what the current structure is... please attach once 
> >>>>>>> you 
> >>>>>>> have it.
> >>>>>>> 
> >>>>>>> The new replacement one should look like this (when hexdump -C'd):
> >>>>>>> 
> >>>>>>> 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
> >>>>>>> |................|
> >>>>>>> 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        
> >>>>>>> |..B...........|
> >>>>>>> 0000001e
> >>>>>>> 
> >>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, 
> >>>>>>> in 
> >>>>>>> hex, little endian (least significant byte first), in place of the 
> >>>>>>> 0x10 
> >>>>>>> that is there now.  It should dump like this:
> >>>>>>> 
> >>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> >>>>>>> {
> >>>>>>>   "last_scan_epoch": 16,   <--- but with a recent epoch here
> >>>>>>>   "creating_pgs": [],
> >>>>>>>   "queue": [],
> >>>>>>>   "created_pools": [
> >>>>>>>       66
> >>>>>>>   ]
> >>>>>>> }
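> >>>>>>> 
> >>>>>>> For example, if the latest osdmap epoch were 72883 (0x00011cb3), the file
> >>>>>>> could be built from the shell like this (a sketch; substitute your own
> >>>>>>> epoch: printf '%08x\n' EPOCH shows the hex, then write those bytes in
> >>>>>>> reverse order):
> >>>>>>> 
> >>>>>>> $ printf '\x02\x01\x18\x00\x00\x00\xb3\x1c\x01\x00\x00\x00\x00\x00\x01\x00'  > myfile
> >>>>>>> $ printf '\x00\x00\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'         >> myfile
> >>>>>>> $ hexdump -C myfile
> >>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json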
> >>>>>>> 
> >>>>>>> sage
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> 
> >>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <s...@newdream.net> wrote:
> >>>>>>>>> 
> >>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>>>> Sage,
> >>>>>>>>>> 
> >>>>>>>>>> Pool 66 is the only pool it shows right now. This is a pool created
> >>>>>>>>>> months ago.
> >>>>>>>>>> ceph osd lspools
> >>>>>>>>>> 66 mypool
> >>>>>>>>>> 
> >>>>>>>>>> As we recreated the mon db from the OSDs, the pools for MDS were
> >>>>>>>>>> unusable. So we deleted them.
> >>>>>>>>>> After we created another cephfs fs and pools we started MDS and it
> >>>>>>>>>> got stuck on creation. So we stopped MDS and removed the fs and fs
> >>>>>>>>>> pools. Right now we do not have MDS running nor do we have any
> >>>>>>>>>> cephfs related things.
> >>>>>>>>>> 
> >>>>>>>>>> ceph fs dump
> >>>>>>>>>> dumped fsmap epoch 1 e1
> >>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0
> >>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client 
> >>>>>>>>>> writeable ranges,3=default file layouts on dirs,4=dir inode in 
> >>>>>>>>>> separate object,5=mds uses versioned encoding,6=dirfrag is stored 
> >>>>>>>>>> in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>>>>>>>>> legacy client fscid: -1
> >>>>>>>>>> 
> >>>>>>>>>> No filesystems configured
> >>>>>>>>>> 
> >>>>>>>>>> ceph fs ls
> >>>>>>>>>> No filesystems enabled
> >>>>>>>>>> 
> >>>>>>>>>> Now pool 66 seems to be the only pool we have and it was created
> >>>>>>>>>> months ago. So I guess there is something hidden out there.
> >>>>>>>>>> 
> >>>>>>>>>> Is there any way to find and delete it?
> >>>>>>>>> 
> >>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an 
> >>>>>>>>> old 
> >>>>>>>>> pool... did you perhaps run force_create_pg at some point?  
> >>>>>>>>> Assuming you 
> >>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon 
> >>>>>>>>> store.. one that doesn't normally come up because the impact is 
> >>>>>>>>> this 
> >>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not 
> >>>>>>>>> cheap for 
> >>>>>>>>> your aged cluster.
> >>>>>>>>> 
> >>>>>>>>> In any case, there is a way to clear those out of the mon, but it's 
> >>>>>>>>> a bit 
> >>>>>>>>> dicey. 
> >>>>>>>>> 
> >>>>>>>>> 1. stop all mons
> >>>>>>>>> 2. make a backup of all mons
> >>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating,
> >>>>>>>>> key=creating value on one of the mons (see the sketch after this list)
> >>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import 
> >>>>>>>>> FILE dump_json
> >>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and
> >>>>>>>>> add pool 66 to the created_pools member.
> >>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct...
> >>>>>>>>> 7. inject the updated structure into all of the mons
> >>>>>>>>> 8. start all mons
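> >>>>>>>>> 
> >>>>>>>>> For steps 3 and 4 that is something like (mon name and file name are
> >>>>>>>>> placeholders):
> >>>>>>>>> 
> >>>>>>>>> $ ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-NAME/store.db \
> >>>>>>>>>       get osd_pg_creating creating out CURFILE
> >>>>>>>>> $ ceph-dencoder type creating_pgs_t import CURFILE decode dump_json
> >>>>>>>>> 
> >>>>>>>>> and step 7 should be the same ceph-kvstore-tool invocation with
> >>>>>>>>> "set osd_pg_creating creating in FILE" on each mon.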
> >>>>>>>>> 
> >>>>>>>>> 4-6 will probably be an iterative process... let's start by getting 
> >>>>>>>>> the 
> >>>>>>>>> structure out and dumping the current value?  
> >>>>>>>>> 
> >>>>>>>>> The code to refer to to understand the structure is 
> >>>>>>>>> src/mon/CreatingPGs.h 
> >>>>>>>>> encode/decode methods.
> >>>>>>>>> 
> >>>>>>>>> sage
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <s...@newdream.net> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>> Oh... I think this is the problem:
> >>>>>>>>>>> 
> >>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op 
> >>>>>>>>>>> osd_pg_create(e72883 
> >>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 
> >>>>>>>>>>> 66.124:60196 
> >>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> >>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>>>>>>>>>> 
> >>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg 
> >>>>>>>>>>> create is 
> >>>>>>>>>>> one of the last remaining places where the OSDs need to look at a 
> >>>>>>>>>>> full 
> >>>>>>>>>>> history of map changes between then and the current map epoch.  
> >>>>>>>>>>> In this 
> >>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k 
> >>>>>>>>>>> epochs 
> >>>>>>>>>>> later.
> >>>>>>>>>>> 
> >>>>>>>>>>> What is this new pool for?  Is it still empty, and if so, can we 
> >>>>>>>>>>> delete 
> >>>>>>>>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the 
> >>>>>>>>>>> mon end 
> >>>>>>>>>>> and restarting the OSDs will make these pg_creates go away.
> >>>>>>>>>>> 
> >>>>>>>>>>> s
> >>>>>>>>>>> 
> >>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> It seems nothing has changed.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ 
> >>>>>>>>>>>> <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ 
> >>>>>>>>>>>> <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >>>>>>>>>>>> 
> >>>>>>>>>>>> 
> >>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius 
> >>>>>>>>>>>>> <daz...@gmail.com> wrote:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your ceph
> >>>>>>>>>>>>> config file: "osd crush update on start = false". This should remove
> >>>>>>>>>>>>> the crush update that is generated when an osd starts.
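> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> For example, in ceph.conf on the OSD hosts (takes effect at the next
> >>>>>>>>>>>>> OSD start):
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> [osd]
> >>>>>>>>>>>>> osd crush update on start = false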
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 
> >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader)
> >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush
> >>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log 
> >>>>>>>>>>>>> [INF] :
> >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' 
> >>>>>>>>>>>>> cmd=[{"prefix":
> >>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>>>>>>>>>>>> dispatch
> >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 
> >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader)
> >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush 
> >>>>>>>>>>>>> create-or-move",
> >>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>>>>>>>>>>>> "root=default"]} v 0) v1
> >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log 
> >>>>>>>>>>>>> [INF] :
> >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' 
> >>>>>>>>>>>>> cmd=[{"prefix":
> >>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush 
> >>>>>>>>>>>>> item name
> >>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location
> >>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default}
> >>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
> >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full 
> >>>>>>>>>>>>> prune
> >>>>>>>>>>>>> enabled
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>>>>>>>>>>>> <goktug.yildi...@gmail.com> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi Sage,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is 
> >>>>>>>>>>>>>> going to be resolved.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The problem started when 7 servers crashed at the same time and
> >>>>>>>>>>>>>> came back after ~5 minutes.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since
> >>>>>>>>>>>>>> the mon services are enabled they should have started at nearly
> >>>>>>>>>>>>>> the same time. I don't know if this makes any difference, but
> >>>>>>>>>>>>>> some of the guys on IRC said they are required to start in
> >>>>>>>>>>>>>> order, not at the same time. Otherwise it could break things
> >>>>>>>>>>>>>> badly.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clean PGs. But in
> >>>>>>>>>>>>>> the end we have so many STUCK requests and our cluster cannot
> >>>>>>>>>>>>>> heal itself.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> When we set the noup flag, OSDs can catch up with the epoch
> >>>>>>>>>>>>>> easily. But when we unset the flag we see so many STUCK and SLOW
> >>>>>>>>>>>>>> OPS within 1 hour.
> >>>>>>>>>>>>>> I/O load on all of my OSD disks is at around 95% utilization
> >>>>>>>>>>>>>> and never ends. CPU and RAM usage are OK.
> >>>>>>>>>>>>>> OSDs get so stuck that we can't even run “ceph pg osd.0 query”.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Also we tried to change the RBD pool replication size from 2 to
> >>>>>>>>>>>>>> 1. Our goal was to eliminate the older PGs and leave the cluster
> >>>>>>>>>>>>>> with the good ones.
> >>>>>>>>>>>>>> With replication size=1 we saw “13% PGS not active”. But it
> >>>>>>>>>>>>>> didn’t solve our problem.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Of course we have to save 100% of the data. But we feel like
> >>>>>>>>>>>>>> even saving 50% of our data would make us very happy right now.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it 
> >>>>>>>>>>>>>> explains the whole story very nicely.
> >>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is our ceph.conf:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && 
> >>>>>>>>>>>>>> ceph health”:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is pg dump:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is iostat & perf top:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This strace output of ceph-osd:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is OSD log (default debug):
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is leader MON log (default debug):
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> These are the OSDs that failed to start. The total number is 58.
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and 
> >>>>>>>>>>>>>> debug_filestore = 20.
> >>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and 
> >>>>>>>>>>>>>> debug_filestore = 20.
> >>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> As far as I understand the OSD catches up with the mon epoch
> >>>>>>>>>>>>>> and then somehow exceeds the mon epoch??
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 
> >>>>>>>>>>>>>> 66.f8 e60196@2018-09-28 23:57:08.251119
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 
> >>>>>>>>>>>>>> build_initial_pg_history 66.f8 created 60196
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 
> >>>>>>>>>>>>>> 60196 - loading and decoding 0x19da8400
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) 
> >>>>>>>>>>>>>> _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) 
> >>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 
> >>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 
> >>>>>>>>>>>>>> e72642) queued
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) 
> >>>>>>>>>>>>>> _process 66.d8 to_process <OpQueueItem(66.d8 
> >>>>>>>>>>>>>> PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 
> >>>>>>>>>>>>>> NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> 
> >>>>>>>>>>>>>> waiting_peering {}
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) 
> >>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 
> >>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 
> >>>>>>>>>>>>>> e72642) pg 0xb579400
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 
> >>>>>>>>>>>>>> 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] 
> >>>>>>>>>>>>>> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 
> >>>>>>>>>>>>>> 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 
> >>>>>>>>>>>>>> lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 
> >>>>>>>>>>>>>> 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 
> >>>>>>>>>>>>>> 72642 epoch_requested: 72642 NullEvt +create_info
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 
> >>>>>>>>>>>>>> queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) 
> >>>>>>>>>>>>>> _process empty q, waiting
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 
> >>>>>>>>>>>>>> add_map_bl 60196 50012 bytes
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 
> >>>>>>>>>>>>>> 60197 - loading and decoding 0x19da8880
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 
> >>>>>>>>>>>>>> add_map_bl 60197 50012 bytes
> >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 
> >>>>>>>>>>>>>> 60198 - loading and decoding 0x19da9180
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <s...@newdream.net> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option 
> >>>>>>>>>>>>>> and you should
> >>>>>>>>>>>>>> only use it in very specific circumstances when directed by a 
> >>>>>>>>>>>>>> developer.
> >>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer.  But you're 
> >>>>>>>>>>>>>> not getting to
> >>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all
> >>>>>>>>>>>>>> need to catch up.  The way to tell if they are behind is by
> >>>>>>>>>>>>>> looking at the 'ceph daemon osd.NNN status' output and comparing
> >>>>>>>>>>>>>> to the latest OSDMap epoch that the mons have.  Were they really
> >>>>>>>>>>>>>> caught up when you unset noup?
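> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> For example, something like:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> $ ceph daemon osd.NNN status     # on the OSD host; look at "newest_map"
> >>>>>>>>>>>>>> $ ceph osd stat                  # the mons' current osdmap epoch
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> and the two epochs should be close before unsetting noup.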
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I
> >>>>>>>>>>>>>> haven't seen anything that explains why the OSDs are doing lots
> >>>>>>>>>>>>>> of disk IO.  Catching up on maps could explain it but not why
> >>>>>>>>>>>>>> they wouldn't peer once they were all marked up...
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> sage
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Has anyone heard about osd_find_best_info_ignore_history_les =
> >>>>>>>>>>>>>> true?
> >>>>>>>>>>>>>> Would it be useful here? There is so little information about it.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com> wrote (2 Oct 2018 22:11):
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions,
> >>>>>>>>>>>>>> since I read somewhere that it would do the proper sizing.
> >>>>>>>>>>>>>> For the bluestore cache size I have plenty of RAM. I will
> >>>>>>>>>>>>>> increase it to 8GB for each and decide on a more calculated
> >>>>>>>>>>>>>> number after the cluster settles.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I’ve also figured out the osd map loading, and it is in a loop.
> >>>>>>>>>>>>>> For that reason I started the cluster with the noup flag and
> >>>>>>>>>>>>>> waited for the OSDs to reach the up-to-date epoch number. After
> >>>>>>>>>>>>>> that I unset noup. But I did not pay attention to the manager
> >>>>>>>>>>>>>> logs. Let me check, thank you!
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I am not forcing jemalloc or anything else really. I have a
> >>>>>>>>>>>>>> very standard installation with no tweaks or tuning. All we
> >>>>>>>>>>>>>> asked for was stability over speed from the beginning. And here
> >>>>>>>>>>>>>> we are :/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius 
> >>>>>>>>>>>>>> <daz...@gmail.com> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I can see some issues from the osd log file. You have
> >>>>>>>>>>>>>> extremely small db and wal partitions: only 1GB for the DB and
> >>>>>>>>>>>>>> 576MB for the wal. I would recommend cranking up the rocksdb
> >>>>>>>>>>>>>> cache size as much as possible. If you have RAM you can also
> >>>>>>>>>>>>>> increase bluestore's cache size for hdd. The default is 1GB; be
> >>>>>>>>>>>>>> as liberal as you can without getting OOM kills. You also have
> >>>>>>>>>>>>>> lots of osd map loading and decoding in the log. Are you sure
> >>>>>>>>>>>>>> all monitors/managers/osds are up to date? Plus make sure you
> >>>>>>>>>>>>>> aren't forcing jemalloc loading. I had a funny interaction after
> >>>>>>>>>>>>>> upgrading to mimic.
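> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> For the cache setting mentioned above, that could look something like
> >>>>>>>>>>>>>> this in ceph.conf (the value is only an example; size it to your RAM):
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> [osd]
> >>>>>>>>>>>>>> bluestore cache size hdd = 4294967296    # 4 GiB instead of the 1 GiB default
> >>>>>>>>>>>>>> 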
> >>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>>>>>>>>>>>>> <goktug.yildi...@gmail.com> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hello Darius,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Thanks for reply!
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The main problem is we cannot query PGs. “ceph pg 67.54f
> >>>>>>>>>>>>>> query” gets stuck and waits forever since the OSD is unresponsive.
> >>>>>>>>>>>>>> We are certain that the OSD becomes unresponsive as soon as it
> >>>>>>>>>>>>>> is UP. And we are certain that the OSD responds again after its
> >>>>>>>>>>>>>> disk utilization stops.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So we ran a small test like this:
> >>>>>>>>>>>>>> * Stop all OSDs (168 of them).
> >>>>>>>>>>>>>> * Start OSD1. 95% osd disk utilization starts immediately. It
> >>>>>>>>>>>>>> takes 8 mins to finish. Only after that does “ceph pg 67.54f
> >>>>>>>>>>>>>> query” work!
> >>>>>>>>>>>>>> * While OSD1 is “up” start OSD2. As soon as OSD2 starts, OSD1 &
> >>>>>>>>>>>>>> OSD2 go to 95% disk utilization. This takes 17 minutes to finish.
> >>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O
> >>>>>>>>>>>>>> and it takes 25 mins to settle.
> >>>>>>>>>>>>>> * If you happen to start 5 of them at the same time, all of the
> >>>>>>>>>>>>>> OSDs start high I/O again. And it takes 1 hour to finish.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So in the light of these findings we set noup and started all
> >>>>>>>>>>>>>> OSDs. At first there was no I/O. After 10 minutes we unset noup.
> >>>>>>>>>>>>>> All 168 OSDs started doing high I/O. And we thought that if we
> >>>>>>>>>>>>>> waited long enough it would finish & the OSDs would be
> >>>>>>>>>>>>>> responsive again. After 24 hours they were not, because the I/O
> >>>>>>>>>>>>>> did not finish or even slow down.
> >>>>>>>>>>>>>> One might think there is a lot of data to scan. But it is just
> >>>>>>>>>>>>>> 33TB.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So in short, we don't know which PG is stuck so that we could remove it.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> However we ran into a weird thing half an hour ago. We
> >>>>>>>>>>>>>> exported the same PG from two different OSDs. One was 4.2GB and
> >>>>>>>>>>>>>> the other was 500KB! So we decided to export all OSDs for
> >>>>>>>>>>>>>> backup. Then we will delete the strangely sized ones and start
> >>>>>>>>>>>>>> the cluster all over. Maybe then we could solve the stuck or
> >>>>>>>>>>>>>> unfound PGs as you advise.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Any thought would be greatly appreciated.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius 
> >>>>>>>>>>>>>> <daz...@gmail.com> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend 
> >>>>>>>>>>>>>> finding them
> >>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are 
> >>>>>>>>>>>>>> failing to
> >>>>>>>>>>>>>> start and concentrate on bringing online those that have 
> >>>>>>>>>>>>>> missing
> >>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster 
> >>>>>>>>>>>>>> and see
> >>>>>>>>>>>>>> if it stabilises. If it stabilises, leave these settings; if
> >>>>>>>>>>>>>> not, turn them back on.
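> >>>>>>>>>>>>>> (e.g. "ceph osd unset nodown" and "ceph osd unset noout"; the flags
> >>>>>>>>>>>>>> mentioned below map to "ceph osd unset norebalance" and "ceph osd
> >>>>>>>>>>>>>> unset norecover".)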
> >>>>>>>>>>>>>> Now get some of the pg's that are blocked and query the pgs to
> >>>>>>>>>>>>>> check why they are blocked. Try removing as many blocks as
> >>>>>>>>>>>>>> possible and then remove the norebalance/norecovery flags and
> >>>>>>>>>>>>>> see if it starts to fix itself.
> >>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>>>>>>>>>>>> <morphinwith...@gmail.com> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> One of the ceph experts indicated that bluestore is somewhat of
> >>>>>>>>>>>>>> a preview tech (as far as Redhat is concerned).
> >>>>>>>>>>>>>> So it could be best to check out bluestore and rocksdb. There
> >>>>>>>>>>>>>> are some tools to check health and also repair, but there is
> >>>>>>>>>>>>>> limited documentation.
> >>>>>>>>>>>>>> Anyone who has experience with it?
> >>>>>>>>>>>>>> Any lead/help towards a proper check would be great.
> >>>>>>>>>>>>>> On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim
> >>>>>>>>>>>>>> <goktug.yildi...@gmail.com> wrote:
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6
> >>>>>>>>>>>>>> days since this cluster went offline. The long story is here:
> >>>>>>>>>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> I’ve also CC’ed the developers since I believe this is a bug.
> >>>>>>>>>>>>>> If this is not the correct way, I apologize; please let me know.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Over the 6 days lots of things happened and there were some
> >>>>>>>>>>>>>> findings about the problem. Some of them were misjudged and
> >>>>>>>>>>>>>> some of them were not looked into more deeply.
> >>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes
> >>>>>>>>>>>>>> very high disk I/O to its bluestore disk (WAL and DB are
> >>>>>>>>>>>>>> fine). After that the OSDs become unresponsive or very much less
> >>>>>>>>>>>>>> responsive. For example “ceph tell osd.x version” gets stuck
> >>>>>>>>>>>>>> forever.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is 
> >>>>>>>>>>>>>> our problem!
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of 
> >>>>>>>>>>>>>> the reason.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Here is the latest ceph status:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours
> >>>>>>>>>>>>>> ago.
> >>>>>>>>>>>>>> Some of the OSDs did not start. However it didn't make any
> >>>>>>>>>>>>>> difference when all of them were online.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Here is the debug=20 log of an OSD, which is the same for all
> >>>>>>>>>>>>>> the others:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>>>>>>>> As far as we can figure out there is a loop pattern. I am sure
> >>>>>>>>>>>>>> it won't be caught by eye.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> This is the full log of the same OSD.
> >>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Here is the strace of the same OSD process:
> >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Recently we hear more recommendations to upgrade to mimic. I
> >>>>>>>>>>>>>> hope no one gets hurt as we did. I am sure we have made lots of
> >>>>>>>>>>>>>> mistakes to let this happen. And this situation may be an
> >>>>>>>>>>>>>> example for other users and could be a potential bug for the
> >>>>>>>>>>>>>> ceph developers.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Any help to figure out what is going on would be great.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>> Goktug Yildirim
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>>>> ceph-users mailing list
> >>>>>>>>>>>>>> ceph-users@lists.ceph.com
> >>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>> 
> >>>>>> 
> >>>>>> 
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
