I bet the kvstore output is in hexdump format? There is another option to get the raw data, iirc.
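If memory serves, ceph-kvstore-tool can write the value straight to a file instead of hexdumping it to stdout; something like this (from memory, so double-check the exact argument names in ceph-kvstore-tool --help):

    # dump the raw value of prefix=osd_pg_creating key=creating into a file
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
        get osd_pg_creating creating out /tmp/creating.raw

    # the raw file is what ceph-dencoder expects to import
    ceph-dencoder type creating_pgs_t import /tmp/creating.raw decode dump_json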
On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildi...@gmail.com> wrote:
>I changed the file name to make it clear.
>When I use your command with "+decode" I'm getting an error like this:
>
>ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
>
>My ceph version: 13.2.2
>
>On Wed, 3 Oct 2018 at 20:46, Sage Weil <s...@newdream.net> wrote:
>
>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>> > If I didn't do it wrong, I got the output as below.
>> >
>> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
>> >
>> > ceph-dencoder type creating_pgs_t import dump dump_json
>>
>> Sorry, should be
>>
>>   ceph-dencoder type creating_pgs_t import dump decode dump_json
>>
>> s
>>
>> > {
>> >     "last_scan_epoch": 0,
>> >     "creating_pgs": [],
>> >     "queue": [],
>> >     "created_pools": []
>> > }
>> >
>> > You can find the "dump" link below.
>> >
>> > dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>> >
>> > Sage Weil <s...@newdream.net> wrote (3 Oct 2018, 18:45):
>> >
>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
>> > >>
>> > >> And you were correct, we did not run force_create_pg.
>> > >
>> > > Great.
>> > >
>> > > So, eager to see what the current structure is... please attach once you have it.
>> > >
>> > > The new replacement one should look like this (when hexdump -C'd):
>> > >
>> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
>> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
>> > > 0000001e
>> > >
>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch, in hex, little endian (least significant byte first), in place of the 0x10 that is there now. It should dump like this:
>> > >
>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>> > > {
>> > >     "last_scan_epoch": 16,    <--- but with a recent epoch here
>> > >     "creating_pgs": [],
>> > >     "queue": [],
>> > >     "created_pools": [
>> > >         66
>> > >     ]
>> > > }
>> > >
>> > > sage
>> > >
>> > >>> On 3 Oct 2018, at 17:52, Sage Weil <s...@newdream.net> wrote:
>> > >>>
>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >>>> Sage,
>> > >>>>
>> > >>>> Pool 66 is the only pool it shows right now. This is a pool created months ago.
>> > >>>> ceph osd lspools
>> > >>>> 66 mypool
>> > >>>>
>> > >>>> As we recreated the mon DB from the OSDs, the pools for MDS were unusable, so we deleted them.
>> > >>>> After we created another cephfs filesystem and pools, we started MDS and it got stuck on creation. So we stopped MDS and removed the fs and the fs pools. Right now we do not have MDS running, nor do we have any cephfs-related things.
>> > >>>>
>> > >>>> ceph fs dump
>> > >>>> dumped fsmap epoch 1 e1
>> > >>>> enable_multiple, ever_enabled_multiple: 0,0
>> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> > >>>> legacy client fscid: -1
>> > >>>>
>> > >>>> No filesystems configured
>> > >>>>
>> > >>>> ceph fs ls
>> > >>>> No filesystems enabled
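As a side note, the replacement structure Sage describes above (the 30-byte blob with the epoch at offset 6, little endian) can be generated and patched from the shell instead of a hex editor. A rough, untested sketch; the epoch 72900 (0x00011cc4) is only an example and must be replaced with a recent OSDMap epoch from the cluster:

    # write the template structure Sage shows (last_scan_epoch=16, created_pools=[66])
    printf '\x02\x01\x18\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > newstruct

    # overwrite bytes 6-9 with a recent epoch, least significant byte first
    # e.g. epoch 72900 = 0x00011cc4 -> c4 1c 01 00
    printf '\xc4\x1c\x01\x00' | dd of=newstruct bs=1 seek=6 count=4 conv=notrunc

    # sanity-check the result before going anywhere near the mons
    hexdump -C newstruct
    ceph-dencoder type creating_pgs_t import newstruct decode dump_json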
>> > >>>>
>> > >>>> Now pool 66 seems to be the only pool we have, and it was created months ago. Then I guess there is something hidden out there.
>> > >>>>
>> > >>>> Is there any way to find and delete it?
>> > >>>
>> > >>> Ok, I'm concerned that the creating pg is in there if this is an old pool... did you perhaps run force_create_pg at some point? Assuming you didn't, I think this is a bug in the process for rebuilding the mon store... one that doesn't normally come up, because the impact is this osdmap scan, which is cheap in our test scenarios but clearly not cheap for your aged cluster.
>> > >>>
>> > >>> In any case, there is a way to clear those out of the mon, but it's a bit dicey.
>> > >>>
>> > >>> 1. stop all mons
>> > >>> 2. make a backup of all mons
>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating key=creating key on one of the mons
>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
>> > >>> 5. hex edit the structure to remove all of the creating pgs, and add pool 66 to the created_pools member
>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
>> > >>> 7. inject the updated structure into all of the mons
>> > >>> 8. start all mons
>> > >>>
>> > >>> 4-6 will probably be an iterative process... let's start by getting the structure out and dumping the current value?
>> > >>>
>> > >>> The code to refer to to understand the structure is the src/mon/CreatingPGs.h encode/decode methods.
>> > >>>
>> > >>> sage
>> > >>>
>> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <s...@newdream.net> wrote:
>> > >>>>>
>> > >>>>> Oh... I think this is the problem:
>> > >>>>>
>> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
>> > >>>>>
>> > >>>>> You are in the midst of creating new pgs, and unfortunately pg create is one of the last remaining places where the OSDs need to look at a full history of map changes between then and the current map epoch. In this case, the pool was created in 60196 and it is now 72883, ~12k epochs later.
>> > >>>>>
>> > >>>>> What is this new pool for? Is it still empty, and if so, can we delete it? If yes, I'm ~70% sure that will then get cleaned out at the mon end and restarting the OSDs will make these pg_creates go away.
>> > >>>>>
>> > >>>>> s
>> > >>>>>
>> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >>>>>>
>> > >>>>>> Hello,
>> > >>>>>>
>> > >>>>>> It seems nothing has changed.
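For steps 3 and 7 of the procedure Sage lists above, the extract/inject round trip with ceph-kvstore-tool would look roughly like this (the 'out'/'in' file arguments are from memory; only run this with the mons stopped and backed up, and repeat the 'set' on every mon):

    # step 3: pull the current value out of one (stopped) mon
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
        get osd_pg_creating creating out /tmp/creating.orig

    # steps 4-6: dump, hex edit, and re-check the edited copy
    ceph-dencoder type creating_pgs_t import /tmp/creating.edited decode dump_json

    # step 7: inject the edited value back -- repeat on every mon's store.db
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db \
        set osd_pg_creating creating in /tmp/creating.edited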
>> > >>>>>>
>> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/
>> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/
>> > >>>>>>
>> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daz...@gmail.com> wrote:
>> > >>>>>>>
>> > >>>>>>> Hello,
>> > >>>>>>>
>> > >>>>>>> You can also reduce the osd map updates by adding this to your ceph config file: "osd crush update on start = false". This should remove an update that is generated when the osd starts.
>> > >>>>>>>
>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader) e14 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] : from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]: dispatch
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader) e14 handle_command mon_command({"prefix": "osd crush create-or-move", "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8", "root=default"]} v 0) v1
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] : from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix": "osd crush create-or-move", "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name 'osd.47' initial_weight 3.6396 at location {host=SRV-SEKUARK8,root=default}
>> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1 mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune enabled
>> > >>>>>>>
>> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim <goktug.yildi...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi Sage,
>> > >>>>>>>>
>> > >>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
>> > >>>>>>>>
>> > >>>>>>>> The problem started when 7 servers crashed at the same time and came back after ~5 minutes.
>> > >>>>>>>>
>> > >>>>>>>> Two of our 3 mon services were restarted in this crash. Since the mon services are enabled, they should have started at nearly the same time. I don't know if this makes any difference, but some of the guys on IRC said they have to start in order, not at the same time; otherwise it could break things badly.
>> > >>>>>>>>
>> > >>>>>>>> After 9 days we still see 3400-3500 active+clean PGs. But in the end we have so many stuck requests and our cluster cannot heal itself.
>> > >>>>>>>>
>> > >>>>>>>> When we set the noup flag, the OSDs can catch up on epochs easily. But when we unset the flag we see so many STUCK and SLOW OPS within 1 hour.
>> > >>>>>>>> I/O load on all of my OSD disks is at around 95% utilization and never ends. CPU and RAM usage are OK.
>> > >>>>>>>> The OSDs get so stuck that we can't even run "ceph pg osd.0 query".
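Darius's "osd crush update on start" suggestion above would go into ceph.conf on the OSD hosts, e.g. (just a sketch of where the option lives):

    [osd]
        # skip the automatic crush create-or-move/set-device-class on OSD start
        osd crush update on start = false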
>> > >>>>>>>> Also, we tried to change the RBD pool replication size from 2 to 1. Our goal was to eliminate the older PGs and leave the cluster with the good ones.
>> > >>>>>>>> With replication size=1 we saw "13% PGS not active". But it didn't solve our problem.
>> > >>>>>>>>
>> > >>>>>>>> Of course we have to save 100% of the data. But we feel like even saving 50% of our data would make us very happy right now.
>> > >>>>>>>>
>> > >>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
>> > >>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> This is our ceph.conf:
>> > >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>> > >>>>>>>>
>> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health":
>> > >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>> > >>>>>>>>
>> > >>>>>>>> This is the pg dump:
>> > >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>> > >>>>>>>>
>> > >>>>>>>> This is iostat & perf top:
>> > >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>> > >>>>>>>>
>> > >>>>>>>> This is strace output of ceph-osd:
>> > >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>> > >>>>>>>>
>> > >>>>>>>> This is an OSD log (default debug):
>> > >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>> > >>>>>>>>
>> > >>>>>>>> This is the leader MON log (default debug):
>> > >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>> > >>>>>>>>
>> > >>>>>>>> These are the OSDs that failed to start. The total number is 58.
>> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>> > >>>>>>>>
>> > >>>>>>>> This is an OSD video with debug osd = 20, debug ms = 1 and debug_filestore = 20.
>> > >>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> This is the OSD logfile with debug osd = 20, debug ms = 1 and debug_filestore = 20.
>> > >>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> As far as I understand, the OSD catches up with the mon epoch and then somehow exceeds the mon epoch??
>> > >>>>>>>>
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
>> > >>>>>>>>
>> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <s...@newdream.net> wrote:
>> > >>>>>>>>
>> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should only use it in very specific circumstances when directed by a developer. In such cases it will allow a stuck PG to peer. But you're not getting to that point... you're seeing some sort of resource exhaustion.
>> > >>>>>>>>
>> > >>>>>>>> The noup trick works when OSDs are way behind on maps and all need to catch up. The way to tell if they are behind is by looking at the 'ceph daemon osd.NNN status' output and comparing it to the latest OSDMap epoch that the mons have. Were they really caught up when you unset noup?
>> > >>>>>>>>
>> > >>>>>>>> I'm just catching up and haven't read the whole thread, but I haven't seen anything that explains why the OSDs are doing lots of disk IO. Catching up on maps could explain it, but not why they wouldn't peer once they were all marked up...
>> > >>>>>>>>
>> > >>>>>>>> sage
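To make the comparison Sage describes (are the OSDs actually caught up to the mons' OSDMap epoch before noup is unset?), something along these lines should work; osd.150 is just an example:

    # current OSDMap epoch according to the mons
    ceph osd dump | head -1            # first line prints "epoch NNNNN"

    # newest map a given OSD has seen (run on that OSD's host)
    ceph daemon osd.150 status         # compare the "newest_map" field

    # only unset noup once newest_map has caught up on the OSDs
    ceph osd unset noup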
>> > >>>>>>>>
>> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>> > >>>>>>>>
>> > >>>>>>>> Has anyone heard about osd_find_best_info_ignore_history_les = true? Would that be useful here? There is very little information about it.
>> > >>>>>>>>
>> > >>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com> wrote (2 Oct 2018, 22:11):
>> > >>>>>>>>
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>> Indeed, I left ceph-disk to decide the wal and db partitions when I read somewhere that it would do the proper sizing.
>> > >>>>>>>> For the bluestore cache size I have plenty of RAM. I will increase it to 8GB for each and decide on a more calculated number after the cluster settles.
>> > >>>>>>>>
>> > >>>>>>>> For the osd map loading I've also figured it out, and it is in a loop. For that reason I started the cluster with the noup flag and waited for the OSDs to reach the up-to-date epoch number. After that I unset noup. But I did not pay attention to the manager logs. Let me check it, thank you!
>> > >>>>>>>>
>> > >>>>>>>> I am not forcing jemalloc or anything else, really. I have a very standard installation and no tweaks or tunings. All we asked for was stability over speed from the beginning. And here we are :/
>> > >>>>>>>>
>> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daz...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>> I can see some issues from the osd log file. You have extremely small db and wal partitions: only 1GB for DB and 576MB for WAL. I would recommend cranking up the rocksdb cache size as much as possible. If you have RAM you can also increase bluestore's cache size for hdd. The default is 1GB; be as liberal as you can without getting OOM kills. You also have lots of osd map loading and decoding in the log. Are you sure all monitors/managers/osds are up to date? Plus make sure you aren't forcing jemalloc loading. I had a funny interaction after upgrading to mimic.
>> > >>>>>>>>
>> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim <goktug.yildi...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hello Darius,
>> > >>>>>>>>
>> > >>>>>>>> Thanks for the reply!
>> > >>>>>>>>
>> > >>>>>>>> The main problem is that we cannot query PGs. "ceph pg 67.54f query" gets stuck and waits forever, since the OSD is unresponsive.
>> > >>>>>>>> We are certain that an OSD becomes unresponsive as soon as it is UP, and we are certain that the OSD responds again after its disk utilization stops.
>> > >>>>>>>>
>> > >>>>>>>> So we ran a small test like this:
>> > >>>>>>>> * Stop all OSDs (168 of them).
>> > >>>>>>>> * Start OSD1. 95% osd disk utilization starts immediately. It takes 8 minutes to finish. Only after that does "ceph pg 67.54f query" work!
>> > >>>>>>>> * While OSD1 is "up", start OSD2. As soon as OSD2 starts, OSD1 & OSD2 start 95% disk utilization. This takes 17 minutes to finish.
>> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 minutes to settle.
>> > >>>>>>>> * If you happen to start 5 of them at the same time, all of the OSDs start high I/O again, and it takes 1 hour to finish.
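On Darius's cache advice above: if I remember right, the HDD default he refers to is bluestore_cache_size_hdd, so the 8GB figure mentioned earlier in the thread would look like this in ceph.conf (size is illustrative; watch memory use to avoid OOM kills):

    [osd]
        # per-OSD bluestore cache for HDD-backed OSDs (default is ~1 GiB)
        bluestore cache size hdd = 8589934592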
>> > >>>>>>>>
>> > >>>>>>>> So, in the light of these findings, we set the noup flag and started all OSDs. At first there was no I/O. After 10 minutes we unset noup and all 168 OSDs started doing high I/O. We thought that if we waited long enough it would finish and the OSDs would be responsive again. After 24 hours they were not, because the I/O did not finish or even slow down.
>> > >>>>>>>> One could think there is a lot of data there to scan. But it is just 33TB.
>> > >>>>>>>>
>> > >>>>>>>> So, in short, we don't know which PG is stuck, so we cannot remove it.
>> > >>>>>>>>
>> > >>>>>>>> However, we ran into something weird half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other was 500KB! So we decided to export all OSDs for backup. Then we will delete the strangely sized ones and start the cluster all over. Maybe then we can solve the stuck or unfound PGs as you advise.
>> > >>>>>>>>
>> > >>>>>>>> Any thought would be greatly appreciated.
>> > >>>>>>>>
>> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daz...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hello,
>> > >>>>>>>>
>> > >>>>>>>> Currently you have 15 objects missing. I would recommend finding them and making backups of them. Ditch all other osds that are failing to start and concentrate on bringing online those that have the missing objects. Then slowly turn off nodown and noout on the cluster and see if it stabilises. If it stabilises, leave these settings; if not, turn them back on.
>> > >>>>>>>> Now get some of the pgs that are blocked and query them to check why they are blocked. Try removing as many blocks as possible, then remove the norebalance/norecovery flags and see if it starts to fix itself.
>> > >>>>>>>>
>> > >>>>>>>> On Tue, Oct 2, 2018 at 5:14 PM by morphin <morphinwith...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> One of the ceph experts indicated that bluestore is somewhat preview tech (at least for Redhat).
>> > >>>>>>>> So it could be best to check out bluestore and rocksdb. There are some tools to check health and also repair, but the documentation is limited.
>> > >>>>>>>> Does anyone have experience with them?
>> > >>>>>>>> Anyone who could lead/help with a proper check would be great.
>> > >>>>>>>>
>> > >>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com> wrote on Mon, 1 Oct 2018 at 22:55:
>> > >>>>>>>>
>> > >>>>>>>> Hi all,
>> > >>>>>>>>
>> > >>>>>>>> We have recently upgraded from luminous to mimic. It's been 6 days since this cluster went offline. The long and short of the story is here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>> > >>>>>>>>
>> > >>>>>>>> I've also CC'ed the developers, since I believe this is a bug. If this is not the correct way, I apologize; please let me know.
>> > >>>>>>>>
>> > >>>>>>>> Over the 6 days lots of things happened and there were some findings about the problem. Some of them were misjudged and some were not looked into deeply.
>> > >>>>>>>> However, the most certain diagnosis is this: each OSD causes very high disk I/O on its bluestore disk (WAL and DB are fine). After that, the OSDs become unresponsive or very much less responsive. For example, "ceph tell osd.x version" gets stuck, seemingly forever.
>> > >>>>>>>>
>> > >>>>>>>> So, due to unresponsive OSDs, the cluster does not settle. This is our problem!
>> > >>>>>>>>
>> > >>>>>>>> This is the one thing we are very sure of. But we are not sure of the reason.
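On morphin's point above about bluestore/rocksdb having check and repair tools: the usual entry point is ceph-bluestore-tool, run against a stopped OSD (osd.0 below is just an example; take backups first and treat 'repair' as a last resort):

    # consistency check of a stopped OSD's bluestore
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0

    # attempt repairs only if fsck reports errors
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0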
>> > >>>>>>>>
>> > >>>>>>>> Here is the latest ceph status:
>> > >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/
>> > >>>>>>>>
>> > >>>>>>>> This is the status after we started all of the OSDs 24 hours ago.
>> > >>>>>>>> Some of the OSDs did not start. However, it didn't make any difference when all of them were online.
>> > >>>>>>>>
>> > >>>>>>>> Here is the debug=20 log of an OSD, which is the same for all the others:
>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> > >>>>>>>> As we figured out, there is a loop pattern. I am sure it won't be caught by eye.
>> > >>>>>>>>
>> > >>>>>>>> This is the full log of the same OSD.
>> > >>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>> > >>>>>>>>
>> > >>>>>>>> Here is the strace of the same OSD process:
>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> > >>>>>>>>
>> > >>>>>>>> Recently we hear more and more recommendations to upgrade to mimic. I hope no one gets hurt as we did. I am sure we made lots of mistakes to let this happen. This situation may be an example for other users, and could point to a potential bug for the ceph developers.
>> > >>>>>>>>
>> > >>>>>>>> Any help to figure out what is going on would be great.
>> > >>>>>>>>
>> > >>>>>>>> Best Regards,
>> > >>>>>>>> Goktug Yildirim
>> > >>>>>>>>
>> > >>>>>>>> _______________________________________________
>> > >>>>>>>> ceph-users mailing list
>> > >>>>>>>> ceph-users@lists.ceph.com
>> > >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com