This is our cluster state right now. I can run "rbd list" again and that's good! Thanks a lot Sage!!! ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/
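For anyone hitting the same problem: the recovery from Sage's mails quoted below boils down to replacing the creating_pgs_t record in the mon store. A rough sketch of the commands follows (our mon store path; the file names, the example epoch 72883, and the "set ... in" form for the inject step are written from memory rather than verified, so check them against ceph-kvstore-tool --help on your version). The current OSDMap epoch can be read from "ceph osd dump | head -1" before stopping the mons.

# 1-2. stop all mons and back up every store.db first
# 3. extract the current creating_pgs structure from one mon
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db get osd_pg_creating creating out creating.cur
# 4. dump it to see what is really in there
ceph-dencoder type creating_pgs_t import creating.cur decode dump_json
# 5. build the 30-byte replacement from Sage's hexdump below, with a recent
#    OSDMap epoch in bytes 6-9, little endian. Example for epoch 72883 = 0x00011cb3:
echo "02 01 18 00 00 00 b3 1c 01 00 00 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 00 00 00 00" | xxd -r -p > creating.new
# 6. verify it decodes with the right last_scan_epoch and created_pools [66]
ceph-dencoder type creating_pgs_t import creating.new decode dump_json
# 7-8. inject it into all three mons, then start the mons again
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db set osd_pg_creating creating in creating.new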
As you can see we have 2 unfound pg since some of our OSDs can not start. 58 OSD gives different errors. How can I fix these OSD's? If I remember correctly it should not be so much trouble. These are OSDs' failed logs. https://paste.ubuntu.com/p/ZfRD5ZtvpS/ https://paste.ubuntu.com/p/pkRdVjCH4D/ https://paste.ubuntu.com/p/zJTf2fzSj9/ https://paste.ubuntu.com/p/xpJRK6YhRX/ https://paste.ubuntu.com/p/SY3576dNbJ/ https://paste.ubuntu.com/p/smyT6Y976b/ > On 3 Oct 2018, at 21:37, Sage Weil <s...@newdream.net> wrote: > > On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote: >> I'm so sorry about that I missed "out" parameter. My bad.. >> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/ > > Excellent, thanks. That looks like it confirms the problem is that teh > recovery tool didn't repopulate the creating pgs properly. > > If you take that 30 byte file I sent earlier (as hex) and update the > osdmap epoch to the latest on the mon, confirm it decodes and dumps > properly, and then inject it on the 3 mons, that should get you past this > hump (and hopefully back up!). > > sage > > >> >> Sage Weil <s...@newdream.net> şunları yazdı (3 Eki 2018 21:13): >> >>> I bet the kvstore output it in a hexdump format? There is another option >>> to get the raw data iirc >>> >>> >>> >>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM >>>> <goktug.yildi...@gmail.com> wrote: >>>> I changed the file name to make it clear. >>>> When I use your command with "+decode" I'm getting an error like this: >>>> >>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json >>>> error: buffer::malformed_input: void >>>> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand >>>> old encoding version 2 < 111 >>>> >>>> My ceph version: 13.2.2 >>>> >>>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil <s...@newdream.net> şunu >>>> yazdı: >>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote: >>>>>> If I didn't do it wrong, I got the output as below. >>>>>> >>>>>> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ >>>>>> get osd_pg_creating creating > dump >>>>>> 2018-10-03 20:08:52.070 7f07f5659b80 1 rocksdb: do_open column >>>>>> families: [default] >>>>>> >>>>>> ceph-dencoder type creating_pgs_t import dump dump_json >>>>> >>>>> Sorry, should be >>>>> >>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json >>>>> >>>>> s >>>>> >>>>>> { >>>>>> "last_scan_epoch": 0, >>>>>> "creating_pgs": [], >>>>>> "queue": [], >>>>>> "created_pools": [] >>>>>> } >>>>>> >>>>>> You can find the "dump" link below. >>>>>> >>>>>> dump: >>>>>> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing >>>>>> >>>>>> >>>>>> Sage Weil <s...@newdream.net> şunları yazdı (3 Eki 2018 18:45): >>>>>> >>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote: >>>>>>>> We are starting to work on it. First step is getting the structure out >>>>>>>> and dumping the current value as you say. >>>>>>>> >>>>>>>> And you were correct we did not run force_create_pg. >>>>>>> >>>>>>> Great. >>>>>>> >>>>>>> So, eager to see what the current structure is... please attach once >>>>>>> you >>>>>>> have it. 
>>>>>>> >>>>>>> The new replacement one should look like this (when hexdump -C'd): >>>>>>> >>>>>>> 00000000 02 01 18 00 00 00 10 00 00 00 00 00 00 00 01 00 >>>>>>> |................| >>>>>>> 00000010 00 00 42 00 00 00 00 00 00 00 00 00 00 00 >>>>>>> |..B...........| >>>>>>> 0000001e >>>>>>> >>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, in >>>>>>> hex, little endian (least significant byte first), in place of the 0x10 >>>>>>> that is there now. It should dump like this: >>>>>>> >>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json >>>>>>> { >>>>>>> "last_scan_epoch": 16, <--- but with a recent epoch here >>>>>>> "creating_pgs": [], >>>>>>> "queue": [], >>>>>>> "created_pools": [ >>>>>>> 66 >>>>>>> ] >>>>>>> } >>>>>>> >>>>>>> sage >>>>>>> >>>>>>> >>>>>>>> >>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <s...@newdream.net> wrote: >>>>>>>>> >>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote: >>>>>>>>>> Sage, >>>>>>>>>> >>>>>>>>>> Pool 66 is the only pool it shows right now. This a pool created >>>>>>>>>> months ago. >>>>>>>>>> ceph osd lspools >>>>>>>>>> 66 mypool >>>>>>>>>> >>>>>>>>>> As we recreated mon db from OSDs, the pools for MDS was unusable. So >>>>>>>>>> we deleted them. >>>>>>>>>> After we create another cephfs fs and pools we started MDS and it >>>>>>>>>> stucked on creation. So we stopped MDS and removed fs and fs pools. >>>>>>>>>> Right now we do not have MDS running nor we have cephfs related >>>>>>>>>> things. >>>>>>>>>> >>>>>>>>>> ceph fs dump >>>>>>>>>> dumped fsmap epoch 1 e1 >>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0 >>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client >>>>>>>>>> writeable ranges,3=default file layouts on dirs,4=dir inode in >>>>>>>>>> separate object,5=mds uses versioned encoding,6=dirfrag is stored in >>>>>>>>>> omap,8=no anchor table,9=file layout v2,10=snaprealm v2} >>>>>>>>>> legacy client fscid: -1 >>>>>>>>>> >>>>>>>>>> No filesystems configured >>>>>>>>>> >>>>>>>>>> ceph fs ls >>>>>>>>>> No filesystems enabled >>>>>>>>>> >>>>>>>>>> Now pool 66 seems to only pool we have and it has been created >>>>>>>>>> months ago. Then I guess there is something hidden out there. >>>>>>>>>> >>>>>>>>>> Is there any way to find and delete it? >>>>>>>>> >>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an old >>>>>>>>> pool... did you perhaps run force_create_pg at some point? Assuming >>>>>>>>> you >>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon >>>>>>>>> store.. one that doesn't normally come up because the impact is this >>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not cheap >>>>>>>>> for >>>>>>>>> your aged cluster. >>>>>>>>> >>>>>>>>> In any case, there is a way to clear those out of the mon, but it's a >>>>>>>>> bit >>>>>>>>> dicey. >>>>>>>>> >>>>>>>>> 1. stop all mons >>>>>>>>> 2. make a backup of all mons >>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating >>>>>>>>> key=creating key on one of the mons >>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE >>>>>>>>> dump_json >>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and adds >>>>>>>>> pool >>>>>>>>> 66 to the created_pgs member. >>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct... >>>>>>>>> 7. inject the updated structure into all of the mons >>>>>>>>> 8. start all mons >>>>>>>>> >>>>>>>>> 4-6 will probably be an iterative process... 
let's start by getting >>>>>>>>> the >>>>>>>>> structure out and dumping the current value? >>>>>>>>> >>>>>>>>> The code to refer to to understand the structure is >>>>>>>>> src/mon/CreatingPGs.h >>>>>>>>> encode/decode methods. >>>>>>>>> >>>>>>>>> sage >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <s...@newdream.net> wrote: >>>>>>>>>>> >>>>>>>>>>> Oh... I think this is the problem: >>>>>>>>>>> >>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op >>>>>>>>>>> osd_pg_create(e72883 >>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 >>>>>>>>>>> 66.124:60196 >>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 >>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916 >>>>>>>>>>> >>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg >>>>>>>>>>> create is >>>>>>>>>>> one of the last remaining places where the OSDs need to look at a >>>>>>>>>>> full >>>>>>>>>>> history of map changes between then and the current map epoch. In >>>>>>>>>>> this >>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k >>>>>>>>>>> epochs >>>>>>>>>>> later. >>>>>>>>>>> >>>>>>>>>>> What is this new pool for? Is it still empty, and if so, can we >>>>>>>>>>> delete >>>>>>>>>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon >>>>>>>>>>> end >>>>>>>>>>> and restarting the OSDs will make these pg_creates go away. >>>>>>>>>>> >>>>>>>>>>> s >>>>>>>>>>> >>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hello, >>>>>>>>>>>> >>>>>>>>>>>> It seems nothing has changed. >>>>>>>>>>>> >>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ >>>>>>>>>>>> <https://paste.ubuntu.com/p/MtvTr5HYW4/> >>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ >>>>>>>>>>>> <https://paste.ubuntu.com/p/7Sx64xGzkR/> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daz...@gmail.com> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hello, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your >>>>>>>>>>>>> ceph >>>>>>>>>>>>> config file. "osd crush update on start = false". This should >>>>>>>>>>>>> remove >>>>>>>>>>>>> and update that is generated when osd starts. 
>>>>>>>>>>>>> >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader) >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush >>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1 >>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 log_channel(audit) log >>>>>>>>>>>>> [INF] : >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' >>>>>>>>>>>>> cmd=[{"prefix": >>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]: >>>>>>>>>>>>> dispatch >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader) >>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush >>>>>>>>>>>>> create-or-move", >>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8", >>>>>>>>>>>>> "root=default"]} v 0) v1 >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 log_channel(audit) log >>>>>>>>>>>>> [INF] : >>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' >>>>>>>>>>>>> cmd=[{"prefix": >>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args": >>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch >>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item >>>>>>>>>>>>> name >>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location >>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default} >>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700 1 >>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune >>>>>>>>>>>>> enabled >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim >>>>>>>>>>>>> <goktug.yildi...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Sage, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is >>>>>>>>>>>>>> going to be resolved. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The problem started when 7 server crashed same time and they >>>>>>>>>>>>>> came back after ~5 minutes. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since >>>>>>>>>>>>>> mon services are enabled they should be started nearly at the >>>>>>>>>>>>>> same time. I dont know if this makes any difference but some of >>>>>>>>>>>>>> the guys on IRC told it is required that they start in order not >>>>>>>>>>>>>> at the same time. Otherwise it could break things badly. >>>>>>>>>>>>>> >>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the >>>>>>>>>>>>>> end we have so many STUCK request and our cluster can not heal >>>>>>>>>>>>>> itself. >>>>>>>>>>>>>> >>>>>>>>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But when >>>>>>>>>>>>>> we unset the flag we see so many STUCKS and SLOW OPS in 1 hour. >>>>>>>>>>>>>> I/O load on all of my OSD disks are at around %95 utilization >>>>>>>>>>>>>> and never ends. CPU and RAM usage are OK. >>>>>>>>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our >>>>>>>>>>>>>> goal was the eliminate older PG's and leaving cluster with good >>>>>>>>>>>>>> ones. >>>>>>>>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it >>>>>>>>>>>>>> didn’t solve our problem. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Of course we have to save %100 of data. But we feel like even >>>>>>>>>>>>>> saving %50 of our data will be make us very happy right now. 
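(As a side note, a quick way to tell whether the OSDs have really caught up before unsetting noup, assuming the admin sockets still answer, is to compare epochs, e.g.:

ceph osd dump | head -1      # current cluster epoch, printed as "epoch NNNNN"
ceph daemon osd.150 status   # run on that OSD's host; compare its "newest_map"

If "newest_map" is far behind the cluster epoch, the OSD is still chewing through old maps.)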
>>>>>>>>>>>>>> >>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it >>>>>>>>>>>>>> explains the whole story very nicely. >>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is our ceph.conf: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph >>>>>>>>>>>>>> health”: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is pg dump: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is iostat & perf top: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This strace output of ceph-osd: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is OSD log (default debug): >>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is leader MON log (default debug): >>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> These are OSDs failed to start. Total number is 58. >>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/ >>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and >>>>>>>>>>>>>> debug_filestore = 20. >>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and >>>>>>>>>>>>>> debug_filestore = 20. >>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing >>>>>>>>>>>>>> >>>>>>>>>>>>>> As far as I understand OSD catchs up with the mon epoch and >>>>>>>>>>>>>> exceeds mon epoch somehow?? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 >>>>>>>>>>>>>> e60196@2018-09-28 23:57:08.251119 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 >>>>>>>>>>>>>> build_initial_pg_history 66.f8 created 60196 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>> 60196 - loading and decoding 0x19da8400 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process 66.d8 to_process <> waiting <> waiting_peering {} >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 >>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 >>>>>>>>>>>>>> e72642) queued >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process 66.d8 to_process <OpQueueItem(66.d8 >>>>>>>>>>>>>> PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt >>>>>>>>>>>>>> +create_info) prio 255 cost 10 e72642)> waiting <> >>>>>>>>>>>>>> waiting_peering {} >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 >>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 >>>>>>>>>>>>>> e72642) pg 0xb579400 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 >>>>>>>>>>>>>> pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] >>>>>>>>>>>>>> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c >>>>>>>>>>>>>> 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 >>>>>>>>>>>>>> lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod >>>>>>>>>>>>>> 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: >>>>>>>>>>>>>> 72642 epoch_requested: 72642 NullEvt +create_info >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 >>>>>>>>>>>>>> queue_want_up_thru want 72642 <= queued 72642, currently 72206 >>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>> _process empty q, waiting >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl >>>>>>>>>>>>>> 60196 50012 bytes >>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>> 60197 - loading and decoding 0x19da8880 >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl >>>>>>>>>>>>>> 60197 50012 bytes >>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>> 60198 - loading and decoding 0x19da9180 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <s...@newdream.net> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and >>>>>>>>>>>>>> you should >>>>>>>>>>>>>> only use it in very specific circumstances when directed by a >>>>>>>>>>>>>> developer. >>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer. But you're not >>>>>>>>>>>>>> getting to >>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all >>>>>>>>>>>>>> need to >>>>>>>>>>>>>> catch up. 
The way to tell if they are behind is by looking at >>>>>>>>>>>>>> the 'ceph >>>>>>>>>>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap >>>>>>>>>>>>>> epoch tha >>>>>>>>>>>>>> t the mons have. Were they really caught up when you unset noup? >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I >>>>>>>>>>>>>> haven't seen >>>>>>>>>>>>>> anything that explains why teh OSDs are dong lots of disk IO. >>>>>>>>>>>>>> Catching up >>>>>>>>>>>>>> on maps could explain it but not why they wouldn't peer once >>>>>>>>>>>>>> they were all >>>>>>>>>>>>>> marked up... >>>>>>>>>>>>>> >>>>>>>>>>>>>> sage >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ? >>>>>>>>>>>>>> Is that be usefull here? There is such a less information about >>>>>>>>>>>>>> it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com> şunları yazdı (2 Eki >>>>>>>>>>>>>> 2018 22:11): >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when >>>>>>>>>>>>>> I read somewhere that that will do the proper sizing. >>>>>>>>>>>>>> For the blustore cache size I have plenty of RAM. I will >>>>>>>>>>>>>> increase 8GB for each and decide a more calculated number >>>>>>>>>>>>>> after cluster settles. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For the osd map loading I’ve also figured it out. And it is in >>>>>>>>>>>>>> loop. For that reason I started cluster with noup flag and >>>>>>>>>>>>>> waited OSDs to reach the uptodate epoch number. After that I >>>>>>>>>>>>>> unset noup. But I did not pay attention to manager logs. Let me >>>>>>>>>>>>>> check it, thank you! >>>>>>>>>>>>>> >>>>>>>>>>>>>> I am not forcing jmellac or anything else really. I have a very >>>>>>>>>>>>>> standard installation and no tweaks or tunings. All we ask for >>>>>>>>>>>>>> the stability versus speed from the begining. And here we are :/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daz...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I can see some issues from the osd log file. You have an >>>>>>>>>>>>>> extremely low >>>>>>>>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I >>>>>>>>>>>>>> would >>>>>>>>>>>>>> recommend cranking up rocksdb cache size as much as possible. If >>>>>>>>>>>>>> you >>>>>>>>>>>>>> have RAM you can also increase bluestores cache size for hdd. >>>>>>>>>>>>>> Default >>>>>>>>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You >>>>>>>>>>>>>> also >>>>>>>>>>>>>> have lots of osd map loading and decoding in the log. Are you >>>>>>>>>>>>>> sure all >>>>>>>>>>>>>> monitors/managers/osds are up to date? Plus make sure you aren't >>>>>>>>>>>>>> forcing jemalloc loading. I had a funny interaction after >>>>>>>>>>>>>> upgrading to >>>>>>>>>>>>>> mimic. >>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim >>>>>>>>>>>>>> <goktug.yildi...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello Darius, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks for reply! >>>>>>>>>>>>>> >>>>>>>>>>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” >>>>>>>>>>>>>> does stucks and wait forever since OSD is unresponsive. >>>>>>>>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. 
And >>>>>>>>>>>>>> we are certain that OSD responds again after its disk >>>>>>>>>>>>>> utilization stops. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So we have a small test like that: >>>>>>>>>>>>>> * Stop all OSDs (168 of them) >>>>>>>>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It >>>>>>>>>>>>>> takes 8 mins to finish. Only after that “ceph pg 67.54f query” >>>>>>>>>>>>>> works! >>>>>>>>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & >>>>>>>>>>>>>> OSD2 starts %95 disk utilization. This takes 17 minutes to >>>>>>>>>>>>>> finish. >>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and >>>>>>>>>>>>>> it takes 25 mins to settle. >>>>>>>>>>>>>> * If you happen to start 5 of them at the same all of the OSDs >>>>>>>>>>>>>> start high I/O again. And it takes 1 hour to finish. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So in the light of these findings we flagged noup, started all >>>>>>>>>>>>>> OSDs. At first there was no I/O. After 10 minutes we unset noup. >>>>>>>>>>>>>> All of 168 OSD started to make high I/O. And we thought that if >>>>>>>>>>>>>> we wait long enough it will finish & OSDs will be responsive >>>>>>>>>>>>>> again. After 24hours they did not because I/O did not finish or >>>>>>>>>>>>>> even slowed down. >>>>>>>>>>>>>> One can think that is a lot of data there to scan. But it is >>>>>>>>>>>>>> just 33TB. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So at short we dont know which PG is stuck so we can remove it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> However we met an weird thing half an hour ago. We exported the >>>>>>>>>>>>>> same PG from two different OSDs. One was 4.2GB and the other is >>>>>>>>>>>>>> 500KB! So we decided to export all OSDs for backup. Then we will >>>>>>>>>>>>>> delete strange sized ones and start the cluster all over. Maybe >>>>>>>>>>>>>> then we could solve the stucked or unfound PGs as you advise. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Any thought would be greatly appreciated. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daz...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend finding >>>>>>>>>>>>>> them >>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are >>>>>>>>>>>>>> failing to >>>>>>>>>>>>>> start and concentrate on bringing online those that have missing >>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster >>>>>>>>>>>>>> and see >>>>>>>>>>>>>> if it stabilises. If it stabilises leave these setting if not >>>>>>>>>>>>>> turn >>>>>>>>>>>>>> them back on. >>>>>>>>>>>>>> Now get some of the pg's that are blocked and querry the pgs to >>>>>>>>>>>>>> check >>>>>>>>>>>>>> why they are blocked. Try removing as much blocks as possible >>>>>>>>>>>>>> and then >>>>>>>>>>>>>> remove the norebalance/norecovery flags and see if it starts to >>>>>>>>>>>>>> fix >>>>>>>>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin >>>>>>>>>>>>>> <morphinwith...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> One of ceph experts indicated that bluestore is somewhat preview >>>>>>>>>>>>>> tech >>>>>>>>>>>>>> (as for Redhat). >>>>>>>>>>>>>> So it could be best to checkout bluestore and rocksdb. There are >>>>>>>>>>>>>> some >>>>>>>>>>>>>> tools to check health and also repair. But there are limited >>>>>>>>>>>>>> documentation. >>>>>>>>>>>>>> Anyone who has experince with it? >>>>>>>>>>>>>> Anyone lead/help to a proper check would be great. 
>>>>>>>>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com>, 1 Eki 2018 Pzt, >>>>>>>>>>>>>> 22:55 >>>>>>>>>>>>>> tarihinde şunu yazdı: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>> >>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 >>>>>>>>>>>>>> days since this cluster is offline. The long short story is >>>>>>>>>>>>>> here: >>>>>>>>>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html >>>>>>>>>>>>>> >>>>>>>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If >>>>>>>>>>>>>> this is not to correct way I apology and please let me know. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For the 6 days lots of thing happened and there were some >>>>>>>>>>>>>> outcomes about the problem. Some of them was misjudged and some >>>>>>>>>>>>>> of them are not looked deeper. >>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes very >>>>>>>>>>>>>> high disk I/O to its bluestore disk (WAL and DB are fine). After >>>>>>>>>>>>>> that OSDs become unresponsive or very very less responsive. For >>>>>>>>>>>>>> example "ceph tell osd.x version” stucks like for ever. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is our >>>>>>>>>>>>>> problem! >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of the >>>>>>>>>>>>>> reason. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the latest ceph status: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours ago. >>>>>>>>>>>>>> Some of the OSDs are not started. However it didnt make any >>>>>>>>>>>>>> difference when all of them was online. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the debug=20 log of an OSD which is same for all others: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/ >>>>>>>>>>>>>> As we figure out there is a loop pattern. I am sure it wont >>>>>>>>>>>>>> caught from eye. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This the full log the same OSD. >>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the strace of the same OSD process: >>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts as >>>>>>>>>>>>>> we do. I am sure we have done lots of mistakes to let this >>>>>>>>>>>>>> happening. And this situation may be a example for other user >>>>>>>>>>>>>> and could be a potential bug for ceph developer. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Any help to figure out what is going on would be great. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best Regards, >>>>>>>>>>>>>> Goktug Yildirim >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> ceph-users mailing list >>>>>>>>>>>>>> ceph-users@lists.ceph.com >>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>>>>>> >>>>>> >>>>>> _______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com