I bet the kvstore tool outputs it in a hexdump format?  There is another 
option to get the raw data, IIRC.
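
If I remember right, something along these lines writes the value out as 
raw bytes instead of the formatted dump (store path as in the commands 
quoted below; the exact "out" syntax is from memory, so check the tool's 
usage output):

  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db get osd_pg_creating creating out /tmp/creating.raw
  ceph-dencoder type creating_pgs_t import /tmp/creating.raw decode dump_json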

On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <goktug.yildi...@gmail.com> 
wrote:
>I changed the file name to make it clear.
>When I use your command with "+decode"  I'm getting an error like this:
>
>ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
>
>My ceph version: 13.2.2
>
>On Wed, 3 Oct 2018 at 20:46, Sage Weil <s...@newdream.net> wrote:
>
>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>> > If I didn't do it wrong, I got the output as below.
>> >
>> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: [default]
>> >
>> > ceph-dencoder type creating_pgs_t import dump dump_json
>>
>> Sorry, should be
>>
>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>
>> s
>>
>> > {
>> >     "last_scan_epoch": 0,
>> >     "creating_pgs": [],
>> >     "queue": [],
>> >     "created_pools": []
>> > }
>> >
>> > You can find the "dump" link below.
>> >
>> > dump:
>> > https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>> >
>> >
>> > Sage Weil <s...@newdream.net> wrote (3 Oct 2018 18:45):
>> >
>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >> We are starting to work on it. The first step is getting the
>> > >> structure out and dumping the current value, as you say.
>> > >>
>> > >> And you were correct we did not run force_create_pg.
>> > >
>> > > Great.
>> > >
>> > > So, eager to see what the current structure is... please attach once
>> > > you have it.
>> > >
>> > > The new replacement one should look like this (when hexdump -C'd):
>> > >
>> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
>> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
>> > > 0000001e
>> > >
>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch,
>> > > in hex, little endian (least significant byte first), in place of the
>> > > 0x10 that is there now.  It should dump like this:
>> > >
>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>> > > {
>> > >    "last_scan_epoch": 16,   <--- but with a recent epoch here
>> > >    "creating_pgs": [],
>> > >    "queue": [],
>> > >    "created_pools": [
>> > >        66
>> > >    ]
>> > > }
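>> > >
>> > > For example, one rough way to produce that blob and patch in the epoch
>> > > (an illustrative sketch only, using bash printf; 72883 = 0x00011cb3 is
>> > > an example epoch taken from the osd_pg_create line further down this
>> > > thread, so substitute your own):
>> > >
>> > >   # per the note above, bytes 6-9 hold the epoch, little endian:
>> > >   #   72883 = 0x00011cb3 -> b3 1c 01 00
>> > >   printf '\x02\x01\x18\x00\x00\x00\xb3\x1c\x01\x00\x00\x00\x00\x00\x01\x00\x00\x00\x42\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' > myfile
>> > >   # should now dump with last_scan_epoch 72883 and created_pools [66]
>> > >   ceph-dencoder type creating_pgs_t import myfile decode dump_json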
>> > >
>> > > sage
>> > >
>> > >
>> > >>
>> > >>> On 3 Oct 2018, at 17:52, Sage Weil <s...@newdream.net> wrote:
>> > >>>
>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >>>> Sage,
>> > >>>>
>> > >>>> Pool 66 is the only pool it shows right now. This is a pool created
>> > >>>> months ago.
>> > >>>> ceph osd lspools
>> > >>>> 66 mypool
>> > >>>>
>> > >>>> As we recreated the mon DB from the OSDs, the pools for MDS were
>> > >>>> unusable, so we deleted them.
>> > >>>> After we created another cephfs filesystem and pools, we started the
>> > >>>> MDS and it got stuck on creation. So we stopped the MDS and removed
>> > >>>> the fs and its pools. Right now we have no MDS running, nor any
>> > >>>> cephfs-related things.
>> > >>>>
>> > >>>> ceph fs dump
>> > >>>> dumped fsmap epoch 1 e1
>> > >>>> enable_multiple, ever_enabled_multiple: 0,0
>> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client
>> > >>>> writeable ranges,3=default file layouts on dirs,4=dir inode in
>> > >>>> separate object,5=mds uses versioned encoding,6=dirfrag is stored in
>> > >>>> omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>> > >>>> legacy client fscid: -1
>> > >>>>
>> > >>>> No filesystems configured
>> > >>>>
>> > >>>> ceph fs ls
>> > >>>> No filesystems enabled
>> > >>>>
>> > >>>> Now pool 66 seems to be the only pool we have, and it was created
>> > >>>> months ago. Then I guess there is something hidden out there.
>> > >>>>
>> > >>>> Is there any way to find and delete it?
>> > >>>
>> > >>> Ok, I'm concerned that the creating pg is in there if this is an old
>> > >>> pool... did you perhaps run force_create_pg at some point?  Assuming
>> > >>> you didn't, I think this is a bug in the process for rebuilding the
>> > >>> mon store.. one that doesn't normally come up because the impact is
>> > >>> this osdmap scan that is cheap in our test scenarios but clearly not
>> > >>> cheap for your aged cluster.
>> > >>>
>> > >>> In any case, there is a way to clear those out of the mon, but it's
>> > >>> a bit dicey.
>> > >>>
>> > >>> 1. stop all mons
>> > >>> 2. make a backup of all mons
>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating
>> > >>> key=creating key on one of the mons
>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import
>> > >>> FILE dump_json
>> > >>> 5. hex edit the structure to remove all of the creating pgs, and add
>> > >>> pool 66 to the created_pools member.
>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
>> > >>> 7. inject the updated structure into all of the mons
>> > >>> 8. start all mons
>> > >>>
>> > >>> 4-6 will probably be an iterative process... let's start by getting
>> > >>> the structure out and dumping the current value?
>> > >>>
>> > >>> The code to refer to in order to understand the structure is the
>> > >>> src/mon/CreatingPGs.h encode/decode methods.
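>> > >>>
>> > >>> As a rough sketch of what steps 3, 4 and 7 could look like (the mon
>> > >>> name and file paths are examples, and the get/set "out"/"in" syntax
>> > >>> is from memory, so verify against ceph-kvstore-tool's usage output):
>> > >>>
>> > >>>   # step 3: extract the raw value (with the mon stopped)
>> > >>>   ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db get osd_pg_creating creating out /tmp/creating.bin
>> > >>>   # step 4/6: decode and inspect
>> > >>>   ceph-dencoder type creating_pgs_t import /tmp/creating.bin decode dump_json
>> > >>>   # step 7: write the edited value back (repeat on each mon)
>> > >>>   ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db set osd_pg_creating creating in /tmp/creating.edited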
>> > >>>
>> > >>> sage
>> > >>>
>> > >>>
>> > >>>>
>> > >>>>
>> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <s...@newdream.net> wrote:
>> > >>>>>
>> > >>>>> Oh... I think this is the problem:
>> > >>>>>
>> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op
>> osd_pg_create(e72883
>> > >>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196
>> 66.124:60196
>> > >>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196
>66.271:60196
>> > >>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03
>16:20:01.915916
>> > >>>>>
>> > >>>>> You are in the midst of creating new pgs, and unfortunately pg
>> > >>>>> create is one of the last remaining places where the OSDs need to
>> > >>>>> look at a full history of map changes between then and the current
>> > >>>>> map epoch.  In this case, the pool was created in 60196 and it is
>> > >>>>> now 72883, ~12k epochs later.
>> > >>>>>
>> > >>>>> What is this new pool for?  Is it still empty, and if so, can we
>> > >>>>> delete it? If yes, I'm ~70% sure that will then get cleaned out at
>> > >>>>> the mon end and restarting the OSDs will make these pg_creates go
>> > >>>>> away.
>> > >>>>>
>> > >>>>> s
>> > >>>>>
>> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >>>>>>
>> > >>>>>> Hello,
>> > >>>>>>
>> > >>>>>> It seems nothing has changed.
>> > >>>>>>
>> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/
>> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius
><daz...@gmail.com>
>> wrote:
>> > >>>>>>>
>> > >>>>>>> Hello,
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> You can also reduce the osd map updates by adding this to your
>> > >>>>>>> ceph config file: "osd crush update on start = false". This should
>> > >>>>>>> remove an update that is generated when the osd starts.
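>> > >>>>>>>
>> > >>>>>>> In ceph.conf that would look something like this (just the option
>> > >>>>>>> quoted above, shown in context):
>> > >>>>>>>
>> > >>>>>>>   [osd]
>> > >>>>>>>   osd crush update on start = false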
>> > >>>>>>>
>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0
>> (leader)
>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>> > >>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit)
>log
>> [INF] :
>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47'
>> cmd=[{"prefix":
>> > >>>>>>> "osd crush set-device-class", "class": "hdd", "ids":
>["47"]}]:
>> > >>>>>>> dispatch
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0
>> (leader)
>> > >>>>>>> e14 handle_command mon_command({"prefix": "osd crush
>> create-or-move",
>> > >>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>> > >>>>>>> "root=default"]} v 0) v1
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit)
>log
>> [INF] :
>> > >>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47'
>> cmd=[{"prefix":
>> > >>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396,
>"args":
>> > >>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush
>> item name
>> > >>>>>>> 'osd.47' initial_weight 3.6396 at location
>> > >>>>>>> {host=SRV-SEKUARK8,root=default}
>> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>> > >>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full
>> prune
>> > >>>>>>> enabled
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>> > >>>>>>> <goktug.yildi...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi Sage,
>> > >>>>>>>>
>> > >>>>>>>> Thank you for your response. Now I am sure this incident is
>> > >>>>>>>> going to be resolved.
>> > >>>>>>>>
>> > >>>>>>>> The problem started when 7 servers crashed at the same time and
>> > >>>>>>>> came back after ~5 minutes.
>> > >>>>>>>>
>> > >>>>>>>> Two of our 3 mon services were restarted in this crash. Since
>> > >>>>>>>> the mon services are enabled, they should have started at nearly
>> > >>>>>>>> the same time. I don't know if this makes any difference, but
>> > >>>>>>>> some of the guys on IRC told me they are required to start in
>> > >>>>>>>> order, not at the same time; otherwise it could break things
>> > >>>>>>>> badly.
>> > >>>>>>>>
>> > >>>>>>>> After 9 days we still see 3400-3500 active+clean PGs. But in the
>> > >>>>>>>> end we have so many STUCK requests and our cluster cannot heal
>> > >>>>>>>> itself.
>> > >>>>>>>>
>> > >>>>>>>> When we set the noup flag, OSDs can catch up on epochs easily.
>> > >>>>>>>> But when we unset the flag we see so many STUCK and SLOW OPS
>> > >>>>>>>> within 1 hour.
>> > >>>>>>>> I/O load on all of my OSD disks is at around 95% utilization and
>> > >>>>>>>> never ends. CPU and RAM usage are OK.
>> > >>>>>>>> OSDs get so stuck that we can't even run "ceph pg osd.0 query".
>> > >>>>>>>>
>> > >>>>>>>> Also we tried to change the RBD pool replication size from 2 to
>> > >>>>>>>> 1. Our goal was to eliminate older PGs and leave the cluster with
>> > >>>>>>>> good ones.
>> > >>>>>>>> With replication size=1 we saw "13% PGS not active". But it
>> > >>>>>>>> didn't solve our problem.
>> > >>>>>>>>
>> > >>>>>>>> Of course we have to save 100% of the data. But we feel like
>> > >>>>>>>> even saving 50% of our data would make us very happy right now.
>> > >>>>>>>>
>> > >>>>>>>> This is what happens when the cluster starts. I believe it
>> explains the whole story very nicely.
>> > >>>>>>>>
>>
>https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> This is our ceph.conf:
>> > >>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>> > >>>>>>>>
>> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s
>&&
>> ceph health”:
>> > >>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>> > >>>>>>>>
>> > >>>>>>>> This is pg dump:
>> > >>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
>> > >>>>>>>>
>> > >>>>>>>> This is iostat & perf top:
>> > >>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>> > >>>>>>>>
>> > >>>>>>>> This strace output of ceph-osd:
>> > >>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>> > >>>>>>>>
>> > >>>>>>>> This is OSD log (default debug):
>> > >>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>> > >>>>>>>>
>> > >>>>>>>> This is leader MON log (default debug):
>> > >>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>> > >>>>>>>>
>> > >>>>>>>> These are the OSDs that failed to start. The total number is 58.
>> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and
>> debug_filestore = 20.
>> > >>>>>>>>
>>
>https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1
>and
>> debug_filestore = 20.
>> > >>>>>>>>
>>
>https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>> > >>>>>>>>
>> > >>>>>>>> As far as I understand, the OSD catches up with the mon epoch
>> > >>>>>>>> and somehow exceeds the mon epoch??
>> > >>>>>>>>
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg
>> 66.f8 e60196@2018-09-28 23:57:08.251119
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642
>> build_initial_pg_history 66.f8 created 60196
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642
>get_map
>> 60196 - loading and decoding 0x19da8400
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process 66.d8 to_process <> waiting <> waiting_peering {}
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642
>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)
>queued
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process 66.d8 to_process <OpQueueItem(66.d8
>PGPeeringEvent(epoch_sent:
>> 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10
>> e72642)> waiting <> waiting_peering {}
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642
>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)
>pg
>> 0xb579400
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch:
>72642
>> pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934]
>> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206
>les/c/f
>> 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642
>pi=[72206,72642)/1
>> crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]]
>> do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt
>> +create_info
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642
>> queue_want_up_thru want 72642 <= queued 72642, currently 72206
>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1)
>> _process empty q, waiting
>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642
>> add_map_bl 60196 50012 bytes
>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642
>get_map
>> 60197 - loading and decoding 0x19da8880
>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642
>> add_map_bl 60197 50012 bytes
>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642
>get_map
>> 60198 - loading and decoding 0x19da9180
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <s...@newdream.net>
>wrote:
>> > >>>>>>>>
>> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and
>> > >>>>>>>> you should only use it in very specific circumstances when
>> > >>>>>>>> directed by a developer.  In such cases it will allow a stuck PG
>> > >>>>>>>> to peer.  But you're not getting to that point...you're seeing
>> > >>>>>>>> some sort of resource exhaustion.
>> > >>>>>>>>
>> > >>>>>>>> The noup trick works when OSDs are way behind on maps and all
>> > >>>>>>>> need to catch up.  The way to tell if they are behind is by
>> > >>>>>>>> looking at the 'ceph daemon osd.NNN status' output and comparing
>> > >>>>>>>> to the latest OSDMap epoch that the mons have.  Were they really
>> > >>>>>>>> caught up when you unset noup?
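>> > >>>>>>>>
>> > >>>>>>>> For example, roughly (osd.0 is just an example; newest_map is the
>> > >>>>>>>> field name as I remember it from the admin socket output):
>> > >>>>>>>>
>> > >>>>>>>>   ceph daemon osd.0 status    # note the newest_map field
>> > >>>>>>>>   ceph osd dump | head -1     # "epoch NNNNN" as the mons see it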
>> > >>>>>>>>
>> > >>>>>>>> I'm just catching up and haven't read the whole thread, but I
>> > >>>>>>>> haven't seen anything that explains why the OSDs are doing lots
>> > >>>>>>>> of disk IO.  Catching up on maps could explain it, but not why
>> > >>>>>>>> they wouldn't peer once they were all marked up...
>> > >>>>>>>>
>> > >>>>>>>> sage
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>> > >>>>>>>>
>> > >>>>>>>> Has anyone heard about osd_find_best_info_ignore_history_les =
>> > >>>>>>>> true?
>> > >>>>>>>> Would that be useful here? There is so little information about
>> > >>>>>>>> it.
>> > >>>>>>>>
>> > >>>>>>>> Goktug Yildirim <goktug.yildi...@gmail.com> wrote (2 Oct 2018 22:11):
>> > >>>>>>>>
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>> Indeed, I let ceph-disk decide the wal and db partitions, since
>> > >>>>>>>> I read somewhere that it would do the proper sizing.
>> > >>>>>>>> For the bluestore cache size I have plenty of RAM. I will
>> > >>>>>>>> increase it to 8GB for each and decide on a more calculated
>> > >>>>>>>> number after the cluster settles.
>> > >>>>>>>>
>> > >>>>>>>> For the osd map loading, I've also noticed it, and it is in a
>> > >>>>>>>> loop. For that reason I started the cluster with the noup flag
>> > >>>>>>>> and waited for the OSDs to reach the up-to-date epoch number.
>> > >>>>>>>> After that I unset noup. But I did not pay attention to the
>> > >>>>>>>> manager logs. Let me check it, thank you!
>> > >>>>>>>>
>> > >>>>>>>> I am not forcing jemalloc or anything else, really. I have a
>> > >>>>>>>> very standard installation and no tweaks or tunings. All we asked
>> > >>>>>>>> for was stability versus speed from the beginning. And here we
>> > >>>>>>>> are :/
>> > >>>>>>>>
>> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius
><daz...@gmail.com>
>> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> I can see some issues from the osd log file. You have extremely
>> > >>>>>>>> small db and wal partitions: only 1GB for DB and 576MB for WAL.
>> > >>>>>>>> I would recommend cranking up the rocksdb cache size as much as
>> > >>>>>>>> possible. If you have RAM you can also increase bluestore's cache
>> > >>>>>>>> size for hdd. The default is 1GB; be as liberal as you can
>> > >>>>>>>> without getting OOM kills. You also have lots of osd map loading
>> > >>>>>>>> and decoding in the log. Are you sure all monitors/managers/osds
>> > >>>>>>>> are up to date? Plus make sure you aren't forcing jemalloc
>> > >>>>>>>> loading. I had a funny interaction after upgrading to mimic.
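>> > >>>>>>>>
>> > >>>>>>>> For illustration only (option names as I remember them; the sizes
>> > >>>>>>>> are example values, tune them to your RAM):
>> > >>>>>>>>
>> > >>>>>>>>   [osd]
>> > >>>>>>>>   # example: 8 GiB instead of the 1 GiB default for HDD OSDs
>> > >>>>>>>>   bluestore cache size hdd = 8589934592
>> > >>>>>>>>   bluestore cache size ssd = 8589934592
>> > >>>>>>>>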
>> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>> > >>>>>>>> <goktug.yildi...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hello Darius,
>> > >>>>>>>>
>> > >>>>>>>> Thanks for reply!
>> > >>>>>>>>
>> > >>>>>>>> The main problem is we cannot query PGs. "ceph pg 67.54f query"
>> > >>>>>>>> gets stuck and waits forever since the OSD is unresponsive.
>> > >>>>>>>> We are certain that the OSD becomes unresponsive as soon as it is
>> > >>>>>>>> UP. And we are certain that the OSD responds again after its disk
>> > >>>>>>>> utilization stops.
>> > >>>>>>>>
>> > >>>>>>>> So we ran a small test like this:
>> > >>>>>>>> * Stop all OSDs (168 of them).
>> > >>>>>>>> * Start OSD1. 95% osd disk utilization starts immediately. It
>> > >>>>>>>>   takes 8 mins to finish. Only after that does "ceph pg 67.54f
>> > >>>>>>>>   query" work!
>> > >>>>>>>> * While OSD1 is "up", start OSD2. As soon as OSD2 starts, OSD1 &
>> > >>>>>>>>   OSD2 both go to 95% disk utilization. This takes 17 minutes to
>> > >>>>>>>>   finish.
>> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and
>> > >>>>>>>>   it takes 25 mins to settle.
>> > >>>>>>>> * If you happen to start 5 of them at the same time, all of the
>> > >>>>>>>>   OSDs start high I/O again. And it takes 1 hour to finish.
>> > >>>>>>>>
>> > >>>>>>>> So in light of these findings we set the noup flag and started
>> > >>>>>>>> all OSDs. At first there was no I/O. After 10 minutes we unset
>> > >>>>>>>> noup. All 168 OSDs started doing high I/O. We thought that if we
>> > >>>>>>>> waited long enough it would finish & the OSDs would be responsive
>> > >>>>>>>> again. After 24 hours they were not, because the I/O did not
>> > >>>>>>>> finish or even slow down.
>> > >>>>>>>> One might think there is a lot of data there to scan. But it is
>> > >>>>>>>> just 33TB.
>> > >>>>>>>>
>> > >>>>>>>> So in short, we don't know which PG is stuck so that we can
>> > >>>>>>>> remove it.
>> > >>>>>>>>
>> > >>>>>>>> However, we ran into a weird thing half an hour ago. We exported
>> > >>>>>>>> the same PG from two different OSDs. One was 4.2GB and the other
>> > >>>>>>>> was 500KB! So we decided to export all OSDs for backup. Then we
>> > >>>>>>>> will delete the strangely sized ones and start the cluster all
>> > >>>>>>>> over. Maybe then we could solve the stuck or unfound PGs as you
>> > >>>>>>>> advise.
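>> > >>>>>>>>
>> > >>>>>>>> (For reference, such an export can be done with something like
>> > >>>>>>>> the following, with the OSD stopped; the data path and pgid are
>> > >>>>>>>> examples:)
>> > >>>>>>>>
>> > >>>>>>>>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 67.54f --op export --file /backup/67.54f.export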
>> > >>>>>>>>
>> > >>>>>>>> Any thought would be greatly appreciated.
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius
><daz...@gmail.com>
>> wrote:
>> > >>>>>>>>
>> > >>>>>>>> Hello,
>> > >>>>>>>>
>> > >>>>>>>> Currently you have 15 objects missing. I would recommend finding
>> > >>>>>>>> them and making backups of them. Ditch all other osds that are
>> > >>>>>>>> failing to start and concentrate on bringing online those that
>> > >>>>>>>> have missing objects. Then slowly turn off nodown and noout on
>> > >>>>>>>> the cluster and see if it stabilises. If it stabilises, leave
>> > >>>>>>>> these settings; if not, turn them back on.
>> > >>>>>>>> Now get some of the pgs that are blocked and query the pgs to
>> > >>>>>>>> check why they are blocked. Try removing as many blocks as
>> > >>>>>>>> possible and then remove the norebalance/norecovery flags and see
>> > >>>>>>>> if it starts to fix itself.
>> > >>>>>>>>
>> > >>>>>>>> On Tue, Oct 2, 2018 at 5:14 PM by morphin
>> > >>>>>>>> <morphinwith...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> One of the ceph experts indicated that bluestore is somewhat
>> > >>>>>>>> preview tech (as far as Red Hat is concerned).
>> > >>>>>>>> So it could be best to check out bluestore and rocksdb. There are
>> > >>>>>>>> some tools to check health and also repair, but the documentation
>> > >>>>>>>> is limited.
>> > >>>>>>>> Does anyone have experience with them?
>> > >>>>>>>> Any lead/help towards a proper check would be great.
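>> > >>>>>>>>
>> > >>>>>>>> For example, such checks can be run with something like this (the
>> > >>>>>>>> OSD must be stopped first; the path is an example):
>> > >>>>>>>>
>> > >>>>>>>>   ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
>> > >>>>>>>>   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
>> > >>>>>>>>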
>> > >>>>>>>> On Mon, 1 Oct 2018 at 22:55, Goktug Yildirim
>> > >>>>>>>> <goktug.yildi...@gmail.com> wrote:
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> Hi all,
>> > >>>>>>>>
>> > >>>>>>>> We have recently upgraded from luminous to mimic. It's been 6
>> > >>>>>>>> days since this cluster went offline. The long story, in short,
>> > >>>>>>>> is here:
>> > >>>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>> > >>>>>>>>
>> > >>>>>>>> I've also CC'ed developers since I believe this is a bug. If
>> > >>>>>>>> this is not the correct way, I apologize; please let me know.
>> > >>>>>>>>
>> > >>>>>>>> Over the 6 days lots of things happened and there were some
>> > >>>>>>>> findings about the problem. Some of them were misjudged and some
>> > >>>>>>>> of them were not looked into deeply.
>> > >>>>>>>> However, the most certain diagnosis is this: each OSD causes very
>> > >>>>>>>> high disk I/O to its bluestore disk (WAL and DB are fine). After
>> > >>>>>>>> that, OSDs become unresponsive or much less responsive. For
>> > >>>>>>>> example, "ceph tell osd.x version" gets stuck, seemingly forever.
>> > >>>>>>>>
>> > >>>>>>>> So due to unresponsive OSDs the cluster does not settle. This is
>> > >>>>>>>> our problem!
>> > >>>>>>>>
>> > >>>>>>>> This is the one we are very sure of. But we are not sure
>of the
>> reason.
>> > >>>>>>>>
>> > >>>>>>>> Here is the latest ceph status:
>> > >>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
>> > >>>>>>>>
>> > >>>>>>>> This is the status after we started all of the OSDs 24 hours
>> > >>>>>>>> ago.
>> > >>>>>>>> Some of the OSDs are not started. However, it didn't make any
>> > >>>>>>>> difference when all of them were online.
>> > >>>>>>>>
>> > >>>>>>>> Here is the debug=20 log of an OSD, which is the same for all
>> > >>>>>>>> others:
>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> > >>>>>>>> As far as we can figure out there is a loop pattern. I am sure it
>> > >>>>>>>> won't be caught by eye.
>> > >>>>>>>>
>> > >>>>>>>> This is the full log of the same OSD.
>> > >>>>>>>>
>https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>> > >>>>>>>>
>> > >>>>>>>> Here is the strace of the same OSD process:
>> > >>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> > >>>>>>>>
>> > >>>>>>>> Recently we hear more advice to upgrade to mimic. I hope no one
>> > >>>>>>>> gets hurt as we did. I am sure we made lots of mistakes to let
>> > >>>>>>>> this happen. And this situation may be an example for other users
>> > >>>>>>>> and could be a potential bug for the ceph developers.
>> > >>>>>>>>
>> > >>>>>>>> Any help to figure out what is going on would be great.
>> > >>>>>>>>
>> > >>>>>>>> Best Regards,
>> > >>>>>>>> Goktug Yildirim
>> > >>>>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
