Thank you for documenting your progress and peril on the ML. Luckily I only have 24x 8TB HDDs and 50x 1.92TB SSDs to migrate over to bluestore.
8 nodes, 4 chassis (failure domain), 3 drives per node for the HDDs, so I'm able to do about 3 at a time (1 node) for rip/replace. Definitely taking it slow and steady, and the SSDs will move quickly for backfills as well.

Seeing about 1TB/6hr on backfills without much of a performance hit on everything else. With about 5TB average utilization on each 8TB disk, that works out to roughly 30 hours per host; times 8 hosts, that's about 10 days, so a couple of weeks is a safe amount of headroom. Write performance certainly seems better on bluestore than on filestore, so that likely helps as well.

I expect I can probably refill an SSD OSD in about an hour or two, and will likely stagger those out. But with such a small number of OSDs currently, I'm taking the by-hand approach rather than scripting it, so as to avoid similar pitfalls.

Reed

> On Jan 11, 2018, at 12:38 PM, Brady Deetz <bde...@gmail.com> wrote:
>
> I hear you on time. I have 350 x 6TB drives to convert. I recently posted about a disaster I created automating my migration. Good luck
>
> On Jan 11, 2018 12:22 PM, "Reed Dier" <reed.d...@focusvq.com> wrote:
> I am in the process of migrating my OSDs to bluestore, finally, and thought I would give you some input on how I am approaching it.
> Some of the saga you can find in another ML thread here:
> https://www.spinics.net/lists/ceph-users/msg41802.html
>
> For my first OSD I was cautious: I outed the OSD without downing it, allowing it to move its data off first.
> Some background on my cluster: this OSD is an 8TB spinner with an NVMe partition previously used for journaling in filestore, intended to be used for block.db in bluestore.
>
> Then I downed it, flushed the journal, destroyed it, zapped it with ceph-volume, set the norecover and norebalance flags, did ceph osd crush remove osd.$ID, ceph auth del osd.$ID, and ceph osd rm osd.$ID, and used ceph-volume locally to create the new LVM target. Then I unset the norecover and norebalance flags and it backfilled like normal.
>
> I initially ran into issues with specifying --osd-id causing my OSDs to fail to start, but after removing that I was able to get it to fill in the gap of the OSD I had just removed.
>
> I'm now doing quicker, more destructive migrations in an attempt to reduce data movement.
> This way I don't read from the OSD I'm replacing, write to other OSDs temporarily, read back from the temp OSDs, and write back to the 'new' OSD.
> I'm just reading from the replicas and writing to the 'new' OSD.
>
> So I'm setting the norecover and norebalance flags, downing the OSD (but not out - it stays in, and I also have the noout flag set), destroying/zapping it, recreating it using ceph-volume, and unsetting the flags, and it starts backfilling.
> For 8TB disks, with 23 other 8TB disks in the pool, it takes a long time to offload one and then backfill it back from them. I trust my disks enough to backfill from the other disks, and it's going well. Also seeing very good write performance backfilling compared to previous drive replacements in filestore, so that's very promising.
>
> Reed
>
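The destructive in-place flow described above boils down to roughly the following sequence on the OSD host. This is only a sketch: the OSD id 999, /dev/sdzz and the NVMe block.db partition are placeholders rather than Reed's actual devices, and whether the recreated OSD comes back with the same id is exactly the sticking point discussed further down-thread.

    # keep the cluster from reacting while the OSD is rebuilt in place
    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance

    # stop the OSD, but leave it "in" so its PGs stay mapped to it
    systemctl stop ceph-osd@999

    # mark the id destroyed (rather than removing it) and wipe the old device
    ceph osd destroy 999 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdzz

    # recreate as bluestore, putting block.db on the old NVMe journal partition
    ceph-volume lvm create --bluestore --data /dev/sdzz --block.db /dev/nvme0n1p1

    # let backfill repopulate the new OSD from the replicas
    ceph osd unset norebalance
    ceph osd unset norecover
    ceph osd unset noout

Keeping the OSD "in" with noout/norecover/norebalance set is what avoids the double data movement: the PGs stay mapped to osd.999 and are backfilled onto the fresh bluestore device straight from their replicas once the flags are unset.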
>> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
>>
>> Hi Alfredo,
>>
>> thank you for your comments:
>>
>> Zitat von Alfredo Deza <ad...@redhat.com>:
>>> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen <jmozd...@nde.ag> wrote:
>>>> Dear *,
>>>>
>>>> Has anybody been successful migrating Filestore OSDs to Bluestore OSDs while keeping the OSD number? There have been a number of messages on the list reporting problems, and my experience is the same. (Removing the existing OSD and creating a new one does work for me.)
>>>>
>>>> I'm working on a Ceph 12.2.2 cluster and tried following
>>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>>>> which basically says:
>>>>
>>>> 1. destroy the old OSD
>>>> 2. zap the disk
>>>> 3. prepare the new OSD
>>>> 4. activate the new OSD
>>>>
>>>> I never got step 4 to complete. The closest I got was by doing the following steps (assuming OSD ID "999" on /dev/sdzz):
>>>>
>>>> 1. stop the old OSD via systemd (osd-node # systemctl stop ceph-osd@999.service)
>>>>
>>>> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>>>>
>>>> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's volume group
>>>>
>>>> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>>>>
>>>> 4. destroy the old OSD (osd-node # ceph osd destroy 999 --yes-i-really-mean-it)
>>>>
>>>> 5. create a new OSD entry (osd-node # ceph osd new $(cat /var/lib/ceph/osd/ceph-999/fsid) 999)
>>>
>>> Steps 5 and 6 are problematic if you are going to be trying ceph-volume later on, which takes care of doing this for you.
>>>
>>>> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-999/keyring)
>>
>> At first I tried to follow the documented steps (without my steps 5 and 6), which did not work for me. The documented approach failed with "init authentication failed: (1) Operation not permitted", because ceph-volume did not actually add the auth entry for me.
>>
>> But even after manually adding the authentication, the "ceph-volume" approach failed, as the OSD was still marked "destroyed" in the osdmap epoch used by ceph-osd (see the commented messages from ceph-osd.999.log below).
>>
>>>> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore --osd-id 999 --data /dev/sdzz)
>>>
>>> You are going to hit a bug in ceph-volume that prevents you from specifying the OSD id directly if that ID has been destroyed.
>>>
>>> See http://tracker.ceph.com/issues/22642
>>
>> If I read that bug description correctly, you're confirming why I needed step #6 above (manually adding the OSD auth entry). But even if ceph-volume had added it, the ceph-osd.log entries suggest that starting the OSD would still have failed, because it was accessing the wrong osdmap epoch.
>>
>> To me it seems like I'm hitting a bug outside of ceph-volume - unless it's ceph-volume that somehow determines which osdmap epoch is used by ceph-osd.
>>
>>> In order for this to work, you would need to make sure that the ID has really been destroyed and avoid passing --osd-id in ceph-volume. The caveat is that you will get whatever ID is available next in the cluster.
>>
>> Yes, that's the work-around I then used - purge the old OSD and create a new one.
>>
>> Thanks & regards,
>> Jens
>>
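The work-around Alfredo suggests and Jens ended up using - give up on keeping the OSD number, remove the old OSD completely and let ceph-volume take the next free id - looks roughly like this. Again a sketch with placeholder id and device; "ceph osd purge" (available since Luminous) is shorthand for the crush remove / auth del / osd rm sequence.

    systemctl stop ceph-osd@999
    ceph-volume lvm zap /dev/sdzz

    # remove all traces of the old OSD: crush entry, auth key and the id itself
    ceph osd purge 999 --yes-i-really-mean-it

    # no --osd-id here: the new OSD gets whatever id is free next
    ceph-volume lvm create --bluestore --data /dev/sdzz

The price is a new OSD id, which is what the keep-the-number attempts above were trying to avoid.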
>>>> [...]
>>>> --- cut here ---
>>>> # first of multiple attempts, before "ceph auth add ..."
>>>> # no actual epoch referenced, as login failed due to missing auth
>>>> 2018-01-10 00:00:02.173983 7f5cf1c89d00 0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for clients
>>>> 2018-01-10 00:00:02.173990 7f5cf1c89d00 0 osd.999 0 crush map has features 288232575208783872 was 8705, adjusting msgr requires for mons
>>>> 2018-01-10 00:00:02.173994 7f5cf1c89d00 0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for osds
>>>> 2018-01-10 00:00:02.174046 7f5cf1c89d00 0 osd.999 0 load_pgs
>>>> 2018-01-10 00:00:02.174051 7f5cf1c89d00 0 osd.999 0 load_pgs opened 0 pgs
>>>> 2018-01-10 00:00:02.174055 7f5cf1c89d00 0 osd.999 0 using weightedpriority op queue with priority op cut off at 64.
>>>> 2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors {default=true}
>>>> 2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init authentication failed: (1) Operation not permitted
>>>>
>>>> # after "ceph auth ..."
>>>> # note the different epochs below? BTW, 110587 is the current epoch at that time and osd.999 is marked destroyed there
>>>> # 109892: much too old to offer any details
>>>> # 110587: modified 2018-01-09 23:43:13.202381
>>>>
>>>> 2018-01-10 00:08:00.945507 7fc55905bd00 0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for clients
>>>> 2018-01-10 00:08:00.945514 7fc55905bd00 0 osd.999 0 crush map has features 288232575208783872 was 8705, adjusting msgr requires for mons
>>>> 2018-01-10 00:08:00.945521 7fc55905bd00 0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for osds
>>>> 2018-01-10 00:08:00.945588 7fc55905bd00 0 osd.999 0 load_pgs
>>>> 2018-01-10 00:08:00.945594 7fc55905bd00 0 osd.999 0 load_pgs opened 0 pgs
>>>> 2018-01-10 00:08:00.945599 7fc55905bd00 0 osd.999 0 using weightedpriority op queue with priority op cut off at 64.
>>>> 2018-01-10 00:08:00.946544 7fc55905bd00 -1 osd.999 0 log_to_monitors {default=true}
>>>> 2018-01-10 00:08:00.951720 7fc55905bd00 0 osd.999 0 done with init, starting boot process
>>>> 2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for initial osdmap
>>>> 2018-01-10 00:08:00.970644 7fc546614700 0 osd.999 109892 crush map has features 288232610642264064, adjusting msgr requires for clients
>>>> 2018-01-10 00:08:00.970653 7fc546614700 0 osd.999 109892 crush map has features 288232610642264064 was 288232575208792577, adjusting msgr requires for mons
>>>> 2018-01-10 00:08:00.970660 7fc546614700 0 osd.999 109892 crush map has features 1008808551021559808, adjusting msgr requires for osds
>>>> 2018-01-10 00:08:01.349602 7fc546614700 -1 osd.999 110587 osdmap says I am destroyed, exiting
>>>>
>>>> # another try
>>>> # it is now using epoch 110587 for everything. But that one is off by one at that time already:
>>>> # 110587: modified 2018-01-09 23:43:13.202381
>>>> # 110588: modified 2018-01-10 00:12:55.271913
>>>>
>>>> # but both 110587 and 110588 have osd.999 as "destroyed", so never mind.
>>>> 2018-01-10 00:13:04.332026 7f408d5a4d00 0 osd.999 110587 crush map has features 288232610642264064, adjusting msgr requires for clients
>>>> 2018-01-10 00:13:04.332037 7f408d5a4d00 0 osd.999 110587 crush map has features 288232610642264064 was 8705, adjusting msgr requires for mons
>>>> 2018-01-10 00:13:04.332043 7f408d5a4d00 0 osd.999 110587 crush map has features 1008808551021559808, adjusting msgr requires for osds
>>>> 2018-01-10 00:13:04.332092 7f408d5a4d00 0 osd.999 110587 load_pgs
>>>> 2018-01-10 00:13:04.332096 7f408d5a4d00 0 osd.999 110587 load_pgs opened 0 pgs
>>>> 2018-01-10 00:13:04.332100 7f408d5a4d00 0 osd.999 110587 using weightedpriority op queue with priority op cut off at 64.
>>>> 2018-01-10 00:13:04.332990 7f408d5a4d00 -1 osd.999 110587 log_to_monitors {default=true}
>>>> 2018-01-10 00:13:06.026628 7f408d5a4d00 0 osd.999 110587 done with init, starting boot process
>>>> 2018-01-10 00:13:06.027627 7f4075352700 -1 osd.999 110587 osdmap says I am destroyed, exiting
>>>>
>>>> # the attempt after using "ceph osd new", which created epoch 110591 as the first with osd.999 as autoout,exists,new
>>>> # But ceph-osd still uses 110587.
>>>> # 110587: modified 2018-01-09 23:43:13.202381
>>>> # 110591: modified 2018-01-10 00:30:44.850078
>>>>
>>>> 2018-01-10 00:31:15.453871 7f1c57c58d00 0 osd.999 110587 crush map has features 288232610642264064, adjusting msgr requires for clients
>>>> 2018-01-10 00:31:15.453882 7f1c57c58d00 0 osd.999 110587 crush map has features 288232610642264064 was 8705, adjusting msgr requires for mons
>>>> 2018-01-10 00:31:15.453887 7f1c57c58d00 0 osd.999 110587 crush map has features 1008808551021559808, adjusting msgr requires for osds
>>>> 2018-01-10 00:31:15.453940 7f1c57c58d00 0 osd.999 110587 load_pgs
>>>> 2018-01-10 00:31:15.453945 7f1c57c58d00 0 osd.999 110587 load_pgs opened 0 pgs
>>>> 2018-01-10 00:31:15.453952 7f1c57c58d00 0 osd.999 110587 using weightedpriority op queue with priority op cut off at 64.
>>>> 2018-01-10 00:31:15.454862 7f1c57c58d00 -1 osd.999 110587 log_to_monitors {default=true}
>>>> 2018-01-10 00:31:15.520533 7f1c57c58d00 0 osd.999 110587 done with init, starting boot process
>>>> 2018-01-10 00:31:15.521278 7f1c40207700 -1 osd.999 110587 osdmap says I am destroyed, exiting
>>>> --- cut here ---
>>>> [...]
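The epoch comparison in the comments above (the "modified" timestamps and the state flags of osd.999 in a given osdmap epoch) can be reproduced along these lines - a sketch, using the epoch numbers from the log excerpt:

    # state of osd.999 in the current osdmap (look for exists/new/destroyed flags)
    ceph osd dump | grep -e '^epoch' -e '^modified' -e '^osd\.999 '

    # the same for a specific historical epoch, e.g. the one ceph-osd kept using
    ceph osd dump 110587 | grep -e '^epoch' -e '^modified' -e '^osd\.999 '

    # or fetch the binary map for that epoch and decode it offline
    ceph osd getmap 110587 -o /tmp/osdmap.110587
    osdmaptool --print /tmp/osdmap.110587 | grep '^osd\.999 '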
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com