Hi,
With --debug-objecter=20, I found that the rados ls command hangs, looping on laggy messages:
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit op 0x7efc3800dc10
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target epoch 13146 base @3 precalc_pgid 1 pgid 3.100 is_read
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target target @3 -> pgid 3.100
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _calc_target raw pgid 3.100 -> actual 3.100 acting [29,12,55] primary 29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit oid '@3' '@3' [pgnls start_epoch 13146] tid 11 osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _session_op_assign 29 11
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _send_op 11 to 3.100 on osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter put_session s=0x7efc380024c0 osd=29 4
2019-07-03 13:33:24.913 7efc402f5700 5 client.21363886.objecter 1 in flight
2019-07-03 13:33:29.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:34.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 2 client.21363886.objecter tid 11 on osd.29 is laggy
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:44.678 7efc3e2f1700 2 client.21363886.objecter tid 11 on osd.29 is laggy
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map
2019-07-03 13:33:49.679 7efc3e2f1700 10 client.21363886.objecter tick
...
I tried disabling this OSD, but the problem just moves to another OSD, and so on.
The Ceph client packages are up to date; all rbd commands still work from a monitor, but not from the OpenStack controllers.
And the other Ceph pool, on the same OSD host but on different disks, works perfectly with OpenStack...
The issue looks like these old ones, but they seem to have been fixed years ago:
https://tracker.ceph.com/issues/2454 and
https://tracker.ceph.com/issues/8515
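Beyond that, the only other idea I have is to double-check raw reachability from the hypervisor to the OSD that the objecter marks as laggy. A rough sketch of what I mean (the IP and port come from the fault message in my first mail, and the 8972-byte ping assumes an MTU of 9000, so adjust to your network):

# Where does osd.29 actually listen?
ceph osd find 29

# From the hypervisor: is that OSD's messenger port reachable at all?
nc -zv 134.158.208.37 6884

# Do full-size, unfragmented packets get through? (assumes MTU 9000, adjust -s)
ping -M do -s 8972 -c 3 134.158.208.37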
Is there anything more I can check?
Adrien
On 02/07/2019 at 14:10, Adrien Georget wrote:
Hi Eugen,
The cinder keyring used by the two pools is the same; the rbd command works using this keyring and the ceph.conf used by OpenStack, while the rados ls command stays stuck.
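For reference, this is roughly how I run the test (the pool name is just an example and the keyring path is our default, adjust to your setup):

# Plain object listing on the R&D pool with the cinder identity, as Cinder would use it
rados -p volumes-rd ls --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring -c /etc/ceph/ceph.conf

# Same identity with rbd: this one returns
rbd -p volumes-rd ls --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring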
I tried with the previous ceph-common version we used, 10.2.5, and with the latest Ceph version, 14.2.1.
With the Nautilus ceph-common version, the two cinder-volume services crashed...
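In case it helps, this is how I compare what is actually installed and connected (the package query assumes an RPM-based controller):

# Versions and feature bits reported on the cluster side
ceph versions
ceph features

# Client packages installed on the OpenStack controller (assumes an RPM-based distro)
rpm -q ceph-common librados2 librbd1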
Adrien
On 02/07/2019 at 13:50, Eugen Block wrote:
Hi,
Did you try to use the rbd and rados commands with the cinder keyring, not the admin keyring? Did you check if the caps for that client are still valid (do the caps differ between the two cinder pools)?
Are the Ceph versions on your hypervisors also Nautilus?
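Something along these lines should show it (the client name and config path are just the usual defaults, adjust to whatever your cinder.conf points at):

# Caps of the cinder client as the cluster currently sees them
ceph auth get client.cinder

# What the cinder backends are configured to use (user, pool, secret)
grep -E 'rbd_user|rbd_pool|rbd_secret_uuid' /etc/cinder/cinder.conf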
Regards,
Eugen
Quoting Adrien Georget <adrien.geor...@cc.in2p3.fr>:
Hi all,
I'm facing a very strange issue after migrating my Luminous cluster to Nautilus.
I have 2 pools configured for OpenStack Cinder volumes in a multi-backend setup: one "service" Ceph pool with cache tiering and one "R&D" Ceph pool.
After the upgrade, the R&D pool became inaccessible for Cinder and the cinder-volume service using this pool can't start anymore.
What is strange is that OpenStack and Ceph report no errors: the Ceph cluster is healthy, all OSDs are up and running, and the "service" pool still works fine with the other cinder-volume service on the same OpenStack host.
I followed the upgrade procedure exactly
(https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous)
and had no problem during the upgrade, but I can't understand why Cinder still fails with this pool.
I can access, list and create volumes on this pool with the rbd or rados commands from the monitors, but on the OpenStack hypervisor the rbd and rados ls commands stay stuck, and rados ls gives this message (134.158.208.37 is an OSD node, 10.158.246.214 an OpenStack hypervisor):

2019-07-02 11:26:15.999869 7f63484b4700 0 -- 10.158.246.214:0/1404677569 >> 134.158.208.37:6884/2457222 pipe(0x555c2bf96240 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x555c2bf97500).fault
Ceph version: 14.2.1
OpenStack: Newton
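These are the kinds of checks I have been running from a monitor without finding anything suspicious (a rough sketch; the hypervisor IP is the one from the fault message above):

# Feature bits / releases of the clients currently connected
ceph features

# Make sure the hypervisor's client address has not been blacklisted
ceph osd blacklist ls | grep 10.158.246.214

# Overall state and OSD map flags after the upgrade
ceph status
ceph osd dump | grep -E 'flags|require'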
I spent 2 days checking everything on the Ceph side but I couldn't find anything problematic...
If you have any hints that could help me, I would appreciate it :)
Adrien
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com