Hello!

I've run into a bit of an issue with one of our radosgw production clusters.

The setup is two radosgw nodes behind haproxy load balancing, which in turn are 
connected to the Ceph cluster. Everything is running 14.2.2, so Nautilus. It's 
tied to an OpenStack cluster, so Keystone is the authentication backend (that 
shouldn't really matter here, though).

Today both rgw backends crashed. Checking the logs, it seems to be related to 
dynamic resharding of a bucket, causing Lock errors:

Logs snippet: https://pastebin.com/uBCnhinF

Following 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021368.html 
(an old thread), I performed a manual reshard of the affected bucket, which 
succeeded:

    radosgw-admin bucket reshard --bucket="XXX/YYY" --num-shards=256

Checking the bucket's metadata, it now correctly shows 256 shards, up from 128.
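
(In case anyone wants to double-check the same thing, this is the lookup I 
mean; the bucket-instance id is a placeholder for the id that the first command 
returns, and I'm assuming the tenant/bucket syntax works the same for tenanted 
metadata keys:

    radosgw-admin metadata get bucket:XXX/YYY
    radosgw-admin metadata get bucket.instance:XXX/YYY:<bucket-instance-id>

The num_shards field in the bucket.instance output is what now reads 256.)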

HOWEVER, dynamic resharding still kept happening and bringing down the 
backends. I suspect it's because the old reshard op is still hanging around; it 
shows up when checking `reshard list`: https://pastebin.com/dPChwBCT
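
(For completeness, the inspection commands; I'm assuming `reshard status` also 
reflects the stuck entry, since it's not shown in the pastebin:

    radosgw-admin reshard list
    radosgw-admin reshard status --bucket="XXX/YYY")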

Since the resharding seems to have been successful when run manually, I now 
want to remove that stale reshard op, but I can't; trying gives me this error: 
https://pastebin.com/071kfAsa
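
(What I'm attempting, assuming `reshard cancel` is the intended way to drop a 
pending reshard op:

    radosgw-admin reshard cancel --bucket="XXX/YYY")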

For now, I've had to resort to setting rgw_dynamic_resharding = false in 
ceph.conf to stop the problem from occurring.
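
(Concretely, set on both rgw nodes followed by a restart of the radosgw 
service; the section name below is just an example of what ours looks like:

    [client.rgw.rgw1]
    rgw_dynamic_resharding = false)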

Ideas? 

Cheers
Erik
