We run a Ceph cluster with radosgw on top of it. During installation we never
specified any regions or zones, which means that every bucket currently
resides in the default region. To prepare for a federated config we built a
test cluster that replicates the current production setup, with the same
default region and zone. Once that setup was running we went through the
following steps to make the switch to a federated config. Our second zone is
completely empty to begin with.

1) We created a new region that includes the api_name, master_zone and 
endpoints for our two zones.
2) We created two users in zone1 and zone2 with the same access and secret key 
across the two zones.
3) We created two zones with default pools and specified the access and secret 
key.
4) We changed ceph.conf to include the new region and zone and pushed it to
our nodes.
5) The default region was set to our new region through radosgw-admin and the
old default region was removed.
6) The regionmap was updated to reflect the changes we made to our regions.
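
For anyone following along, the region definition from step 1 can be sketched
roughly as below. The region name 'ams' and the zone names zone1 and zone2 are
ours; the endpoint URLs are placeholders, and the key names are an assumption
based on the federated configuration guide for this release, so compare the
result with what radosgw-admin region get prints on your own cluster.

```python
import json


def build_region(name, master_zone, zone_endpoints):
    """Build a region definition as a plain dict.

    zone_endpoints maps each zone name to its list of endpoint URLs.
    The key names are an assumption taken from the federated config
    guide, not something verified against the radosgw source.
    """
    if master_zone not in zone_endpoints:
        raise ValueError("master_zone must be one of the configured zones")
    return {
        "name": name,
        "api_name": name,
        "is_master": "true",
        "endpoints": list(zone_endpoints[master_zone]),
        "master_zone": master_zone,
        "zones": [
            {
                "name": zone,
                "endpoints": list(endpoints),
                "log_meta": "true",
                "log_data": "true",
            }
            for zone, endpoints in zone_endpoints.items()
        ],
        "placement_targets": [{"name": "default-placement", "tags": []}],
        "default_placement": "default-placement",
    }


# Hypothetical endpoints; replace them with the real gateway URLs.
region = build_region(
    "ams",
    "zone1",
    {
        "zone1": ["http://rgw-zone1.example.com:80/"],
        "zone2": ["http://rgw-zone2.example.com:80/"],
    },
)
region_json = json.dumps(region, indent=2)
print(region_json)
```

The printed JSON would be saved to a file and loaded with radosgw-admin region
set --infile, followed by the regionmap update from step 6.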

This last step proved to be a little difficult, as radosgw-admin regionmap 
update returns:
7f7b36b7b840 -1 cannot update region map, master_region conflict

The master_region is set to 'ams' in both clusters.

We may run into issues later on because we solved this the 'hard way' by
changing the regionmap manually.
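
Before editing the map by hand, a quick check like the following might help to
spot which region is causing the conflict. This is only a sketch of our guess
at the cause: it assumes the conflict is raised when a region flagged as
master has a different name than the map's master_region, and it assumes the
JSON from radosgw-admin regionmap get has a "regions" list of key/val pairs.
The sample data is made up.

```python
def find_master_conflicts(regionmap):
    """Return names of regions flagged is_master that differ from master_region.

    Assumes the regionmap layout {"master_region": ..., "regions":
    [{"key": ..., "val": {...}}]}; adjust if your output looks different.
    """
    master = regionmap.get("master_region", "")
    conflicts = []
    for entry in regionmap.get("regions", []):
        region = entry["val"]
        # is_master may be serialized as a string or as a JSON boolean.
        is_master = str(region.get("is_master", "false")).lower() == "true"
        if is_master and master and region["name"] != master:
            conflicts.append(region["name"])
    return conflicts


# Made-up sample: the old default region is still flagged as master.
sample = {
    "master_region": "ams",
    "regions": [
        {"key": "ams", "val": {"name": "ams", "is_master": "true"}},
        {"key": "default", "val": {"name": "default", "is_master": "true"}},
    ],
}
print(find_master_conflicts(sample))
```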

7) As we had changed our region and zones we restarted radosgw. As expected
this takes our objects offline.
8) We updated all buckets to sit in the new region.

Once the buckets were updated, all of our objects were back online again.

We have not made any changes to our pools. The new region points to the
existing pools, so this did not result in any physical movement of data. Once
this was all done the cluster was up and running, as expected, but serving its
content from the new zone.

At this point we set up radosgw-agent with the users from steps 2 and 3
matching our zones. The first time we did this we ran into a couple of
problems. The first was that the radosgw-agent available in the repository is
a little older than the one on GitHub. That version lacks a lot of exception
handling and proper error output, making it difficult to diagnose issues as
they come up. We switched to the latest version from GitHub, which has helped
us a lot to get where we are now. We also had to switch radosgw from sockets
to TCP first, but the manual didn't include a specific parameter, which led to
radosgw not being able to handle /-characters properly. Once we added
AllowEncodedSlashes it all magically worked.
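
To illustrate what that parameter is about: when a client percent-encodes the
slash in an object name, the request path contains %2F, and Apache refuses such
paths with a 404 unless AllowEncodedSlashes is switched on. The snippet below
only shows the encoding itself; whether radosgw-agent encodes keys exactly this
way is an assumption on our side.

```python
from urllib.parse import quote, unquote

key = "object/test.txt"

# With safe="" the slash is percent-encoded as well.
encoded = quote(key, safe="")
print(encoded)

# Decoding restores the original key.
assert unquote(encoded) == key
```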

As it took us quite some time and fiddling around to get to this point, we
wanted to replicate the exact same situation on another test environment to
make sure we know what to do when we change this in a live environment. And
this is where it all fails. We are unable to get this setup up and running
again. We've compared configurations and checked every single setting we've
played with, but we're unable to find what's going wrong. The error message is
pretty clear though:

2015-04-24 15:37:55,073 9406 [radosgw_agent.worker][DEBUG ] syncing object object/test.txt
2015-04-24 15:37:55,089 9406 [radosgw_agent.worker][DEBUG ] object "object/test.txt" not found on master, deleting from secondary

I was expecting to find this request in our Apache log files. Surely it would
trigger a 404. It turns out though that we're not seeing any log entries at
all: the request never shows up there. When I look at the logs in zone2,
though, I see the following:

[24/Apr/2015:15:45:01 +0000] "PUT /object/test.txt?rgwx-op-id=radosgw1%3A9727%3A1&rgwx-source-zone=zone1&rgwx-client-id=radosgw-agent HTTP/1.1" 404 242 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"
[24/Apr/2015:15:45:01 +0000] "GET /object/?max-keys=0 HTTP/1.1" 200 408 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"
[24/Apr/2015:15:45:01 +0000] "DELETE /object/test.txt HTTP/1.1" 204 126 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"

We're running ceph and radosgw 0.94.1. The agent comes from GitHub, as the one
in the repository doesn't seem entirely stable or very clear in its error
messages.
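
To pull the failing requests out of the zone2 access log quoted above, we used
something along these lines. It is a minimal sketch that assumes the log
format shown in this mail (timestamp in brackets, quoted request line, status,
size); the regular expression will need adjusting for a different LogFormat.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches: [time] "METHOD target HTTP/x.y" status size
LINE_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Split one access log line into method, path, query and status."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    parts = urlsplit(match.group("target"))
    return {
        "method": match.group("method"),
        "path": parts.path,
        "query": parse_qs(parts.query),
        "status": int(match.group("status")),
    }


# The three zone2 log lines from this mail, each joined onto one line.
LOG = [
    '[24/Apr/2015:15:45:01 +0000] "PUT /object/test.txt?rgwx-op-id=radosgw1%3A9727%3A1&rgwx-source-zone=zone1&rgwx-client-id=radosgw-agent HTTP/1.1" 404 242 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"',
    '[24/Apr/2015:15:45:01 +0000] "GET /object/?max-keys=0 HTTP/1.1" 200 408 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"',
    '[24/Apr/2015:15:45:01 +0000] "DELETE /object/test.txt HTTP/1.1" 204 126 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"',
]

records = [parse_line(line) for line in LOG]
failed = [r for r in records if r is not None and r["status"] >= 400]
for r in failed:
    print(r["method"], r["path"], r["status"], r["query"].get("rgwx-source-zone"))
```

The only failing request is the PUT that carries rgwx-source-zone=zone1, which
is the one the agent sends to make zone2 fetch the object.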

I'm sure we may be missing something, but it feels like radosgw-agent isn't
production ready yet. Any thoughts?

Thanks,
Thomas
