I also have a fairly small deployment of 14 nodes and 42 OSDs, but even I use
some automation. I do my OS installs and partitioning with PXE/kickstart, then
use Chef for the baseline install of the "normal" server stuff in our
environment and the admin accounts. The Ceph-specific pieces I handle by hand,
with ceph-deploy and some light wrapper scripts. Monitoring and alerting are
Sensu and Graphite. I tried Calamari, and it was nice, but it produced a lot of
load on the admin machine (especially relative to the amount of work it
actually had to do), and once I figured out how to get metrics into "normal"
Graphite, the appeal of a Ceph-specific tool was reduced substantially.

QH
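For anyone wondering what "getting metrics into normal Graphite" can look like
without Calamari or a dedicated collector, a minimal cron-driven sketch is
below. It assumes jq on the admin host, a Graphite plaintext listener on port
2003, and placeholder host/prefix names; the JSON field names that
ceph df --format json reports vary between releases, so check your own output
before trusting the jq filter.

  #!/bin/bash
  # Push a couple of basic cluster capacity metrics to Graphite's plaintext
  # listener. graphite.example.com and the metric prefix are placeholders.
  GRAPHITE_HOST="graphite.example.com"
  GRAPHITE_PORT=2003
  PREFIX="ceph.cluster"
  TS=$(date +%s)

  # Field names below match recent releases; older versions expose different
  # keys, so adjust the jq filter to your cluster's output.
  STATS=$(ceph df --format json | jq -r '.stats | "\(.total_bytes) \(.total_used_bytes)"')
  read -r TOTAL USED <<< "$STATS"

  {
    echo "${PREFIX}.total_bytes ${TOTAL} ${TS}"
    echo "${PREFIX}.used_bytes ${USED} ${TS}"
  } | nc -q 1 "${GRAPHITE_HOST}" "${GRAPHITE_PORT}"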
On Fri, Apr 17, 2015 at 1:07 PM, Steve Anthony <sma...@lehigh.edu> wrote:

> For reference, I'm currently running 26 nodes (338 OSDs); that will grow to
> 35 nodes (455 OSDs) in the near future.
>
> Node/OSD provisioning and replacements:
>
> Mostly I'm using ceph-deploy, at least for node/OSD adds and replacements.
> Right now the process is:
>
> Use FAI (http://fai-project.org) to set up software RAID1/LVM for the OS
> disks and do a minimal installation, including the salt-minion.
>
> Accept the new minion on the salt-master node and deploy the configuration:
> LDAP auth, nrpe, the diamond collector, udev configuration, a custom Python
> disk-add script, and everything on the Ceph preflight page
> (http://ceph.com/docs/firefly/start/quick-start-preflight/).
>
> Insert the journal SSDs into the case. Udev triggers my Python code, which
> partitions the SSDs and fires a Prowl alert (http://www.prowlapp.com/) to my
> phone when it's finished.
>
> Insert the OSD disks into the case. Again, udev triggers the Python code,
> which selects the next available partition on the journals, so OSDs land on
> journal1partA, journal2partA, journal3partA, journal1partB, ... for the
> three journals in each node. The code then fires a Salt event at the master
> node with the OSD device path, the journal /dev/disk/by-id path, and the
> node hostname. The Salt reactor on the master node takes this event and runs
> a script on the admin node which passes those parameters to ceph-deploy,
> which does the OSD deployment. A Prowl alert is sent on success or failure,
> with details.
>
> Similarly, when an OSD fails, I remove it and insert the new OSD; the same
> process as above occurs. Logical removal I do manually, since I'm not at a
> scale where it's common yet. Eventually, I imagine I'll write code to
> trigger OSD removal on certain events using the same event/reactor Salt
> framework.
>
> Pool/CRUSH management:
>
> Pool configuration and CRUSH management are mostly one-time operations.
> That is, I'll make a change rarely, and when I do it will persist in that
> new state for a long time. Given that, and the fact that I can make the
> changes from one node and inject them into the cluster, I haven't needed to
> automate that portion of Ceph as I've added more nodes, at least not yet.
>
> Replacing journals:
>
> I haven't had to do this yet; I'd probably remove and re-add all the OSDs if
> it happened today, but I will be reading the post you linked.
>
> Upgrading releases:
>
> Change /etc/apt/sources.list.d/ceph.list to point at the new release and
> push it to all the nodes with Salt. Then run salt -N 'ceph' pkg.upgrade to
> upgrade the packages on all the nodes in the ceph nodegroup. Then use Salt
> to restart the monitors, and then the OSDs on each node, one by one.
> Finally, run the following on all nodes with Salt to verify that every
> monitor/OSD is using the new version:
>
> for i in /var/run/ceph/ceph-*.asok; do echo "$i"; ceph --admin-daemon "$i" version; done
>
> Node decommissioning:
>
> I have a script which enumerates all the OSDs on a given host and stores
> that list in a file. Another script (run by cron every 10 minutes) checks
> whether the cluster health is OK, and if so pops the next OSD from that file
> and executes the steps to remove it from the host, trickling the node out of
> service.
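A rough, shell-only approximation of that udev/Salt/ceph-deploy handoff is
below. Steve's actual helper is Python; the rule and script names, the event
tag, and the journal-selection line are all placeholders for his real logic.

  # /etc/udev/rules.d/90-ceph-disk-add.rules (hypothetical): run a handler
  # whenever a whole new disk appears.
  ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd*", \
      RUN+="/usr/local/sbin/ceph-disk-add %k"

  #!/bin/bash
  # /usr/local/sbin/ceph-disk-add (hypothetical handler on the OSD node):
  # pick the next free journal partition and notify the salt master.
  DEV="/dev/$1"
  JOURNAL="/dev/disk/by-id/$(pick_next_free_journal)"   # stand-in for the real selection logic
  salt-call event.send 'ceph/osd/add' \
      host="$(hostname -s)" dev="$DEV" journal="$JOURNAL"

  #!/bin/bash
  # Script the salt reactor runs on the admin node (reactor SLS not shown),
  # e.g. deploy-osd.sh <host> <data-dev> <journal-dev>:
  ceph-deploy osd create "$1:$2:$3"    # host:disk:journal, firefly-era syntax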
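"Inject them into the cluster" for CRUSH changes usually means the standard
decompile/edit/recompile cycle, which is short enough to show:

  ceph osd getcrushmap -o crush.bin      # grab the current map
  crushtool -d crush.bin -o crush.txt    # decompile it to text
  vi crush.txt                           # edit buckets/rules by hand
  crushtool -c crush.txt -o crush.new    # recompile
  ceph osd setcrushmap -i crush.new      # inject it back into the cluster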
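Strung together from the admin node, that upgrade sequence might look roughly
like the sketch below. The nodegroup name, hostnames, and sysvinit-style
service commands are assumptions (adjust for upstart or systemd), and in
practice you'd watch ceph -s between steps rather than trust a sleep.

  #!/bin/bash
  # Rolling upgrade and restart across a Salt nodegroup called "ceph".
  salt -N 'ceph' pkg.upgrade                          # upgrade packages everywhere

  for node in ceph-mon01 ceph-mon02 ceph-mon03; do    # placeholder monitor hosts
      salt "$node" cmd.run 'service ceph restart mon'
      sleep 30                                        # crude settle time
  done

  for node in ceph-osd01 ceph-osd02; do               # placeholder OSD hosts
      salt "$node" cmd.run 'service ceph restart osd'
      until ceph health | grep -q HEALTH_OK; do sleep 10; done
  done

  # Verify every daemon reports the new version via its admin socket.
  salt -N 'ceph' cmd.run \
      'for s in /var/run/ceph/ceph-*.asok; do echo "$s"; ceph --admin-daemon "$s" version; done'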
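And a sketch of that health-gated trickle-out cron job; the list file, its
one-OSD-id-per-line format, and the sysvinit service invocation are
assumptions. Note that marking an OSD out and removing it from CRUSH in the
same pass triggers two rounds of data movement, so some people reweight to
zero first instead.

  #!/bin/bash
  # Run from cron every 10 minutes: if the cluster is healthy, remove the
  # next OSD listed in the file; otherwise wait for recovery to finish.
  OSDLIST=/root/decommission/osds.txt

  ceph health | grep -q HEALTH_OK || exit 0

  OSD=$(head -n 1 "$OSDLIST")
  [ -z "$OSD" ] && exit 0
  sed -i '1d' "$OSDLIST"                    # pop it from the list

  ceph osd out "$OSD"                       # stop mapping data to it
  service ceph stop "osd.$OSD"              # sysvinit-era invocation
  ceph osd crush remove "osd.$OSD"          # drop it from the CRUSH map
  ceph auth del "osd.$OSD"                  # delete its key
  ceph osd rm "$OSD"                        # remove the OSD id itself
  umount "/var/lib/ceph/osd/ceph-$OSD" 2>/dev/null || true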
> On 04/17/2015 02:18 PM, Craig Lewis wrote:
>
>> I'm running a small cluster, but I'll chime in since nobody else has.
>>
>> CERN had a presentation a while ago (dumpling time-frame) about their
>> deployment. They go over some of your questions:
>> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
>>
>> My philosophy on config management is that it should save me time. If it's
>> going to take me longer to write a recipe to do something, I'll just do it
>> by hand. Since my cluster is small, there are many things I can do faster
>> by hand. This may or may not work for you, depending on your documentation
>> and repeatability requirements. For things that need to be documented, I'll
>> usually write the recipe anyway (I accept Chef recipes as documentation).
>>
>> For my clusters, I'm using Chef to set up all nodes and manage ceph.conf. I
>> manually manage my pools, CRUSH map, RadosGW users, and disk replacement. I
>> was using Chef to add new disks, but I ran into load problems due to my
>> small cluster size. I'm currently adding disks manually, to manage cluster
>> load better. As my cluster gets larger, that will be less important.
>>
>> I'm also doing upgrades manually, because it's less work than writing the
>> Chef recipe to do a cluster upgrade. Since Chef isn't cluster-aware, it
>> would be a pain to make the recipe cluster-aware enough to handle the
>> upgrade. And I figure if I stall long enough, somebody else will write it
>> :-) Ansible, with its cluster-wide coordination, looks like it would handle
>> that a bit better.
>>
>> On Wed, Apr 15, 2015 at 2:05 PM, Stillwell, Bryan
>> <bryan.stillw...@twcable.com> wrote:
>>
>>> I'm curious what people managing larger Ceph clusters are doing with
>>> configuration management and orchestration to simplify their lives?
>>>
>>> We've been using ceph-deploy to manage our Ceph clusters so far, but we
>>> feel that moving the management of our clusters to standard tools would
>>> provide a little more consistency and help prevent some of the mistakes
>>> that have happened while using ceph-deploy.
>>>
>>> We're looking at using the same tools we use in our OpenStack environment
>>> (puppet/ansible), but I'm interested in hearing from people using
>>> chef/salt/juju as well.
>>>
>>> Some of the cluster operation tasks that I can think of, along with
>>> ideas/concerns I have, are:
>>>
>>> Keyring management
>>> Seems like hiera-eyaml is a natural fit for storing the keyrings.
>>>
>>> ceph.conf
>>> I believe the puppet ceph module can be used to manage this file, but I'm
>>> wondering if using a template (erb?) might be a better method of keeping
>>> it organized and properly documented.
>>>
>>> Pool configuration
>>> The puppet module seems to be able to handle managing replicas and the
>>> number of placement groups, but I don't see support for erasure-coded
>>> pools yet. This is probably something we would want puppet to set up
>>> initially, but not something we would want puppet changing on a
>>> production cluster.
>>>
>>> CRUSH maps
>>> Describing the infrastructure in YAML makes sense: things like which
>>> servers are in which rows/racks/chassis. Describing the type of server
>>> (model, number of HDDs, number of SSDs) also makes sense.
>>>
>>> CRUSH rules
>>> I could see puppet managing the various rules based on the backend
>>> storage (HDD, SSD, primary affinity, erasure coding, etc.).
>>> Replacing a failed HDD
>>> Do you automatically identify the new drive and start using it right
>>> away? I've seen people talk about using a combination of udev and special
>>> GPT partition IDs to automate this. If you have a cluster with thousands
>>> of drives, I think automating the replacement makes sense. How do you
>>> handle the journal partition on the SSD? Does removing the old journal
>>> partition and creating a new one leave a hole in the partition map
>>> (because the old partition is removed and the new one is created at the
>>> end of the drive)?
>>>
>>> Replacing a failed SSD journal
>>> Has anyone automated recreating the journal drive using Sebastien Han's
>>> instructions, or do you have to rebuild all the OSDs as well?
>>> http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
>>>
>>> Adding new OSD servers
>>> How are you adding multiple new OSD servers to the cluster? A useful
>>> ansible playbook here might disable nobackfill, noscrub, and nodeep-scrub,
>>> then add all the OSDs to the cluster.
>>>
>>> Upgrading releases
>>> I've found an ansible playbook for doing a rolling upgrade which looks
>>> like it would work well, but are there other methods people are using?
>>> http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansible/
>>>
>>> Decommissioning hardware
>>> Seems like another ansible playbook for reducing the OSD weights to zero,
>>> marking the OSDs out, stopping the services, removing the OSD IDs,
>>> removing the CRUSH entries, unmounting the drives, and finally removing
>>> the server would be the best method here. Any other ideas on how to
>>> approach this?
>>>
>>> That's all I can think of right now. Are there any other tasks that
>>> people have run into that are missing from this list?
>>>
>>> Thanks,
>>> Bryan
>
> --
> Steve Anthony
> LTS HPC Support Specialist
> Lehigh University
> sma...@lehigh.edu
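On the SSD journal question: the core of Sebastien Han's recovery procedure,
for a single OSD, is small enough to sketch here. The OSD id, the new
partition path, and the sysvinit service calls are placeholders, and if the
old SSD is completely dead the flush step will fail; the post linked above
covers that case.

  #!/bin/bash
  ID=12                                       # placeholder OSD id
  NEW_JOURNAL=/dev/disk/by-partuuid/REPLACE   # placeholder new SSD partition

  service ceph stop "osd.$ID"                 # may already be down
  ceph-osd -i "$ID" --flush-journal           # flush whatever is left in the old journal
  ln -sf "$NEW_JOURNAL" "/var/lib/ceph/osd/ceph-$ID/journal"
  ceph-osd -i "$ID" --mkjournal               # build a fresh journal on the new partition
  service ceph start "osd.$ID"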
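On adding multiple OSD servers: whichever tool does the actual adds, the flag
juggling around them is just a handful of ceph osd set/unset calls, which is
easy to wrap in a playbook or a plain script:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph osd set nobackfill      # hold off backfill until every new OSD is in

  # ... add the new OSDs here with ceph-deploy / puppet / ansible ...

  ceph osd unset nobackfill    # now let the data move in one pass
  ceph osd unset nodeep-scrub
  ceph osd unset noscrub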