I also have a fairly small deployment of 14 nodes and 42 OSDs, but even I use
some automation. I do my OS installs and partitioning with PXE/kickstart, then
use Chef for the baseline install of the "normal" server stuff in our
environment and the admin accounts. The Ceph-specific pieces I handle by hand,
with ceph-deploy and some light wrapper scripts. Monitoring and alerting are
Sensu and Graphite. I tried Calamari, and it was nice, but it produced a lot of
load on the admin machine (especially relative to the amount of work it
actually had to do), and once I figured out how to get metrics into "normal"
Graphite, the appeal of a Ceph-specific tool was reduced substantially.

QH
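For anyone wondering what "getting metrics into normal Graphite" can look like
without Calamari or a dedicated collector, a minimal cron-driven sketch is
below. It assumes jq on the admin host, a Graphite plaintext listener on port
2003, and placeholder host/prefix names; the JSON field names that
ceph df --format json reports vary between releases, so check your own output
before trusting the jq filter.

  #!/bin/bash
  # Push a couple of basic cluster capacity metrics to Graphite's plaintext
  # listener. graphite.example.com and the metric prefix are placeholders.
  GRAPHITE_HOST="graphite.example.com"
  GRAPHITE_PORT=2003
  PREFIX="ceph.cluster"
  TS=$(date +%s)

  # Field names below match recent releases; older versions expose different
  # keys, so adjust the jq filter to your cluster's output.
  STATS=$(ceph df --format json | jq -r '.stats | "\(.total_bytes) \(.total_used_bytes)"')
  read -r TOTAL USED <<< "$STATS"

  {
    echo "${PREFIX}.total_bytes ${TOTAL} ${TS}"
    echo "${PREFIX}.used_bytes ${USED} ${TS}"
  } | nc -q 1 "${GRAPHITE_HOST}" "${GRAPHITE_PORT}"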
On Fri, Apr 17, 2015 at 1:07 PM, Steve Anthony <sma...@lehigh.edu> wrote:

> For reference, I'm currently running 26 nodes (338 OSDs); that will grow to
> 35 nodes (455 OSDs) in the near future.
>
> Node/OSD provisioning and replacements:
>
> Mostly I'm using ceph-deploy, at least for node/OSD adds and replacements.
> Right now the process is:
>
> Use FAI (http://fai-project.org) to set up software RAID1/LVM for the OS
> disks and do a minimal installation, including the salt-minion.
>
> Accept the new minion on the salt-master node and deploy the configuration:
> LDAP auth, nrpe, the diamond collector, udev configuration, a custom Python
> disk-add script, and everything on the Ceph preflight page
> (http://ceph.com/docs/firefly/start/quick-start-preflight/).
>
> Insert the journal SSDs into the case. Udev triggers my Python code, which
> partitions the SSDs and fires a Prowl alert (http://www.prowlapp.com/) to my
> phone when it's finished.
>
> Insert the OSD disks into the case. Again, udev triggers the Python code,
> which selects the next available partition on the journals, so OSDs land on
> journal1partA, journal2partA, journal3partA, journal1partB, ... for the
> three journals in each node. The code then fires a Salt event at the master
> node with the OSD device path, the journal /dev/disk/by-id path, and the
> node hostname. The Salt reactor on the master node takes this event and runs
> a script on the admin node which passes those parameters to ceph-deploy,
> which does the OSD deployment. A Prowl alert is sent on success or failure,
> with details.
>
> Similarly, when an OSD fails, I remove it and insert the new OSD; the same
> process as above occurs. Logical removal I do manually, since I'm not at a
> scale where it's common yet. Eventually, I imagine I'll write code to
> trigger OSD removal on certain events using the same event/reactor Salt
> framework.
>
> Pool/CRUSH management:
>
> Pool configuration and CRUSH management are mostly one-time operations.
> That is, I'll make a change rarely, and when I do it will persist in that
> new state for a long time. Given that, and the fact that I can make the
> changes from one node and inject them into the cluster, I haven't needed to
> automate that portion of Ceph as I've added more nodes, at least not yet.
>
> Replacing journals:
>
> I haven't had to do this yet; I'd probably remove and re-add all the OSDs if
> it happened today, but I will be reading the post you linked.
>
> Upgrading releases:
>
> Change /etc/apt/sources.list.d/ceph.list to point at the new release and
> push it to all the nodes with Salt. Then run salt -N 'ceph' pkg.upgrade to
> upgrade the packages on all the nodes in the ceph nodegroup. Then use Salt
> to restart the monitors, and then the OSDs on each node, one by one.
> Finally, run the following on all nodes with Salt to verify that every
> monitor/OSD is using the new version:
>
> for i in /var/run/ceph/ceph-*.asok; do echo "$i"; ceph --admin-daemon "$i" version; done
>
> Node decommissioning:
>
> I have a script which enumerates all the OSDs on a given host and stores
> that list in a file. Another script (run by cron every 10 minutes) checks
> whether the cluster health is OK, and if so pops the next OSD from that file
> and executes the steps to remove it from the host, trickling the node out of
> service.
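A rough, shell-only approximation of that udev/Salt/ceph-deploy handoff is
below. Steve's actual helper is Python; the rule and script names, the event
tag, and the journal-selection line are all placeholders for his real logic.

  # /etc/udev/rules.d/90-ceph-disk-add.rules (hypothetical): run a handler
  # whenever a whole new disk appears.
  ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd*", \
      RUN+="/usr/local/sbin/ceph-disk-add %k"

  #!/bin/bash
  # /usr/local/sbin/ceph-disk-add (hypothetical handler on the OSD node):
  # pick the next free journal partition and notify the salt master.
  DEV="/dev/$1"
  JOURNAL="/dev/disk/by-id/$(pick_next_free_journal)"   # stand-in for the real selection logic
  salt-call event.send 'ceph/osd/add' \
      host="$(hostname -s)" dev="$DEV" journal="$JOURNAL"

  #!/bin/bash
  # Script the salt reactor runs on the admin node (reactor SLS not shown),
  # e.g. deploy-osd.sh <host> <data-dev> <journal-dev>:
  ceph-deploy osd create "$1:$2:$3"    # host:disk:journal, firefly-era syntax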
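"Inject them into the cluster" for CRUSH changes usually means the standard
decompile/edit/recompile cycle, which is short enough to show:

  ceph osd getcrushmap -o crush.bin      # grab the current map
  crushtool -d crush.bin -o crush.txt    # decompile it to text
  vi crush.txt                           # edit buckets/rules by hand
  crushtool -c crush.txt -o crush.new    # recompile
  ceph osd setcrushmap -i crush.new      # inject it back into the cluster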
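Strung together from the admin node, that upgrade sequence might look roughly
like the sketch below. The nodegroup name, hostnames, and sysvinit-style
service commands are assumptions (adjust for upstart or systemd), and in
practice you'd watch ceph -s between steps rather than trust a sleep.

  #!/bin/bash
  # Rolling upgrade and restart across a Salt nodegroup called "ceph".
  salt -N 'ceph' pkg.upgrade                          # upgrade packages everywhere

  for node in ceph-mon01 ceph-mon02 ceph-mon03; do    # placeholder monitor hosts
      salt "$node" cmd.run 'service ceph restart mon'
      sleep 30                                        # crude settle time
  done

  for node in ceph-osd01 ceph-osd02; do               # placeholder OSD hosts
      salt "$node" cmd.run 'service ceph restart osd'
      until ceph health | grep -q HEALTH_OK; do sleep 10; done
  done

  # Verify every daemon reports the new version via its admin socket.
  salt -N 'ceph' cmd.run \
      'for s in /var/run/ceph/ceph-*.asok; do echo "$s"; ceph --admin-daemon "$s" version; done'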
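And a sketch of that health-gated trickle-out cron job; the list file, its
one-OSD-id-per-line format, and the sysvinit service invocation are
assumptions. Note that marking an OSD out and removing it from CRUSH in the
same pass triggers two rounds of data movement, so some people reweight to
zero first instead.

  #!/bin/bash
  # Run from cron every 10 minutes: if the cluster is healthy, remove the
  # next OSD listed in the file; otherwise wait for recovery to finish.
  OSDLIST=/root/decommission/osds.txt

  ceph health | grep -q HEALTH_OK || exit 0

  OSD=$(head -n 1 "$OSDLIST")
  [ -z "$OSD" ] && exit 0
  sed -i '1d' "$OSDLIST"                    # pop it from the list

  ceph osd out "$OSD"                       # stop mapping data to it
  service ceph stop "osd.$OSD"              # sysvinit-era invocation
  ceph osd crush remove "osd.$OSD"          # drop it from the CRUSH map
  ceph auth del "osd.$OSD"                  # delete its key
  ceph osd rm "$OSD"                        # remove the OSD id itself
  umount "/var/lib/ceph/osd/ceph-$OSD" 2>/dev/null || true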
> On 04/17/2015 02:18 PM, Craig Lewis wrote:
>
>> I'm running a small cluster, but I'll chime in since nobody else has.
>>
>> CERN had a presentation a while ago (dumpling time-frame) about their
>> deployment. They go over some of your questions:
>> http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
>>
>> My philosophy on config management is that it should save me time. If it's
>> going to take me longer to write a recipe to do something, I'll just do it
>> by hand. Since my cluster is small, there are many things I can do faster
>> by hand. This may or may not work for you, depending on your documentation
>> and repeatability requirements. For things that need to be documented, I'll
>> usually write the recipe anyway (I accept Chef recipes as documentation).
>>
>> For my clusters, I'm using Chef to set up all nodes and manage ceph.conf. I
>> manually manage my pools, CRUSH map, RadosGW users, and disk replacement. I
>> was using Chef to add new disks, but I ran into load problems due to my
>> small cluster size. I'm currently adding disks manually, to manage cluster
>> load better. As my cluster gets larger, that will be less important.
>>
>> I'm also doing upgrades manually, because it's less work than writing the
>> Chef recipe to do a cluster upgrade. Since Chef isn't cluster-aware, it
>> would be a pain to make the recipe cluster-aware enough to handle the
>> upgrade. And I figure if I stall long enough, somebody else will write it
>> :-) Ansible, with its cluster-wide coordination, looks like it would handle
>> that a bit better.
>>
>> On Wed, Apr 15, 2015 at 2:05 PM, Stillwell, Bryan
>> <bryan.stillw...@twcable.com> wrote:
>>
>>> I'm curious what people managing larger Ceph clusters are doing with
>>> configuration management and orchestration to simplify their lives?
>>>
>>> We've been using ceph-deploy to manage our Ceph clusters so far, but we
>>> feel that moving the management of our clusters to standard tools would
>>> provide a little more consistency and help prevent some of the mistakes
>>> that have happened while using ceph-deploy.
>>>
>>> We're looking at using the same tools we use in our OpenStack environment
>>> (puppet/ansible), but I'm interested in hearing from people using
>>> chef/salt/juju as well.
>>>
>>> Some of the cluster operation tasks that I can think of, along with
>>> ideas/concerns I have, are:
>>>
>>> Keyring management
>>> Seems like hiera-eyaml is a natural fit for storing the keyrings.
>>>
>>> ceph.conf
>>> I believe the puppet ceph module can be used to manage this file, but I'm
>>> wondering if using a template (erb?) might be a better method of keeping
>>> it organized and properly documented.
>>>
>>> Pool configuration
>>> The puppet module seems to be able to handle managing replicas and the
>>> number of placement groups, but I don't see support for erasure-coded
>>> pools yet. This is probably something we would want puppet to set up
>>> initially, but not something we would want puppet changing on a
>>> production cluster.
>>>
>>> CRUSH maps
>>> Describing the infrastructure in YAML makes sense: things like which
>>> servers are in which rows/racks/chassis. Describing the type of server
>>> (model, number of HDDs, number of SSDs) also makes sense.
>>>
>>> CRUSH rules
>>> I could see puppet managing the various rules based on the backend
>>> storage (HDD, SSD, primary affinity, erasure coding, etc.).
>>> Replacing a failed HDD
>>> Do you automatically identify the new drive and start using it right
>>> away? I've seen people talk about using a combination of udev and special
>>> GPT partition IDs to automate this. If you have a cluster with thousands
>>> of drives, I think automating the replacement makes sense. How do you
>>> handle the journal partition on the SSD? Does removing the old journal
>>> partition and creating a new one leave a hole in the partition map
>>> (because the old partition is removed and the new one is created at the
>>> end of the drive)?
>>>
>>> Replacing a failed SSD journal
>>> Has anyone automated recreating the journal drive using Sebastien Han's
>>> instructions, or do you have to rebuild all the OSDs as well?
>>> http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
>>>
>>> Adding new OSD servers
>>> How are you adding multiple new OSD servers to the cluster? A useful
>>> ansible playbook here might disable nobackfill, noscrub, and nodeep-scrub,
>>> then add all the OSDs to the cluster.
>>>
>>> Upgrading releases
>>> I've found an ansible playbook for doing a rolling upgrade which looks
>>> like it would work well, but are there other methods people are using?
>>> http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansible/
>>>
>>> Decommissioning hardware
>>> Seems like another ansible playbook for reducing the OSD weights to zero,
>>> marking the OSDs out, stopping the services, removing the OSD IDs,
>>> removing the CRUSH entries, unmounting the drives, and finally removing
>>> the server would be the best method here. Any other ideas on how to
>>> approach this?
>>>
>>> That's all I can think of right now. Are there any other tasks that
>>> people have run into that are missing from this list?
>>>
>>> Thanks,
>>> Bryan
>
> --
> Steve Anthony
> LTS HPC Support Specialist
> Lehigh University
> sma...@lehigh.edu
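On the SSD journal question: the core of Sebastien Han's recovery procedure,
for a single OSD, is small enough to sketch here. The OSD id, the new
partition path, and the sysvinit service calls are placeholders, and if the
old SSD is completely dead the flush step will fail; the post linked above
covers that case.

  #!/bin/bash
  ID=12                                       # placeholder OSD id
  NEW_JOURNAL=/dev/disk/by-partuuid/REPLACE   # placeholder new SSD partition

  service ceph stop "osd.$ID"                 # may already be down
  ceph-osd -i "$ID" --flush-journal           # flush whatever is left in the old journal
  ln -sf "$NEW_JOURNAL" "/var/lib/ceph/osd/ceph-$ID/journal"
  ceph-osd -i "$ID" --mkjournal               # build a fresh journal on the new partition
  service ceph start "osd.$ID"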
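On adding multiple OSD servers: whichever tool does the actual adds, the flag
juggling around them is just a handful of ceph osd set/unset calls, which is
easy to wrap in a playbook or a plain script:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph osd set nobackfill      # hold off backfill until every new OSD is in

  # ... add the new OSDs here with ceph-deploy / puppet / ansible ...

  ceph osd unset nobackfill    # now let the data move in one pass
  ceph osd unset nodeep-scrub
  ceph osd unset noscrub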