Found some more info, but it's getting weird... All three OSD nodes show the same "unknown cluster" message on all of the OSD disks. I don't know where it came from; all of the nodes were configured with ceph-deploy from the admin node. In any case, the OSDs seem to be up and running and the cluster health is OK.
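For anyone hitting the same thing: a quick way to see where that "unknown cluster" fsid comes from is to compare the fsid in ceph.conf with the one recorded on each activated OSD. This is only a sketch and assumes the default ceph-deploy layout (/etc/ceph/ceph.conf, /var/lib/ceph/osd/ceph-*):

  # Compare the fsid ceph.conf advertises with the cluster fsid stored on each OSD
  grep fsid /etc/ceph/ceph.conf
  for osd in /var/lib/ceph/osd/ceph-*; do
      echo -n "$osd: "
      cat "$osd/ceph_fsid"
  done

If the two values differ, ceph-disk activation cannot match the OSD back to the local ceph.conf, which is consistent with the activation error quoted further down.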
No ceph-disk@ services are running on any of the OSD nodes, which I hadn't noticed before. Each node was set up exactly the same way, yet there are different services listed under systemctl:

OSD NODE 1: Output in earlier email

OSD NODE 2:
● ceph-disk@dev-sdb1.service loaded failed failed Ceph disk activation: /dev/sdb1
● ceph-disk@dev-sdb2.service loaded failed failed Ceph disk activation: /dev/sdb2
● ceph-disk@dev-sdb5.service loaded failed failed Ceph disk activation: /dev/sdb5
● ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
● ceph-disk@dev-sdc4.service loaded failed failed Ceph disk activation: /dev/sdc4

OSD NODE 3:
● ceph-disk@dev-sdb1.service loaded failed failed Ceph disk activation: /dev/sdb1
● ceph-disk@dev-sdb3.service loaded failed failed Ceph disk activation: /dev/sdb3
● ceph-disk@dev-sdb4.service loaded failed failed Ceph disk activation: /dev/sdb4
● ceph-disk@dev-sdb5.service loaded failed failed Ceph disk activation: /dev/sdb5
● ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
● ceph-disk@dev-sdc3.service loaded failed failed Ceph disk activation: /dev/sdc3
● ceph-disk@dev-sdc4.service loaded failed failed Ceph disk activation: /dev/sdc4

From my understanding, the disks have already been activated... Should these services even be running or enabled?

Mike

On Tue, Nov 29, 2016 at 6:33 PM, Mike Jacobacci <mi...@flowjo.com> wrote:
> Sorry about that... Here is the output of ceph-disk list:
>
> ceph-disk list
> /dev/dm-0 other, xfs, mounted on /
> /dev/dm-1 swap, swap
> /dev/dm-2 other, xfs, mounted on /home
> /dev/sda :
>  /dev/sda2 other, LVM2_member
>  /dev/sda1 other, xfs, mounted on /boot
> /dev/sdb :
>  /dev/sdb1 ceph journal
>  /dev/sdb2 ceph journal
>  /dev/sdb3 ceph journal
>  /dev/sdb4 ceph journal
>  /dev/sdb5 ceph journal
> /dev/sdc :
>  /dev/sdc1 ceph journal
>  /dev/sdc2 ceph journal
>  /dev/sdc3 ceph journal
>  /dev/sdc4 ceph journal
>  /dev/sdc5 ceph journal
> /dev/sdd :
>  /dev/sdd1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.0
> /dev/sde :
>  /dev/sde1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.1
> /dev/sdf :
>  /dev/sdf1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.2
> /dev/sdg :
>  /dev/sdg1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.3
> /dev/sdh :
>  /dev/sdh1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.4
> /dev/sdi :
>  /dev/sdi1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.5
> /dev/sdj :
>  /dev/sdj1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.6
> /dev/sdk :
>  /dev/sdk1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.7
> /dev/sdl :
>  /dev/sdl1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.8
> /dev/sdm :
>  /dev/sdm1 ceph data, active, unknown cluster e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9, osd.9
>
> On Tue, Nov 29, 2016 at 6:32 PM, Mike Jacobacci <mi...@flowjo.com> wrote:
>
>> I forgot to add:
>>
>> On Tue, Nov 29, 2016 at 6:28 PM, Mike Jacobacci <mi...@flowjo.com> wrote:
>>
>>> So it looks like the journal partition is mounted:
>>>
>>> ls -lah /var/lib/ceph/osd/ceph-0/journal
>>> lrwxrwxrwx. 1 ceph ceph 9 Oct 10 16:11 /var/lib/ceph/osd/ceph-0/journal -> /dev/sdb1
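One note on the journal check quoted above: the journal is a symlink to the raw partition rather than a mounted filesystem, so listing the link targets on each node is usually enough to see the mapping. A rough sketch, assuming the default /var/lib/ceph/osd/ceph-* layout:

  # Show which journal partition each OSD on this node points at
  for osd in /var/lib/ceph/osd/ceph-*; do
      echo -n "$osd -> "
      readlink -f "$osd/journal"
  done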
>>>
>>> Here is the output of journalctl -xe when I try to start the ceph-disk@dev-sdb1 service:
>>>
>>> sh[17481]: mount_activate: Failed to activate
>>> sh[17481]: unmount: Unmounting /var/lib/ceph/tmp/mnt.m9ek7W
>>> sh[17481]: command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.m9ek7W
>>> sh[17481]: Traceback (most recent call last):
>>> sh[17481]: File "/usr/sbin/ceph-disk", line 9, in <module>
>>> sh[17481]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5011, in run
>>> sh[17481]: main(sys.argv[1:])
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4962, in main
>>> sh[17481]: args.func(args)
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4720, in <lambda>
>>> sh[17481]: func=lambda args: main_activate_space(name, args),
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3739, in main_activate_space
>>> sh[17481]: reactivate=args.reactivate,
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3073, in mount_activate
>>> sh[17481]: (osd_id, cluster) = activate(path, activate_key_template, init)
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3220, in activate
>>> sh[17481]: ' with fsid %s' % ceph_fsid)
>>> sh[17481]: ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
>>> sh[17481]: Traceback (most recent call last):
>>> sh[17481]: File "/usr/sbin/ceph-disk", line 9, in <module>
>>> sh[17481]: load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5011, in run
>>> sh[17481]: main(sys.argv[1:])
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4962, in main
>>> sh[17481]: args.func(args)
>>> sh[17481]: File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4399, in main_trigger
>>> sh[17481]: raise Error('return code ' + str(ret))
>>> sh[17481]: ceph_disk.main.Error: Error: return code 1
>>> systemd[1]: ceph-disk@dev-sdb1.service: main process exited, code=exited, status=1/FAILURE
>>> systemd[1]: Failed to start Ceph disk activation: /dev/sdb1.
>>>
>>> I don't understand this error:
>>> ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9
>>>
>>> My fsid in ceph.conf is:
>>> fsid = 75d6dba9-2144-47b1-87ef-1fe21d3c58a8
>>>
>>> I don't know why the fsid would change or be different. I thought I had a basic cluster setup; I don't understand what's going wrong.
>>>
>>> Mike
>>>
>>> On Tue, Nov 29, 2016 at 5:15 PM, Mike Jacobacci <mi...@flowjo.com> wrote:
>>>
>>>> Hi John,
>>>>
>>>> Thanks, I wasn't sure if something had happened to the journal partitions or not.
>>>>
>>>> Right now, the ceph-osd@0-9 services are back up and the cluster health is good, but none of the ceph-disk@dev-sd* services are running. How can I get the journal partitions mounted again?
>>>>
>>>> Cheers,
>>>> Mike
>>>>
>>>> On Tue, Nov 29, 2016 at 4:30 PM, John Petrini <jpetr...@coredial.com> wrote:
>>>>
>>>>> Also, don't run sgdisk again; that's just for creating the journal partitions.
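If you want to confirm that without rewriting anything, sgdisk can print the type GUID a partition already carries. A read-only sketch, using the devices and the Ceph journal type GUID from the sgdisk commands quoted further down:

  # Print each journal partition's existing type GUID; it should already be
  # the Ceph journal type 45b0969e-9b03-4f30-b4c6-b4b80ceff106
  sgdisk -i 1 /dev/sdb
  sgdisk -i 2 /dev/sdb
  sgdisk -i 1 /dev/sdc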
>>>>> ceph-disk is a service used for prepping disks; only the OSD services need to be running as far as I know. Are the ceph-osd@x services running now that you've mounted the disks?
>>>>>
>>>>> ___
>>>>> John Petrini
>>>>> NOC Systems Administrator // CoreDial, LLC // coredial.com
>>>>>
>>>>> On Tue, Nov 29, 2016 at 7:27 PM, John Petrini <jpetr...@coredial.com> wrote:
>>>>>
>>>>>> What command are you using to start your OSDs?
>>>>>>
>>>>>> ___
>>>>>> John Petrini
>>>>>> NOC Systems Administrator // CoreDial, LLC // coredial.com
>>>>>>
>>>>>> On Tue, Nov 29, 2016 at 7:19 PM, Mike Jacobacci <mi...@flowjo.com> wrote:
>>>>>>
>>>>>>> I was able to bring the OSDs up by looking at my other OSD node, which is the exact same hardware/disks, and working out which disks map to which OSDs. But I still can't bring up any of the ceph-disk@dev-sd* services...
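To see exactly which units are up, which are failed, and which are enabled on a node, systemd can answer directly. A sketch, where the unit patterns come from this thread and osd.0 is just an example ID:

  # List every ceph-osd@ and ceph-disk@ unit, including inactive/failed ones
  systemctl list-units 'ceph-osd@*' 'ceph-disk@*' --all --no-pager
  # Drill into one OSD
  systemctl status ceph-osd@0 --no-pager
  systemctl is-enabled ceph-osd@0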
>>>>>>> When I first installed the cluster and got the OSDs up, I had to run the following:
>>>>>>>
>>>>>>> # sgdisk -t 1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
>>>>>>> # sgdisk -t 2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
>>>>>>> # sgdisk -t 3:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
>>>>>>> # sgdisk -t 4:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
>>>>>>> # sgdisk -t 5:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
>>>>>>> # sgdisk -t 1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
>>>>>>> # sgdisk -t 2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
>>>>>>> # sgdisk -t 3:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
>>>>>>> # sgdisk -t 4:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
>>>>>>> # sgdisk -t 5:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
>>>>>>>
>>>>>>> Do I need to run that again?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mike
>>>>>>>
>>>>>>> On Tue, Nov 29, 2016 at 4:13 PM, Sean Redmond <sean.redmo...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Normally they mount based upon the GPT label; if that's not working, you can mount the disk under /mnt and then cat the file called whoami to find out the OSD number.
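Spelled out, that suggestion looks roughly like this; the device name is only an example, and the read-only mount option is an extra precaution rather than part of the original advice:

  # Mount one data partition read-only, read the OSD id it belongs to, unmount
  mount -o ro /dev/sdd1 /mnt
  cat /mnt/whoami
  umount /mnt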
>>>>>>>> On 29 Nov 2016 23:56, "Mike Jacobacci" <mi...@flowjo.com> wrote:
>>>>>>>>
>>>>>>>>> OK, I am in some trouble now and would love some help! After updating, none of the OSDs on the node will come back up:
>>>>>>>>>
>>>>>>>>> ● ceph-disk@dev-sdb1.service loaded failed failed Ceph disk activation: /dev/sdb1
>>>>>>>>> ● ceph-disk@dev-sdb2.service loaded failed failed Ceph disk activation: /dev/sdb2
>>>>>>>>> ● ceph-disk@dev-sdb3.service loaded failed failed Ceph disk activation: /dev/sdb3
>>>>>>>>> ● ceph-disk@dev-sdb4.service loaded failed failed Ceph disk activation: /dev/sdb4
>>>>>>>>> ● ceph-disk@dev-sdb5.service loaded failed failed Ceph disk activation: /dev/sdb5
>>>>>>>>> ● ceph-disk@dev-sdc1.service loaded failed failed Ceph disk activation: /dev/sdc1
>>>>>>>>> ● ceph-disk@dev-sdc2.service loaded failed failed Ceph disk activation: /dev/sdc2
>>>>>>>>> ● ceph-disk@dev-sdc3.service loaded failed failed Ceph disk activation: /dev/sdc3
>>>>>>>>> ● ceph-disk@dev-sdc4.service loaded failed failed Ceph disk activation: /dev/sdc4
>>>>>>>>> ● ceph-disk@dev-sdc5.service loaded failed failed Ceph disk activation: /dev/sdc5
>>>>>>>>> ● ceph-disk@dev-sdd1.service loaded failed failed Ceph disk activation: /dev/sdd1
>>>>>>>>> ● ceph-disk@dev-sde1.service loaded failed failed Ceph disk activation: /dev/sde1
>>>>>>>>> ● ceph-disk@dev-sdf1.service loaded failed failed Ceph disk activation: /dev/sdf1
>>>>>>>>> ● ceph-disk@dev-sdg1.service loaded failed failed Ceph disk activation: /dev/sdg1
>>>>>>>>> ● ceph-disk@dev-sdh1.service loaded failed failed Ceph disk activation: /dev/sdh1
>>>>>>>>> ● ceph-disk@dev-sdi1.service loaded failed failed Ceph disk activation: /dev/sdi1
>>>>>>>>> ● ceph-disk@dev-sdj1.service loaded failed failed Ceph disk activation: /dev/sdj1
>>>>>>>>> ● ceph-disk@dev-sdk1.service loaded failed failed Ceph disk activation: /dev/sdk1
>>>>>>>>> ● ceph-disk@dev-sdl1.service loaded failed failed Ceph disk activation: /dev/sdl1
>>>>>>>>> ● ceph-disk@dev-sdm1.service loaded failed failed Ceph disk activation: /dev/sdm1
>>>>>>>>> ● ceph-osd@0.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@1.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@2.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@3.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@4.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@5.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@6.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@7.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@8.service loaded failed failed Ceph object storage daemon
>>>>>>>>> ● ceph-osd@9.service loaded failed failed Ceph object storage daemon
>>>>>>>>>
>>>>>>>>> I did some searching and saw that the issue is that the disks aren't mounting... My question is how I can mount them correctly again (note: sdb and sdc are SSDs for cache). I am not sure which disk maps to ceph-osd@0 and so on. Also, can I add them to /etc/fstab as a workaround?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Mike
>>>>>>>>>
>>>>>>>>> On Tue, Nov 29, 2016 at 10:41 AM, Mike Jacobacci <mi...@flowjo.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I would like to install OS updates on the Ceph cluster and activate a second 10Gb port on the OSD nodes, so I wanted to verify the correct steps to perform maintenance on the cluster. We are only using RBD to back our XenServer VMs at this point, and our cluster consists of 3 OSD nodes, 3 mon nodes and 1 admin node... So would these be the correct steps:
>>>>>>>>>>
>>>>>>>>>> 1. Shut down VMs?
>>>>>>>>>> 2. Run "ceph osd set noout" on the admin node
>>>>>>>>>> 3. Install updates on each monitor node and reboot one at a time
>>>>>>>>>> 4. Install updates on the OSD nodes and activate the second 10Gb port, rebooting one OSD node at a time
>>>>>>>>>> 5. Once all nodes are back up, run "ceph osd unset noout"
>>>>>>>>>> 6. Bring VMs back online
>>>>>>>>>>
>>>>>>>>>> Does this sound correct?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Mike
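For reference, the maintenance flow described in that first message boils down to a couple of commands around the reboots. A minimal sketch, assuming it is run from a node with an admin keyring:

  # Stop CRUSH from marking rebooting OSDs out during maintenance
  ceph osd set noout
  # ...patch and reboot one node at a time, checking cluster state in between...
  ceph -s
  ceph osd tree
  # Once every node is back and all OSDs are up, clear the flag
  ceph osd unset noout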
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com