On Thu, May 16, 2019 at 3:55 PM Mark Lehrer <leh...@gmail.com> wrote:
> > Steps 3-6 are to get the drive lvm volume back
>
> How much longer will we have to deal with LVM? If we can migrate non-LVM
> drives from earlier versions, how about we give ceph-volume the ability to
> create non-LVM OSDs directly?

We aren't requiring LVM exclusively; there is, for example, a ZFS plugin
already, so something like plain partitions could be supported as a plugin
(one that would need to be developed). We are concentrating on LVM because
we think that is the way to go.

> On Thu, May 16, 2019 at 1:20 PM Tarek Zegar <tze...@us.ibm.com> wrote:
>
>> FYI for anyone interested, below is how to recover after someone removes
>> an NVMe drive (the first two steps show how mine were removed and brought
>> back). Steps 3-6 get the drive's LVM volume back AND get the OSD daemon
>> running for the drive.
>>
>> 1. echo 1 > /sys/block/nvme0n1/device/device/remove
>> 2. echo 1 > /sys/bus/pci/rescan
>> 3. vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
>>    ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
>> 4. ceph auth add osd.122 osd 'allow *' mon 'allow rwx' -i
>>    /var/lib/ceph/osd/ceph-122/keyring
>> 5. ceph-volume lvm activate --all
>> 6. You should see the drive somewhere in the ceph tree; move it to the
>>    right host (see the example just below this quoted message).
>>
>> Tarek
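Step 6 leaves the exact command implicit. A minimal sketch of relocating the
OSD in the CRUSH map, assuming the host name that appears in the log excerpts
further down is the correct CRUSH host bucket, and using a CRUSH weight of
roughly the drive's size in TiB (substitute your own host bucket and weight):

    # check where osd.122 currently sits in the CRUSH tree
    ceph osd tree
    # place it under the correct host bucket with an explicit weight
    ceph osd crush set osd.122 1.8 host=pok1-qz1-sr1-rk001-s20

If the OSD already carries the right weight, the weight argument simply
reasserts it; only the host= location changes.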
>> From: "Tarek Zegar" <tze...@us.ibm.com>
>> To: Alfredo Deza <ad...@redhat.com>
>> Cc: ceph-users <ceph-users@lists.ceph.com>
>> Date: 05/15/2019 10:32 AM
>> Subject: [EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error, recovered,
>> to restore OSD process
>> Sent by: "ceph-users" <ceph-users-boun...@lists.ceph.com>
>>
>> TL;DR: I activated the drive successfully but the daemon won't start; it
>> looks like it's complaining about the mon config, and I don't know why
>> (there is a valid ceph.conf on the host). Thoughts? I feel like it's
>> close. Thank you.
>>
>> I executed the command:
>>
>> ceph-volume lvm activate --all
>>
>> It found the drive and activated it:
>>
>> --> Activating OSD ID 122 FSID a151bea5-d123-45d9-9b08-963a511c042a
>> ....
>> --> ceph-volume lvm activate successful for osd ID: 122
>>
>> However, systemd would not start the OSD process 122:
>>
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15 14:16:13.862 7ffff1970700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15 14:16:13.862 7ffff116f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: failed to fetch mon config (--no-mon-config to skip)
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Main process exited, code=exited, status=1/FAILURE
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Failed with result 'exit-code'.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Service hold-off time over, scheduling restart.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Scheduled restart job, restart counter is at 3.
>> -- Subject: Automatic restarting of a unit has been scheduled
>> -- Defined-By: systemd
>> -- Support: http://www.ubuntu.com/support
>> --
>> -- Automatic restarting of the unit ceph-osd@122.service has been
>> -- scheduled, as the result for the configured Restart= setting for the
>> -- unit.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Stopped Ceph object storage daemon osd.122.
>> -- Subject: Unit ceph-osd@122.service has finished shutting down
>> -- Defined-By: systemd
>> -- Support: http://www.ubuntu.com/support
>> --
>> -- Unit ceph-osd@122.service has finished shutting down.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Start request repeated too quickly.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Failed with result 'exit-code'.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Failed to start Ceph object storage daemon osd.122.
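The handle_auth_bad_method / "failed to fetch mon config" errors above mean
the OSD could not authenticate to the monitors at startup; in this thread it
turned out to be a key problem, which the ceph auth add in step 4 of the
summary at the top addresses. A quick check, sketched under the assumption
that cephx is enabled and the keyring path from this thread is in use:

    # key the OSD will present at startup
    cat /var/lib/ceph/osd/ceph-122/keyring
    # key and caps the monitors currently hold for osd.122
    ceph auth get osd.122
    # try talking to the monitors as osd.122; failing here points at the key
    ceph -n osd.122 --keyring /var/lib/ceph/osd/ceph-122/keyring -s

If the two keys differ, or the second command reports no entry, re-register
the key as in step 4 above before restarting ceph-osd@122.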
>> From: Alfredo Deza <ad...@redhat.com>
>> To: Bob R <b...@drinksbeer.org>
>> Cc: Tarek Zegar <tze...@us.ibm.com>, ceph-users <ceph-users@lists.ceph.com>
>> Date: 05/15/2019 08:27 AM
>> Subject: [EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error, recovered,
>> to restore OSD process
>>
>> On Tue, May 14, 2019 at 7:24 PM Bob R <b...@drinksbeer.org> wrote:
>> >
>> > Does 'ceph-volume lvm list' show it? If so you can try to activate it
>> > with 'ceph-volume lvm activate 122
>> > 74b01ec2--124d--427d--9812--e437f90261d4'
>>
>> Good suggestion. If `ceph-volume lvm list` can see it, it can probably
>> activate it again. You can activate it with the OSD ID + OSD FSID, or do:
>>
>> ceph-volume lvm activate --all
>>
>> You didn't say if the OSD wasn't coming up after trying to start it (the
>> systemd unit should still be there for ID 122), or if you tried rebooting
>> and that OSD didn't come up.
>>
>> The systemd unit is tied to both the ID and FSID of the OSD, so it
>> shouldn't matter if the underlying device changed, since ceph-volume
>> ensures it is the right one every time it activates.
>>
>> > Bob
>> >
>> > On Tue, May 14, 2019 at 7:35 AM Tarek Zegar <tze...@us.ibm.com> wrote:
>> >>
>> >> Someone nuked an OSD that had 1-replica PGs. They accidentally did
>> >> echo 1 > /sys/block/nvme0n1/device/device/remove
>> >> We got it back by doing echo 1 > /sys/bus/pci/rescan
>> >> However, it re-enumerated as a different drive number (guess we didn't
>> >> have udev rules).
>> >> They restored the LVM volume (vgcfgrestore
>> >> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
>> >> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)
>> >>
>> >> lsblk
>> >> nvme0n2   259:9   0  1.8T  0  disk
>> >> ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4
>> >>           253:1   0  1.8T  0  lvm
>> >>
>> >> We are stuck here. How do we attach an OSD daemon to the drive? It was
>> >> OSD.122 previously.
>> >>
>> >> Thanks
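Even though the NVMe came back under a different device name, the OSD's
identity survives because ceph-volume records the osd id and osd fsid as LVM
tags on the data LV, and activation looks those tags up rather than the
device path. A short sketch of rediscovering them; the VG name and the FSID
are taken from earlier in this thread, so adjust them to your own volume:

    # the osd id / osd fsid are recorded as tags on the logical volume
    lvs -o lv_name,lv_tags ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
    # ceph-volume reads the same tags and reports them per OSD
    ceph-volume lvm list
    # then activate by id + fsid (or simply use --all)
    ceph-volume lvm activate 122 a151bea5-d123-45d9-9b08-963a511c042a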
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com