On Thu, May 16, 2019 at 3:55 PM Mark Lehrer <leh...@gmail.com> wrote:
> > Steps 3-6 are to get the drive lvm volume back
>
> How much longer will we have to deal with LVM? If we can migrate non-LVM
> drives from earlier versions, how about we give ceph-volume the ability to
> create non-LVM OSDs directly?

We aren't requiring LVM exclusively; there is, for example, a ZFS plugin
already, so something like plain partitions could be supported as a plugin
(one that would need to be developed). We are concentrating on LVM because
we think that is the way to go.

> On Thu, May 16, 2019 at 1:20 PM Tarek Zegar <tze...@us.ibm.com> wrote:
>
>> FYI for anyone interested, below is how to recover after someone removes
>> an NVMe drive (the first two steps show how mine were removed and brought
>> back). Steps 3-6 get the drive's LVM volume back AND get the OSD daemon
>> running for the drive.
>>
>> 1. echo 1 > /sys/block/nvme0n1/device/device/remove
>> 2. echo 1 > /sys/bus/pci/rescan
>> 3. vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
>>    ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
>> 4. ceph auth add osd.122 osd 'allow *' mon 'allow rwx' -i
>>    /var/lib/ceph/osd/ceph-122/keyring
>> 5. ceph-volume lvm activate --all
>> 6. You should see the drive somewhere in the ceph tree; move it to the
>>    right host (see the example just below this quoted message).
>>
>> Tarek
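Step 6 leaves the exact command implicit. A minimal sketch of relocating the
OSD in the CRUSH map, assuming the host name that appears in the log excerpts
further down is the correct CRUSH host bucket, and using a CRUSH weight of
roughly the drive's size in TiB (substitute your own host bucket and weight):

    # check where osd.122 currently sits in the CRUSH tree
    ceph osd tree
    # place it under the correct host bucket with an explicit weight
    ceph osd crush set osd.122 1.8 host=pok1-qz1-sr1-rk001-s20

If the OSD already carries the right weight, the weight argument simply
reasserts it; only the host= location changes.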
>> From: "Tarek Zegar" <tze...@us.ibm.com>
>> To: Alfredo Deza <ad...@redhat.com>
>> Cc: ceph-users <ceph-users@lists.ceph.com>
>> Date: 05/15/2019 10:32 AM
>> Subject: [EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error, recovered,
>> to restore OSD process
>> Sent by: "ceph-users" <ceph-users-boun...@lists.ceph.com>
>>
>> TL;DR: I activated the drive successfully but the daemon won't start; it
>> looks like it's complaining about the mon config, and I don't know why
>> (there is a valid ceph.conf on the host). Thoughts? I feel like it's
>> close. Thank you.
>>
>> I executed the command:
>>
>> ceph-volume lvm activate --all
>>
>> It found the drive and activated it:
>>
>> --> Activating OSD ID 122 FSID a151bea5-d123-45d9-9b08-963a511c042a
>> ....
>> --> ceph-volume lvm activate successful for osd ID: 122
>>
>> However, systemd would not start the OSD process 122:
>>
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15 14:16:13.862 7ffff1970700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15 14:16:13.862 7ffff116f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: failed to fetch mon config (--no-mon-config to skip)
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Main process exited, code=exited, status=1/FAILURE
>> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Failed with result 'exit-code'.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Service hold-off time over, scheduling restart.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Scheduled restart job, restart counter is at 3.
>> -- Subject: Automatic restarting of a unit has been scheduled
>> -- Defined-By: systemd
>> -- Support: http://www.ubuntu.com/support
>> --
>> -- Automatic restarting of the unit ceph-osd@122.service has been
>> -- scheduled, as the result for the configured Restart= setting for the
>> -- unit.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Stopped Ceph object storage daemon osd.122.
>> -- Subject: Unit ceph-osd@122.service has finished shutting down
>> -- Defined-By: systemd
>> -- Support: http://www.ubuntu.com/support
>> --
>> -- Unit ceph-osd@122.service has finished shutting down.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Start request repeated too quickly.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Failed with result 'exit-code'.
>> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Failed to start Ceph object storage daemon osd.122.
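The handle_auth_bad_method / "failed to fetch mon config" errors above mean
the OSD could not authenticate to the monitors at startup; in this thread it
turned out to be a key problem, which the ceph auth add in step 4 of the
summary at the top addresses. A quick check, sketched under the assumption
that cephx is enabled and the keyring path from this thread is in use:

    # key the OSD will present at startup
    cat /var/lib/ceph/osd/ceph-122/keyring
    # key and caps the monitors currently hold for osd.122
    ceph auth get osd.122
    # try talking to the monitors as osd.122; failing here points at the key
    ceph -n osd.122 --keyring /var/lib/ceph/osd/ceph-122/keyring -s

If the two keys differ, or the second command reports no entry, re-register
the key as in step 4 above before restarting ceph-osd@122.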
>> From: Alfredo Deza <ad...@redhat.com>
>> To: Bob R <b...@drinksbeer.org>
>> Cc: Tarek Zegar <tze...@us.ibm.com>, ceph-users <ceph-users@lists.ceph.com>
>> Date: 05/15/2019 08:27 AM
>> Subject: [EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error, recovered,
>> to restore OSD process
>>
>> On Tue, May 14, 2019 at 7:24 PM Bob R <b...@drinksbeer.org> wrote:
>> >
>> > Does 'ceph-volume lvm list' show it? If so you can try to activate it
>> > with 'ceph-volume lvm activate 122
>> > 74b01ec2--124d--427d--9812--e437f90261d4'
>>
>> Good suggestion. If `ceph-volume lvm list` can see it, it can probably
>> activate it again. You can activate it with the OSD ID + OSD FSID, or do:
>>
>> ceph-volume lvm activate --all
>>
>> You didn't say if the OSD wasn't coming up after trying to start it (the
>> systemd unit should still be there for ID 122), or if you tried rebooting
>> and that OSD didn't come up.
>>
>> The systemd unit is tied to both the ID and FSID of the OSD, so it
>> shouldn't matter if the underlying device changed, since ceph-volume
>> ensures it is the right one every time it activates.
>>
>> > Bob
>> >
>> > On Tue, May 14, 2019 at 7:35 AM Tarek Zegar <tze...@us.ibm.com> wrote:
>> >>
>> >> Someone nuked an OSD that had 1-replica PGs. They accidentally did
>> >> echo 1 > /sys/block/nvme0n1/device/device/remove
>> >> We got it back by doing echo 1 > /sys/bus/pci/rescan
>> >> However, it re-enumerated as a different drive number (guess we didn't
>> >> have udev rules).
>> >> They restored the LVM volume (vgcfgrestore
>> >> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
>> >> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)
>> >>
>> >> lsblk
>> >> nvme0n2   259:9   0  1.8T  0  disk
>> >> ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4
>> >>           253:1   0  1.8T  0  lvm
>> >>
>> >> We are stuck here. How do we attach an OSD daemon to the drive? It was
>> >> OSD.122 previously.
>> >>
>> >> Thanks
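Even though the NVMe came back under a different device name, the OSD's
identity survives because ceph-volume records the osd id and osd fsid as LVM
tags on the data LV, and activation looks those tags up rather than the
device path. A short sketch of rediscovering them; the VG name and the FSID
are taken from earlier in this thread, so adjust them to your own volume:

    # the osd id / osd fsid are recorded as tags on the logical volume
    lvs -o lv_name,lv_tags ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
    # ceph-volume reads the same tags and reports them per OSD
    ceph-volume lvm list
    # then activate by id + fsid (or simply use --all)
    ceph-volume lvm activate 122 a151bea5-d123-45d9-9b08-963a511c042a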
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com