[ceph-users] Trouble removing MDS daemons | Luminous
Good day,

Firstly I'd like to acknowledge that I consider myself a Ceph noob.

OS: Ubuntu 16.04.3 LTS
Ceph version: 12.2.1

I'm running a small six-node POC cluster with three MDS daemons (one each on node1, node2 and node3). I've also configured three Ceph file systems: fsys1, fsys2 and fsys3. I'd like to remove two of the file systems (fsys2 and fsys3) and at least one, if not both, of the MDS daemons.

I was able to fail the MDS on node3 using the command "sudo ceph mds fail node3" followed by "sudo ceph mds rmfailed 0 --yes-i-really-mean-it". Then I removed the file system using "sudo ceph fs rm fsys3 --yes-i-really-mean-it". Running "sudo ceph fs status" confirms that fsys3 has failed and that the MDS daemon on node3 has become a standby MDS.

I've tried combinations of "ceph mds fail", "ceph mds deactivate", "ceph mds rm" and "ceph mds rmfailed", but I can't seem to remove the standby daemon. After rebooting node3 and running "sudo ceph fs status", fsys3 is no longer a listed file system and node3 is still a standby MDS.

I've searched for details on this topic but what I have found has not helped me. Could anybody assist with the correct steps for removing MDS daemons and Ceph file systems? It would also be useful to know how to completely remove all Ceph file systems and MDS daemons should I have no further use for them in a cluster.

Kind regards
Geoffrey Rhodes
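In Luminous a standby MDS stays listed for as long as its daemon is running, so the usual way to drop it is to stop the daemon on its host rather than via "ceph mds" commands. A possible sequence, assuming the daemons were deployed with ceph-deploy under systemd, the MDS IDs match the hostnames, and the cluster uses the default name "ceph":

  # on node3: stop and disable the standby MDS daemon
  sudo systemctl stop ceph-mds@node3
  sudo systemctl disable ceph-mds@node3
  # optionally remove its cephx key and on-disk state
  sudo ceph auth del mds.node3
  sudo rm -rf /var/lib/ceph/mds/ceph-node3

  # removing an unused file system: take it down, fail its rank, then remove it
  sudo ceph fs set fsys2 cluster_down true
  sudo ceph mds fail node2          # whichever daemon holds rank 0 of fsys2
  sudo ceph fs rm fsys2 --yes-i-really-mean-it

The data and metadata pools backing a removed file system can afterwards be deleted with "ceph osd pool delete" if they are no longer needed (the monitors must allow pool deletion).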
[ceph-users] CephFS | Mounting Second CephFS
Hi,

When running more than one CephFS, how would I specify which file system I want to mount with ceph-fuse or the kernel client?

OS: Ubuntu 16.04.3 LTS
Ceph version: 12.2.1 - Luminous

Kind regards
Geoffrey Rhodes
[ceph-users] CephFS - Mounting a second Ceph file system
Good day,

I'd like to run more than one Ceph file system in the same cluster. Can anybody point me in the right direction to explain how to mount the second file system?

Thanks

OS: Ubuntu 16.04.3 LTS
Ceph version: 12.2.1 - Luminous

Kind regards
Geoffrey Rhodes
Re: [ceph-users] CephFS - Mounting a second Ceph file system
Thanks John for the assistance.

Geoff

On 28 November 2017 at 16:30, John Spray wrote:
> On Tue, Nov 28, 2017 at 2:09 PM, Geoffrey Rhodes wrote:
> > Good day,
> >
> > I'd like to run more than one Ceph file system in the same cluster. [...]
>
> With the kernel mount you can use "-o mds_namespace=<fs name>" to specify
> which filesystem you want, and with the fuse client you have a
> --client_mds_namespace option.
>
> Cheers,
> John
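As a concrete illustration of the options John mentions, with the monitor address, mount points and secret-file path below being placeholders rather than values from the thread:

  # kernel client: select the file system with mds_namespace
  sudo mount -t ceph 10.0.0.1:6789:/ /mnt/fsys2 \
      -o name=admin,secretfile=/etc/ceph/admin.secret,mds_namespace=fsys2

  # fuse client: the equivalent option
  sudo ceph-fuse /mnt/fsys2 --client_mds_namespace=fsys2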
[ceph-users] Adding a host node back to ceph cluster
Good day,

I'm having an issue re-deploying a host back into my production Ceph cluster. Due to some bad memory (picked up by a scrub), which has since been replaced, I felt the need to re-install the host to be sure no host files were damaged.

Prior to decommissioning the host I set the crush weight on each OSD to 0. Once the OSDs had flushed all their data I stopped the daemons. I then purged the OSDs from the cluster with "ceph osd purge", followed by "ceph osd crush rm {host}" to remove the host bucket from the crush map. I also ran "ceph-deploy purge {host}" and "ceph-deploy purgedata {host}" from the management node.

I then reinstalled the host, made the necessary config changes and ran the appropriate ceph-deploy commands (ceph-deploy install..., ceph-deploy admin..., ceph-deploy osd create...) to bring the host and its OSDs back into the cluster - the same as I would when adding a new host node.

Running "ceph osd df tree" shows the OSDs, however the host node is not displayed. Inspecting the crush map I see that no host bucket has been created, nor are the host's OSDs listed under one. The OSDs also did not start, which explains the weight being 0, but I presume the OSDs not starting isn't the only issue since the crush map lacks the newly installed host's details.

Could anybody tell me where I've gone wrong? I'm also assuming there shouldn't be an issue using the same host name again. Or do I have to manually add the host bucket and OSD details back into the crush map, or should ceph-deploy take care of that?

Thanks

OS: Ubuntu 16.04.3 LTS
Ceph version: 12.2.1 / 12.2.2 - Luminous

Kind regards
Geoffrey Rhodes
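A sketch of one likely cause and a manual fallback, using placeholder host/OSD names and weights: by default a starting OSD creates or moves its own host bucket ("osd crush update on start", default true), so if that option is disabled in ceph.conf, or the OSDs never start, the bucket has to be added by hand:

  # create the missing host bucket and place it under the default root
  ceph osd crush add-bucket cephnode7 host
  ceph osd crush move cephnode7 root=default
  # attach each OSD to it with an appropriate weight (roughly its size in TiB)
  ceph osd crush set osd.12 1.81898 host=cephnode7

Separately, the reason the OSDs fail to start should be visible in their logs under /var/log/ceph/ on the host.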
Re: [ceph-users] Adding a host node back to ceph cluster
Thank you Marc, I wasn't aware that was an option so it will be very useful in future. I see for Ubuntu you can make use of debsums to verify packages.

Sadly I'm still looking for a solution to my host issue though.

Kind regards
Geoffrey Rhodes

On 15 January 2018 at 23:31, Marc Roos wrote:
>
> Maybe for the future:
>
> rpm {-V|--verify} [select-options] [verify-options]
>
>    Verifying a package compares information about the installed files in
>    the package with information about the files taken from the package
>    metadata stored in the rpm database. Among other things, verifying
>    compares the size, digest, permissions, type, owner and group of each
>    file. Any discrepancies are displayed. Files that were not installed
>    from the package, for example, documentation files excluded on
>    installation using the "--excludedocs" option, will be silently ignored.
>
> -----Original Message-----
> From: Geoffrey Rhodes [mailto:geoff...@rhodes.org.za]
> Sent: Monday 15 January 2018 16:39
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Adding a host node back to ceph cluster
>
> Good day,
>
> I'm having an issue re-deploying a host back into my production ceph
> cluster. [...]
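For the Debian/Ubuntu side mentioned above, a minimal sketch (the package names are examples and debsums itself may need installing first):

  sudo apt-get install debsums
  # verify the installed Ceph packages; -s prints only files whose checksums differ
  sudo debsums -s ceph-osd ceph-common ceph-base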
[ceph-users] Ceph OSD daemon causes network card issues
Hi Cephers,

I've been having an issue since upgrading my cluster to Mimic 6 months ago (previously installed with Luminous 12.2.1). All nodes that have the same PCIe network card seem to lose network connectivity randomly (the frequency ranges from a few days to weeks per host node). The affected nodes only have the Intel 82576 LAN card in common - different motherboards, installed drives, RAM and even PSUs. Nodes that have the Intel I350 cards are not affected by the Mimic upgrade. Each host node has the recommended RAM installed and has between 4 and 6 OSDs / SATA hard drives installed. The cluster operated for over a year (Luminous) without a single issue; only after the Mimic upgrade did the issues begin with these nodes. The cluster is only used for CephFS (file storage, low intensity usage) and makes use of an erasure-coded data pool (K=4, M=2).

I've tested many things: different kernel versions, different Ubuntu LTS releases, re-installation and even CentOS 7, different releases of Mimic, different igb drivers. If I stop the ceph-osd daemons the issue does not occur. If I swap out the Intel 82576 card for the Intel I350 the issue is resolved. I haven't any more ideas other than replacing the cards, but I feel the issue is linked to the ceph-osd daemon and a change in the Mimic release. Below are the various software versions and drivers I've tried and a log extract from a node that lost network connectivity. Any help or suggestions would be greatly appreciated.

OS:              Ubuntu 16.04 / 18.04 and recently CentOS 7
Ceph version:    Mimic (currently 13.2.6)
Network card:    4-port 1GB Intel 82576 LAN card (AOC-SG-I4)
Driver:          igb
Driver versions: 5.3.0-k / 5.3.5.22s / 5.4.0-k
Network config:  2 x bonded (LACP) 1GB NICs for public net, 2 x bonded (LACP) 1GB NICs for private net

Log errors:
Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb :03:00.0 enp3s0f0: PCIe link lost, device now detached
Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb :04:00.1 enp4s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb :03:00.1 enp3s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb :04:00.0 enp4s0f0: PCIe link lost, device now detached
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.4.1:6809 osd.16 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.6.1:6804 osd.20 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.7.1:6803 osd.25 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.8.1:6803 osd.30 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.9.1:6808 osd.43 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)

Kind regards
Geoffrey Rhodes
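A few generic, non-Ceph checks that could be run on an affected node, sketched with the interface and PCI addresses from the log above:

  # driver, firmware and PCIe link details for a suspect port
  ethtool -i enp3s0f0
  sudo lspci -vvv -s 03:00.0 | grep -i -e lnkcap -e lnksta -e aspm

  # error/drop counters before and during heavy OSD traffic
  ethtool -S enp3s0f0 | grep -i -e err -e drop -e fifo

  # watch for PCIe AER events or igb resets while the OSDs are busy
  dmesg -w | grep -i -e aer -e igb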
Re: [ceph-users] Ceph OSD daemon causes network card issues
[Thu Jul 18 10:36:10 2019] igb :03:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Thu Jul 18 10:36:10 2019] igb :03:00.1: added PHC on eth1
[Thu Jul 18 10:36:10 2019] igb :03:00.1: Intel(R) Gigabit Ethernet Network Connection
[Thu Jul 18 10:36:10 2019] igb :03:00.1: eth1: (PCIe:2.5Gb/s:Width x4) 00:25:90:eb:1c:21
[Thu Jul 18 10:36:10 2019] igb :03:00.1: eth1: PBA No: Unknown
[Thu Jul 18 10:36:10 2019] igb :03:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Thu Jul 18 10:36:11 2019] igb :04:00.0: added PHC on eth2
[Thu Jul 18 10:36:11 2019] igb :04:00.0: Intel(R) Gigabit Ethernet Network Connection
[Thu Jul 18 10:36:11 2019] igb :04:00.0: eth2: (PCIe:2.5Gb/s:Width x4) 00:25:90:eb:1c:22
[Thu Jul 18 10:36:11 2019] igb :04:00.0: eth2: PBA No: Unknown
[Thu Jul 18 10:36:11 2019] igb :04:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Thu Jul 18 10:36:11 2019] igb :04:00.1: added PHC on eth3
[Thu Jul 18 10:36:11 2019] igb :04:00.1: Intel(R) Gigabit Ethernet Network Connection
[Thu Jul 18 10:36:11 2019] igb :04:00.1: eth3: (PCIe:2.5Gb/s:Width x4) 00:25:90:eb:1c:23
[Thu Jul 18 10:36:11 2019] igb :04:00.1: eth3: PBA No: Unknown
[Thu Jul 18 10:36:11 2019] igb :04:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[Thu Jul 18 10:36:11 2019] igb :04:00.1 enp4s0f1: renamed from eth3
[Thu Jul 18 10:36:11 2019] igb :03:00.0 enp3s0f0: renamed from eth0
[Thu Jul 18 10:36:11 2019] igb :03:00.1 enp3s0f1: renamed from eth1
[Thu Jul 18 10:36:11 2019] igb :04:00.0 enp4s0f0: renamed from eth2
[Thu Jul 18 10:36:18 2019] igb :04:00.1 enp4s0f1: igb: enp4s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[Thu Jul 18 10:36:18 2019] igb :03:00.0 enp3s0f0: igb: enp3s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[Thu Jul 18 10:36:19 2019] igb :04:00.0 enp4s0f0: igb: enp4s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[Thu Jul 18 10:36:19 2019] igb :03:00.1 enp3s0f1: igb: enp3s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

cephuser@cephnode6:~$ uptime
 14:39:44 up 4:03, 1 user, load average: 1.84, 1.59, 1.57
cephuser@cephnode6:~$

Kind regards
Geoffrey Rhodes

On Thu, 18 Jul 2019 at 14:02, Konstantin Shalygin wrote:
> I've been having an issue since upgrading my cluster to Mimic 6 months ago
> (previously installed with Luminous 12.2.1). [...]
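If a failure takes out remote access, the kernel messages from that boot can still be recovered after a reboot, assuming systemd-journald has persistent storage enabled; a sketch:

  # kernel log from the previous boot, filtered to the igb ports
  sudo journalctl -k -b -1 | grep -i -e igb -e 'pcie link lost'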
[ceph-users] Ceph OSD daemon possibly causes network card issues
Hi Paul,

Thanks, I've run the following on all four interfaces:
sudo ethtool -K rx off tx off sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off ntuple off rxhash off

The following do not seem to be available to change:
Cannot change udp-fragmentation-offload
Cannot change large-receive-offload
Cannot change ntuple-filters

Holding thumbs this helps, however I still don't understand why the issue only occurs on ceph-osd nodes. ceph-mon and ceph-mds nodes and even a Ceph client with the same adapters do not have these issues.

Kind regards
Geoffrey Rhodes

On Thu, 18 Jul 2019 at 18:35, Paul Emmerich wrote:
> Hi,
>
> Intel 82576 is bad. I've seen quite a few problems with these older
> igb family NICs, but losing the PCIe link is a new one.
> I usually see them getting stuck with a message like "tx queue X hung,
> resetting device..."
>
> Try to disable offloading features using ethtool, that sometimes helps
> with the problems that I've seen. Maybe that's just a variant of the stuck
> problem?
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Jul 18, 2019 at 12:47 PM Geoffrey Rhodes wrote:
>> Hi Cephers,
>>
>> I've been having an issue since upgrading my cluster to Mimic 6 months
>> ago (previously installed with Luminous 12.2.1). [...]
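For reference, a per-interface form of those ethtool commands, sketched with the interface names from the earlier logs (they may differ per node, and options the driver does not support are simply reported as unchangeable):

  for nic in enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1; do
      # disable the offload features, one interface at a time
      sudo ethtool -K "$nic" rx off tx off sg off tso off gso off gro off \
          lro off rxvlan off txvlan off ntuple off rxhash off
      # confirm what actually changed
      sudo ethtool -k "$nic" | grep -E 'offload|scatter|segmentation'
  done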
[ceph-users] Ceph OSD daemon possibly causes network card issues
Hi Konstantin,

Thanks, I've run the following on all four interfaces:
sudo ethtool -K rx off tx off sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off ntuple off rxhash off

The following do not seem to be available to change:
Cannot change udp-fragmentation-offload
Cannot change large-receive-offload
Cannot change ntuple-filters

Holding thumbs this helps, however I still don't understand why the issue only occurs on ceph-osd nodes. ceph-mon and ceph-mds nodes and even a Ceph client with the same adapters do not have these issues.

Kind regards
Geoffrey Rhodes

On Fri, 19 Jul 2019 at 05:24, Konstantin Shalygin wrote:
> On 7/18/19 7:43 PM, Geoffrey Rhodes wrote:
> > Sure, also attached.
>
> Try to disable flow control via `ethtool -K rx off tx off`.
>
> k
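A hedged aside on the flow-control suggestion: "ethtool -K" toggles protocol offloads, while link-level pause frames are controlled with "ethtool -A", so disabling flow control itself would look more like the following (interface name taken from the earlier logs):

  # show current pause-frame settings
  sudo ethtool -a enp3s0f0
  # disable rx/tx pause frames on that port
  sudo ethtool -A enp3s0f0 rx off tx off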