Re: [ceph-users] Issue with journal on another drive
I think I got over 10% improvement when I changed from a cooked journal file on the btrfs-based system SSD to a raw partition on the same SSD. The cluster I've been testing with is all consumer grade stuff running on top of AMD Piledriver and Kaveri based mobos with the on-board SATA. My SSDs are a hodgepodge of OCZ Vertex 4 and Samsung 840 and 850 (non-pro). I'm also seeing a performance win by merging individual osds into btrfs mirror sets after doing that and dropping the replica count from 3 to 2. I also consider this a better defense-in-depth strategy, since btrfs self-heals when it hits bit rot on the mirrors and raid sets. That boost was probably aio and dio kicking in because of the raw versus cooked journal. Note that I'm running Hammer on gentoo and my current WIP is moving kernels from 3.8 to 4.0.5 everywhere. It will be interesting to see what happens with that. Regards Bill
On 09/29/2015 07:32 AM, Jiri Kanicky wrote: Hi Lionel. Thank you for your reply. In this case I am considering creating separate partitions for each disk on the SSD drive. It would be good to know what the performance difference is, because creating partitions is kind of a waste of space. One more question: is it a good idea to move the journals for 3 OSDs to a single SSD, considering that if the SSD fails the whole node with 3 HDDs will be down? Thinking about it, leaving the journal on each OSD might be safer, because a journal on one disk does not affect the other disks (OSDs). Or do you think that having the journal on SSD is a better trade-off? Thank you Jiri
On 29/09/2015 21:10, Lionel Bouton wrote: On 29/09/2015 07:29, Jiri Kanicky wrote: Hi, Is it possible to create the journal in a directory as explained here: http://wiki.skytech.dk/index.php/Ceph_-_howto,_rbd,_lvm,_cluster#Add.2Fmove_journal_in_running_cluster Yes, the general idea (stop, flush, move, update ceph.conf, mkjournal, start) is valid for moving your journal wherever you want. That said, it probably won't perform as well on a filesystem (LVM has lower overhead than a filesystem). 1. Create BTRFS over /dev/sda6 (assuming this is the SSD partition allocated for the journal) and mount it to /srv/ceph/journal BTRFS is probably the worst idea for hosting journals. If you must use BTRFS, you'll have to make sure that the journals are created NoCoW before the first byte is ever written to them. 2. Add OSD: ceph-deploy osd create --fs-type btrfs ceph1:sdb:/srv/ceph/journal/osd$id/journal I've no experience with ceph-deploy... Best regards, Lionel
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
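For reference, the stop/flush/move/mkjournal/start sequence Lionel mentions looks roughly like this for a single OSD. This is a sketch only: osd.0, the partition UUID and the init commands are placeholders, so adjust for your own IDs, devices and init system.

  # stop the OSD so nothing is writing to the old journal
  /etc/init.d/ceph stop osd.0
  ceph-osd -i 0 --flush-journal
  rm /var/lib/ceph/osd/ceph-0/journal
  # point the OSD at the raw partition (by-partuuid avoids device renumbering surprises)
  ln -s /dev/disk/by-partuuid/<journal-partition-uuid> /var/lib/ceph/osd/ceph-0/journal
  ceph-osd -i 0 --mkjournal
  /etc/init.d/ceph start osd.0

If you do keep journals as files on BTRFS, the NoCoW attribute has to go on before anything is written, e.g. chattr +C on the still-empty journal directory so new journal files inherit it; setting it after data is already in place does nothing for that data.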
[ceph-users] hadoop on cephfs
Just got into a discussion today where I may have a chance to do work with a db guy who wants hadoop and I want to steer him to it on cephfs. While I'd really like to run gentoo with either infernalis or jewel (when it becomes stable in portage), odds are more likely that I will be required to use rhel/centos6.7 and thus stuck back at Hammer. Any thoughts? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] hadoop on cephfs
Actually this guy is already a fan of Hadoop. I was just wondering whether anyone has been playing around with it on top of cephfs lately. It seems like the last round of papers were from around cuttlefish.
On 04/28/2016 06:21 AM, Oliver Dzombic wrote: Hi, bad idea :-) It's of course nice and important to drag a developer towards a new/promising technology/software. But if the technology does not match the individual's requirements, you just risk showing this developer how bad this new/promising technology is. So you will just achieve the opposite of what you want. So before you do something, usually something big, like Hadoop on top of unstable software, maybe you should not do it at all. For the good of the developer, for your own good, and for the good of the reputation of the new/promising technology/software you favor. Forcing a penguin to somehow live in the Sahara might be possible (at least for some time), but it's usually not a good idea ;-)
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Deploying ceph by hand: a few omissions
Actually you didn't need to do a udev rule for raw journals. Disk devices in gentoo have their group ownership set to 'disk'. I only needed to drop ceph into that in /etc/group when going from hammer to infernalis. Did you poke around any of the ceph howto's on the gentoo wiki? It's been a while since I wrote this guide when I first rolled out with firefly: https://wiki.gentoo.org/wiki/Ceph/Guide That used to be https://wiki.gentoo.org/wiki/Ceph before other people came in behind me and expanded on things I've pretty much had these bookmarks sitting around forever for adding and removing mons and osds http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/ http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/ For the MDS server I think I originally went to this blog which also has other good info. http://www.sebastien-han.fr/blog/2013/05/13/deploy-a-ceph-mds-server/ On 05/01/2016 06:46 AM, Stuart Longland wrote: Hi all, This evening I was in the process of deploying a ceph cluster by hand. I did it by hand because to my knowledge, ceph-deploy doesn't support Gentoo, and my cluster here runs that. The instructions I followed are these ones: http://docs.ceph.com/docs/master/install/manual-deployment and I'm running the 10.0.2 release of Ceph: ceph version 10.0.2 (86764eaebe1eda943c59d7d784b893ec8b0c6ff9) Things went okay bootstrapping the monitors. I'm running a 3-node cluster, with OSDs and monitors co-located. Each node has a 1TB 2.5" HDD and a 40GB partition on SSD for the journal. Things went pear shaped however when I tried bootstrapping the OSDs. All was going fine until it came time to activate my first OSD. ceph-disk activate barfed because I didn't have the bootstrap-osd key. No one told me I needed to create one, or how to do it. There's a brief note about using --activate-key, but no word on what to pass as the argument. I tried passing in my admin keyring in /etc/ceph, but it didn't like that. In the end, I muddled my way through the manual OSD deployment steps, which worked fine. After correcting permissions for the ceph user, I found the OSDs came up. As an added bonus, I now know how to work around the journal permission issue at work since I've reproduced it here, using a UDEV rules file like the following: SUBSYSTEM=="block", KERNEL=="sda7", OWNER="ceph", GROUP="ceph", MODE="0600" The cluster seems to be happy enough now, but some notes on how one generates the OSD activation keys to use with `ceph-disk activate` would be a big help. Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
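For anyone else who hits the missing bootstrap-osd key: it can be generated on a node that has the admin keyring with something like the following (the path is the usual default, adjust if yours differs), after which ceph-disk activate with --activate-key should be happy. A sketch of the standard approach, not a claim about what the manual deployment docs intended; /dev/sdb1 is just a placeholder for the OSD data partition.

  mkdir -p /var/lib/ceph/bootstrap-osd
  ceph auth get-or-create client.bootstrap-osd mon 'allow profile bootstrap-osd' \
      -o /var/lib/ceph/bootstrap-osd/ceph.keyring
  ceph-disk activate /dev/sdb1 --activate-key /var/lib/ceph/bootstrap-osd/ceph.keyring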
Re: [ceph-users] Deploying ceph by hand: a few omissions
I have an active and a standby setup. The failover takes less than a minute if you manually stop the active service. Add whatever the timeout is for the failover to happen if things go pear shaped for the box. Things are back to letters now for mds servers. I had started with letters on firefly as recommended. Then somewhere (giant?), I was getting prodded to use numbers instead. Now with later hammer and infernalis, I'm back to getting scolded for not using letters :-) I'm holding off on jewel for the moment until I get things straightened out with the kde4 to plasma upgrade. I think that one got stabilized before it was quite ready for prime time. Even then I'll probably take a good long time to back up some stuff before I try out the shiny new fsck utility.
On 05/01/2016 07:13 PM, Stuart Longland wrote: Hi Bill, On 02/05/16 04:37, Bill Sharer wrote: Actually you didn't need to do a udev rule for raw journals. Disk devices in gentoo have their group ownership set to 'disk'. I only needed to drop ceph into that in /etc/group when going from hammer to infernalis. Yeah, I recall trying that on the Ubuntu-based Ceph cluster at work, and Ceph still wasn't happy, hence I've gone the route of making the partition owned by the ceph user. Did you poke around any of the ceph howto's on the gentoo wiki? It's been a while since I wrote this guide when I first rolled out with firefly: https://wiki.gentoo.org/wiki/Ceph/Guide That used to be https://wiki.gentoo.org/wiki/Ceph before other people came in behind me and expanded on things No, hadn't looked at that. I've pretty much had these bookmarks sitting around forever for adding and removing mons and osds http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/ http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/ For the MDS server I think I originally went to this blog which also has other good info. http://www.sebastien-han.fr/blog/2013/05/13/deploy-a-ceph-mds-server/ That might be my next step, depending on how stable CephFS is now. One thing that has worried me is since you can only deploy one MDS, what happens if that MDS goes down? If it's simply a case of spin up another one, then fine, I can put up with a little downtime. If there's data loss though, then no, that's not good.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
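For completeness, checking which MDS is active versus standby, and forcing the kind of failover described above, can be done from any admin node. A minimal sketch, assuming a single active rank 0:

  ceph mds stat          # shows the active MDS and any standbys
  ceph mds fail 0        # mark rank 0 failed so a standby takes over

The second command is the "manually stop the active service" case without logging into the box; actual failover time still depends on the mds beacon/timeout settings.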
Re: [ceph-users] Disk failures
This is why I use btrfs mirror sets underneath ceph and hopefully more than make up for the space loss by going with 2 replicas instead of 3 and on-the-fly lzo compression. The ceph deep scrubs replace any need for btrfs scrubs, but I still get the benefit of self healing when btrfs finds bit rot. The only errors I've run into are from hard shutdowns and possible ecc errors due to working with consumer hardware and memory. I've been on top of btrfs using gentoo since Firefly. Bill Sharer
On 06/14/2016 09:27 PM, Christian Balzer wrote: Hello, On Tue, 14 Jun 2016 14:26:41 +0200 Jan Schermer wrote: Hi, bit rot is not "bit rot" per se - nothing is rotting on the drive platter. Never mind that I used the wrong terminology (according to Wiki) and that my long experience with "laser-rot" probably caused me to choose that term; there are data degradation scenarios that are caused by undetected media failures or by corruption happening in the write path, thus making them quite reproducible. It occurs during reads (mostly, anyway), and it's random. You can happily read a block and get the correct data, then read it again and get garbage, then get correct data again. This could be caused by a worn out cell on an SSD, but firmware looks for that and rewrites the cell if the signal is attenuated too much. On spinners there are no cells to refresh, so rewriting doesn't help. You can't really "look for" bit rot due to the reasons above; strong checksumming/hash verification during reads is the only solution. Which is what I've been saying in the mail below and for years on this ML. And that makes deep-scrubbing something of quite limited value. Christian And trust me, bit rot is a very real thing and very dangerous as well - do you think companies like Seagate or WD would lie about bit rot if it's not real? I'd buy a drive with BER 10^999 over one with 10^14, wouldn't everyone? And it is especially dangerous when something like Ceph handles much larger blocks of data than the client does. While the client (or an app) has some knowledge of the data _and_ hopefully throws an error if it read garbage, Ceph will (if for example snapshots are used and FIEMAP is off) actually have to read the whole object (say 4MiB) and write it elsewhere, without any knowledge of whether what it read (and wrote) made any sense to the app. This way corruption might spread silently into your backups if you don't validate the data somehow (or dump it from a database for example, where it's likely to get detected). Btw just because you think you haven't seen it doesn't mean you haven't seen it - never seen artefacting in movies? Just a random bug in the decoder, is it? VoD guys would tell you... For things like databases this is somewhat less impactful - bit rot doesn't "flip a bit" but affects larger blocks of data (like one sector), so databases usually catch this during read and err instead of returning garbage to the client. Jan
On 09 Jun 2016, at 09:16, Christian Balzer wrote: Hello, On Thu, 9 Jun 2016 08:43:23 +0200 Gandalf Corvotempesta wrote: On 09 Jun 2016 02:09, "Christian Balzer" wrote: Ceph currently doesn't do any (relevant) checksumming at all, so if a PRIMARY PG suffers from bit-rot this will be undetected until the next deep-scrub. This is one of the longest and gravest outstanding issues with Ceph and is supposed to be addressed with bluestore (which currently doesn't have checksum verified reads either). So if bit rot happens on the primary PG, is ceph spreading the corrupted data across the cluster? No. 
You will want to re-read the Ceph docs and the countless posts here about how replication within Ceph works. http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale A client write goes to the primary OSD/PG and will not be ACK'ed to the client until it has reached all replica OSDs. This happens while the data is in-flight (in RAM); it's not read from the journal or filestore. What would be sent to the replica, the original data or the saved one? When bit rot happens I'll have 1 corrupted object and 2 good ones. How do you manage this between deep scrubs? Which data would be used by ceph? I think that bitrot on a huge VM block device could lead to a mess, like the whole device being corrupted. Would a VM affected by bitrot be able to stay up and running? And bitrot on a qcow2 file? Bitrot is a bit hyped, I haven't seen any on the Ceph clusters I run nor on other systems here where I (can) actually check for it. As to how it would affect things, that very much depends. If it's something like a busy directory inode that gets corrupted, the data in question will be in RAM (SLAB) and the next update will correct things. If it's a logfile, you're likely to never notice until deep-scrub eventually detects it.
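To make the deep-scrub part concrete: when a deep-scrub does flag a mismatch the PG shows up as inconsistent, and on Jewel you can at least see which object and which shard disagree before deciding what to do. A sketch, with <pgid> as a placeholder:

  ceph health detail                                    # lists any pgs flagged inconsistent
  rados list-inconsistent-obj <pgid> --format=json-pretty
  ceph pg repair <pgid>

The caveat is exactly the one Christian raises: on filestore, repair generally trusts the primary's copy, so if the primary is the rotten one a repair can copy the damage around rather than fix it. Check the shard details (or the underlying filesystem, e.g. a btrfs scrub on a setup like Bill's) before repairing.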
[ceph-users] Active MON aborts on Jewel 10.2.2 with FAILED assert(info.state == MDSMap::STATE_STANDBY
0 log_channel(cluster) log [INF] : HEALTH_ERR; 187 pgs are stuck inac tive for more than 300 seconds; 114 pgs degraded; 138 pgs peering; 49 pgs stale; 138 pgs stuck inactive; 49 pg s stuck stale; 114 pgs stuck unclean; 114 pgs undersized; recovery 1019372/17575834 objects degraded (5.800%); too many PGs per OSD (449 > max 300); 1/4 in osds are down; noout flag(s) set; 2 mons down, quorum 0,2,3 0,3,4 2016-07-03 17:12:55.193094 7f828388e700 0 log_channel(cluster) log [INF] : monmap e10: 5 mons at {0=192.168.2.1:6789/0,1=192.168.2.2:6789/0,3=192.168.2.4:6789/0,4=192.168.2.5:6789/0,5=192.168.2.6:6789/0} 2016-07-03 17:12:55.193254 7f828388e700 0 log_channel(cluster) log [INF] : pgmap v17870289: 768 pgs: 49 stale+active+clean, 114 active+undersized+degraded, 138 peering, 467 active+clean; 7128 GB data, 16620 GB used, 4824 GB / 22356 GB avail; 1019372/17575834 objects degraded (5.800%) 2016-07-03 17:12:55.195553 7f828388e700 -1 mon/MDSMonitor.cc: In function 'bool MDSMonitor::maybe_promote_standby(std::shared_ptr)' thread 7f828388e700 time 2016-07-03 17:12:55.193360 mon/MDSMonitor.cc: 2796: FAILED assert(info.state == MDSMap::STATE_STANDBY) ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x556e001f1e12] 2: (MDSMonitor::maybe_promote_standby(std::shared_ptr)+0x97f) [0x556dffede53f] 3: (MDSMonitor::tick()+0x3b6) [0x556dffee0866] 4: (MDSMonitor::on_active()+0x28) [0x556dffed9038] 5: (PaxosService::_active()+0x66a) [0x556dffe5968a] 6: (Context::complete(int)+0x9) [0x556dffe249a9] 7: (void finish_contexts(CephContext*, std::liststd::allocator >&, int)+0xac) [0x556dffe2ba7c] 8: (Paxos::finish_round()+0xd0) [0x556dffe50460] 9: (Paxos::handle_last(std::shared_ptr)+0x103d) [0x556dffe51acd] 10: (Paxos::dispatch(std::shared_ptr)+0x38c) [0x556dffe5254c] 11: (Monitor::dispatch_op(std::shared_ptr)+0xd3b) [0x556dffe2245b] 12: (Monitor::_ms_dispatch(Message*)+0x581) [0x556dffe22b91] 13: (Monitor::ms_dispatch(Message*)+0x23) [0x556dffe41393] 14: (DispatchQueue::entry()+0x7ba) [0x556e002e722a] 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x556e001d62cd] 16: (()+0x74a4) [0x7f8290f904a4] 17: (clone()+0x6d) [0x7f828f29298d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. If asked, I'll dump the rest of the log. Bill Sharer ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Active MON aborts on Jewel 10.2.2 with FAILED assert(info.state == MDSMap::STATE_STANDBY
Relevant USE flags FWIW # emerge -pv ceph These are the packages that would be merged, in order: Calculating dependencies... done! [ebuild R ~] sys-cluster/ceph-10.2.2::gentoo USE="fuse gtk jemalloc ldap libaio libatomic nss radosgw static-libs xfs -babeltrace -cephfs -cryptopp -debug -lttng -tcmalloc {-test} -zfs" PYTHON_TARGETS="python2_7 python3_4 -python3_5" 11,271 KiB Bill Sharer On 07/05/2016 01:45 PM, Gregory Farnum wrote: Thanks for the report; created a ticket and somebody will get on it shortly. http://tracker.ceph.com/issues/16592 -Greg On Sun, Jul 3, 2016 at 5:55 PM, Bill Sharer wrote: I was working on a rolling upgrade on Gentoo to Jewel 10.2.2 from 10.2.0. However now I can't get a monitor quorum going again because as soon as I get one, the mon which wins the election blows out with an assertion failure. Here's my status at the moment kroll110.2.2ceph mon.0 and ceph osd.0 normally my lead mon kroll210.2.2ceph mon 1 and ceph osd 2 kroll310.2.2ceph osd 1 kroll410.2.2ceph mon 3 and ceph osd 3 kroll510.2.2ceph mon 4 and ceph mds 2 normally my active mds kroll610.2.0ceph mon 5 and ceph mds B normally standby mds I had done rolling upgrade of everything but kroll6 and had rebooted the first three osd and mon servers. mds 2 went down during gentoo update of kroll5 because of memory scarcity so mds B was the active mds server. After rebooting kroll4 I found that mon 0 had gone done with the assertion failure. I ended up stopping all ceph processes but desktops with client mounts were all still up for the moment and basically would be stuck on locks if I tried to access cephfs. After trying to restart mons only beginning with mon 0 initially, the following happened to mon.0 after enough mons were up for a quorum: 2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390 2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started 2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK 2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388 2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390 2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap 2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), pro cess ceph-mon, pid 20842 2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398 2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398 2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396 2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1 2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap 2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map e1532 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} Filesystem 'cephfs' (0) fs_name cephfs epoch 1530 flags 0 modified2016-05-19 01:21:31.953710 tableserver 0 root0 
session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure1478 last_failure_osd_epoch 26431 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} max_mds 1 in 0 up {0=1190233} failed damaged stopped data_pools 0 metadata_pool 1 inline_data disabled 1190233:192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145 Standby daemons: 1190222:192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114 2016-07-03 17:09:25.776444 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires 2016-07-03 17:09:25.776450 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires 2016-07-03 17:09:25.776453 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires 2016-07-03 17:09:25.776454 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires 2016-07-03 17:09:25.776696 7f8291dff480 1 mon.0@-1(probing).paxosservice(auth 19251..1934
Re: [ceph-users] Active MON aborts on Jewel 10.2.2 with FAILED assert(info.state == MDSMap::STATE_STANDBY
I noticed on that USE list that the 10.2.2 ebuild introduced a new cephfs emerge flag, so I enabled that and emerged everywhere again. The active mon is still crashing on the assertion though. Bill Sharer On 07/05/2016 08:14 PM, Bill Sharer wrote: Relevant USE flags FWIW # emerge -pv ceph These are the packages that would be merged, in order: Calculating dependencies... done! [ebuild R ~] sys-cluster/ceph-10.2.2::gentoo USE="fuse gtk jemalloc ldap libaio libatomic nss radosgw static-libs xfs -babeltrace -cephfs -cryptopp -debug -lttng -tcmalloc {-test} -zfs" PYTHON_TARGETS="python2_7 python3_4 -python3_5" 11,271 KiB Bill Sharer On 07/05/2016 01:45 PM, Gregory Farnum wrote: Thanks for the report; created a ticket and somebody will get on it shortly. http://tracker.ceph.com/issues/16592 -Greg On Sun, Jul 3, 2016 at 5:55 PM, Bill Sharer wrote: I was working on a rolling upgrade on Gentoo to Jewel 10.2.2 from 10.2.0. However now I can't get a monitor quorum going again because as soon as I get one, the mon which wins the election blows out with an assertion failure. Here's my status at the moment kroll110.2.2ceph mon.0 and ceph osd.0 normally my lead mon kroll210.2.2ceph mon 1 and ceph osd 2 kroll310.2.2ceph osd 1 kroll410.2.2ceph mon 3 and ceph osd 3 kroll510.2.2ceph mon 4 and ceph mds 2 normally my active mds kroll610.2.0ceph mon 5 and ceph mds B normally standby mds I had done rolling upgrade of everything but kroll6 and had rebooted the first three osd and mon servers. mds 2 went down during gentoo update of kroll5 because of memory scarcity so mds B was the active mds server. After rebooting kroll4 I found that mon 0 had gone done with the assertion failure. I ended up stopping all ceph processes but desktops with client mounts were all still up for the moment and basically would be stuck on locks if I tried to access cephfs. 
After trying to restart mons only beginning with mon 0 initially, the following happened to mon.0 after enough mons were up for a quorum: 2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390 2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started 2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK 2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388 2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390 2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap 2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), pro cess ceph-mon, pid 20842 2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398 2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398 2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396 2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1 2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap 2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map e1532 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} Filesystem 'cephfs' (0) fs_name cephfs epoch 1530 flags 0 modified2016-05-19 01:21:31.953710 tableserver 0 root0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure1478 last_failure_osd_epoch 26431 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} max_mds 1 in 0 up {0=1190233} failed damaged stopped data_pools 0 metadata_pool 1 inline_data disabled 1190233:192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145 Standby daemons: 1190222:192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114 2016-07-03 17:09:25.776444 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires 2016-07-03 17:09:25.776450 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires 2016-07-03 17:09:25.776453 7f8291dff480 0 mon.0@-1(probing).
Re: [ceph-users] Active MON aborts on Jewel 10.2.2 with FAILED assert(info.state == MDSMap::STATE_STANDBY
Manual downgrade to 10.2.0 put me back in business. I'm going to mask 10.2.2 and then try to let 10.2.1 emerge. Bill Sharer On 07/06/2016 02:16 PM, Bill Sharer wrote: I noticed on that USE list that the 10.2.2 ebuild introduced a new cephfs emerge flag, so I enabled that and emerged everywhere again. The active mon is still crashing on the assertion though. Bill Sharer On 07/05/2016 08:14 PM, Bill Sharer wrote: Relevant USE flags FWIW # emerge -pv ceph These are the packages that would be merged, in order: Calculating dependencies... done! [ebuild R ~] sys-cluster/ceph-10.2.2::gentoo USE="fuse gtk jemalloc ldap libaio libatomic nss radosgw static-libs xfs -babeltrace -cephfs -cryptopp -debug -lttng -tcmalloc {-test} -zfs" PYTHON_TARGETS="python2_7 python3_4 -python3_5" 11,271 KiB Bill Sharer On 07/05/2016 01:45 PM, Gregory Farnum wrote: Thanks for the report; created a ticket and somebody will get on it shortly. http://tracker.ceph.com/issues/16592 -Greg On Sun, Jul 3, 2016 at 5:55 PM, Bill Sharer wrote: I was working on a rolling upgrade on Gentoo to Jewel 10.2.2 from 10.2.0. However now I can't get a monitor quorum going again because as soon as I get one, the mon which wins the election blows out with an assertion failure. Here's my status at the moment kroll110.2.2ceph mon.0 and ceph osd.0 normally my lead mon kroll210.2.2ceph mon 1 and ceph osd 2 kroll310.2.2ceph osd 1 kroll410.2.2ceph mon 3 and ceph osd 3 kroll510.2.2ceph mon 4 and ceph mds 2 normally my active mds kroll610.2.0ceph mon 5 and ceph mds B normally standby mds I had done rolling upgrade of everything but kroll6 and had rebooted the first three osd and mon servers. mds 2 went down during gentoo update of kroll5 because of memory scarcity so mds B was the active mds server. After rebooting kroll4 I found that mon 0 had gone done with the assertion failure. I ended up stopping all ceph processes but desktops with client mounts were all still up for the moment and basically would be stuck on locks if I tried to access cephfs. 
After trying to restart mons only beginning with mon 0 initially, the following happened to mon.0 after enough mons were up for a quorum: 2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390 2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started 2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK 2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388 2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390 2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap 2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), pro cess ceph-mon, pid 20842 2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398 2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398 2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396 2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1 2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap 2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map e1532 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} Filesystem 'cephfs' (0) fs_name cephfs epoch 1530 flags 0 modified2016-05-19 01:21:31.953710 tableserver 0 root0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure1478 last_failure_osd_epoch 26431 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} max_mds 1 in 0 up {0=1190233} failed damaged stopped data_pools 0 metadata_pool 1 inline_data disabled 1190233:192.168.2.6:6800/5437 'B' mds.0.1526 up:active seq 103145 Standby daemons: 1190222:192.168.2.5:6801/5871 '2' mds.-1.0 up:standby seq 135114 2016-07-03 17:09:25.776444 7f8291dff480 0 mon.0@-1(probing).osd e26460 crush map has features 2200130813952, adjusting msgr requires 2016-07-03 17:09:25
Re: [ceph-users] Active MON aborts on Jewel 10.2.2 with FAILED assert(info.state == MDSMap::STATE_STANDBY
Just for giggles I tried the rolling upgrade to 10.2.2 again today. This time I rolled mon.0 and osd.0 first while keeping the mds servers up and then rolled them before moving on to the other three. No assertion failure this time since I guess I always had an mds active. I wonder if I will have a problem though if I do a complete cold start of the cluster. Bill Sharer On 07/06/2016 04:19 PM, Bill Sharer wrote: Manual downgrade to 10.2.0 put me back in business. I'm going to mask 10.2.2 and then try to let 10.2.1 emerge. Bill Sharer On 07/06/2016 02:16 PM, Bill Sharer wrote: I noticed on that USE list that the 10.2.2 ebuild introduced a new cephfs emerge flag, so I enabled that and emerged everywhere again. The active mon is still crashing on the assertion though. Bill Sharer On 07/05/2016 08:14 PM, Bill Sharer wrote: Relevant USE flags FWIW # emerge -pv ceph These are the packages that would be merged, in order: Calculating dependencies... done! [ebuild R ~] sys-cluster/ceph-10.2.2::gentoo USE="fuse gtk jemalloc ldap libaio libatomic nss radosgw static-libs xfs -babeltrace -cephfs -cryptopp -debug -lttng -tcmalloc {-test} -zfs" PYTHON_TARGETS="python2_7 python3_4 -python3_5" 11,271 KiB Bill Sharer On 07/05/2016 01:45 PM, Gregory Farnum wrote: Thanks for the report; created a ticket and somebody will get on it shortly. http://tracker.ceph.com/issues/16592 -Greg On Sun, Jul 3, 2016 at 5:55 PM, Bill Sharer wrote: I was working on a rolling upgrade on Gentoo to Jewel 10.2.2 from 10.2.0. However now I can't get a monitor quorum going again because as soon as I get one, the mon which wins the election blows out with an assertion failure. Here's my status at the moment kroll110.2.2ceph mon.0 and ceph osd.0 normally my lead mon kroll210.2.2ceph mon 1 and ceph osd 2 kroll310.2.2ceph osd 1 kroll410.2.2ceph mon 3 and ceph osd 3 kroll510.2.2ceph mon 4 and ceph mds 2 normally my active mds kroll610.2.0ceph mon 5 and ceph mds B normally standby mds I had done rolling upgrade of everything but kroll6 and had rebooted the first three osd and mon servers. mds 2 went down during gentoo update of kroll5 because of memory scarcity so mds B was the active mds server. After rebooting kroll4 I found that mon 0 had gone done with the assertion failure. I ended up stopping all ceph processes but desktops with client mounts were all still up for the moment and basically would be stuck on locks if I tried to access cephfs. 
After trying to restart mons only beginning with mon 0 initially, the following happened to mon.0 after enough mons were up for a quorum: 2016-07-03 16:34:26.555728 7fbff22f8480 1 leveldb: Recovering log #2592390 2016-07-03 16:34:26.555762 7fbff22f8480 1 leveldb: Level-0 table #2592397: started 2016-07-03 16:34:26.558788 7fbff22f8480 1 leveldb: Level-0 table #2592397: 192 bytes OK 2016-07-03 16:34:26.562263 7fbff22f8480 1 leveldb: Delete type=3 #2592388 2016-07-03 16:34:26.562364 7fbff22f8480 1 leveldb: Delete type=0 #2592390 2016-07-03 16:34:26.563126 7fbff22f8480 -1 wrote monmap to /etc/ceph/tmpmonmap 2016-07-03 17:09:25.753729 7f8291dff480 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), pro cess ceph-mon, pid 20842 2016-07-03 17:09:25.762588 7f8291dff480 1 leveldb: Recovering log #2592398 2016-07-03 17:09:25.767722 7f8291dff480 1 leveldb: Delete type=0 #2592398 2016-07-03 17:09:25.767803 7f8291dff480 1 leveldb: Delete type=3 #2592396 2016-07-03 17:09:25.768600 7f8291dff480 0 starting mon.0 rank 0 at 192.168.2.1:6789/0 mon_data /var/lib/ceph/mon/ceph-0 fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769066 7f8291dff480 1 mon.0@-1(probing) e10 preinit fsid 1798897a-f0c9-422d-86b3-d4933a12c7ac 2016-07-03 17:09:25.769923 7f8291dff480 1 mon.0@-1(probing).paxosservice(pgmap 17869652..17870289) refresh upgraded, format 0 -> 1 2016-07-03 17:09:25.769947 7f8291dff480 1 mon.0@-1(probing).pg v0 on_upgrade discarding in-core PGMap 2016-07-03 17:09:25.776148 7f8291dff480 0 mon.0@-1(probing).mds e1532 print_map e1532 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} Filesystem 'cephfs' (0) fs_name cephfs epoch 1530 flags 0 modified2016-05-19 01:21:31.953710 tableserver 0 root0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure1478 last_failure_osd_epoch 26431 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} max_mds 1 in 0 up {0=1
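For the archives, the ordering that avoided the assert here seems to match the documented rolling-upgrade order: all mons first, then OSDs, then MDS daemons, keeping one MDS active throughout. A rough per-host sketch; the emerge atom and the service commands are placeholders for however your distro and init system manage the daemons:

  # monitor hosts, one at a time
  emerge --oneshot =sys-cluster/ceph-10.2.2    # or your package manager's equivalent
  /etc/init.d/ceph restart mon.0
  ceph -s                                      # wait for full quorum before the next mon
  # then OSD hosts, one at a time, waiting for active+clean between hosts
  /etc/init.d/ceph restart osd.0
  # MDS daemons last: standby first, then fail over and restart the previously active one
  /etc/init.d/ceph restart mds.B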
Re: [ceph-users] ONE pg deep-scrub blocks cluster
I suspect the data for one or more shards on this osd's underlying filesystem has a marginally bad sector or sectors. A read from the deep scrub may be causing the drive to perform repeated seeks and reads of the sector until it gets a good read from the filesystem. You might want to look at the SMART info on the drive or drives in the RAID set to see what the error counts suggest about this. You may also be looking at a drive that's about to fail. Bill Sharer On 07/28/2016 08:46 AM, c wrote: Hello Ceph alikes :) i have a strange issue with one PG (0.223) combined with "deep-scrub". Always when ceph - or I manually - run a " ceph pg deep-scrub 0.223 ", this leads to many "slow/block requests" so that nearly all of my VMs stop working for a while. This happens only to this one PG 0.223 and in combination with deep-scrub (!). All other Placement Groups where a deep-scrub occurs are fine. The mentioned PG also works fine when a "normal scrub" occurs. These OSDs are involved: #> ceph pg map 0.223 osdmap e7047 pg 0.223 (0.223) -> up [4,16,28] acting [4,16,28] *The LogFiles* "deep-scrub" starts @ 2016-07-28 12:44:00.588542 and takes approximately 12 Minutes (End: 2016-07-28 12:56:31.891165) - ceph.log: http://pastebin.com/FSY45VtM I have done " ceph tell osd injectargs '--debug-osd = 5/5' " for the related OSDs 4,16 and 28 LogFile - osd.4 - ceph-osd.4.log: http://slexy.org/view/s20zzAfxFH LogFile - osd.16 - ceph-osd.16.log: http://slexy.org/view/s25H3Zvkb0 LogFile - osd.28 - ceph-osd.28.log: http://slexy.org/view/s21Ecpwd70 I have checked the disks 4,16 and 28 with smartctl and could not any issues - also there are no odd "dmesg" messages. *ceph -s* cluster 98a410bf-b823-47e4-ad17-4543afa24992 health HEALTH_OK monmap e2: 3 mons at {monitor1=172.16.0.2:6789/0,monitor3=172.16.0.4:6789/0,monitor2=172.16.0.3:6789/0} election epoch 38, quorum 0,1,2 monitor1,monitor2,monitor3 osdmap e7047: 30 osds: 30 up, 30 in flags sortbitwise pgmap v3253519: 1024 pgs, 1 pools, 2858 GB data, 692 kobjects 8577 GB used, 96256 GB / 102 TB avail 1024 active+clean client io 396 kB/s rd, 3141 kB/s wr, 55 op/s rd, 269 op/s wr This is my Setup: *Software/OS* - Jewel #> ceph tell osd.* version | grep version | uniq "version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)" #> ceph tell mon.* version [...] ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) - Ubuntu 16.04 LTS on all OSD and MON Server #> uname -a Linux galawyn 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux *Server* 3x OSD Server, each with - 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 Cores, no Hyper-Threading - 64GB RAM - 10x 4TB HGST 7K4000 SAS2 (6GB/s) Disks as OSDs - 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for 10-12 Disks - 1x Samsung SSD 840/850 Pro only for the OS 3x MON Server - Two of them with 1x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz (4 Cores, 8 Threads) - The third one has 2x Intel(R) Xeon(R) CPU L5430 @ 2.66GHz ==> 8 Cores, no Hyper-Threading - 32 GB RAM - 1x Raid 10 (4 Disks) *Network* - Each Server and Client has an active connection @ 1x 10GB; A second connection is also connected via 10GB but provides only a Backup connection when the active Switch fails - no LACP possible. - We do not use Jumbo Frames yet.. - Public and Cluster-Network related Ceph traffic is going through this one active 10GB Interface on each Server. Any ideas what is going on? Can I provide more input to find a solution? 
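Separate from finding the bad drive, the impact of that one deep-scrub on clients can be softened while you investigate. A sketch of the usual knobs on Jewel; the values are only examples, and the ioprio options only do anything when the OSD disks use the CFQ scheduler:

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
  ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'

Put the equivalents in ceph.conf under [osd] to make them stick across restarts. This doesn't fix whatever is wrong with the disks behind that PG, it just keeps the VMs breathing while you look.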
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ONE pg deep-scrub blocks cluster
Removing osd.4 and still getting the scrub problems removes its drive from consideration as the culprit. Try the same thing again for osd.16 and then osd.28. smartctl may not show anything out of sorts until the marginally bad sector or sectors finally goes bad and gets remapped. The only hint may be buried in the raw read error rate, seek error rate or other error counts like ecc or crc errors. The long test you are running may or may not show any new information. Bill Sharer On 07/28/2016 11:46 AM, c wrote: Am 2016-07-28 15:26, schrieb Bill Sharer: I suspect the data for one or more shards on this osd's underlying filesystem has a marginally bad sector or sectors. A read from the deep scrub may be causing the drive to perform repeated seeks and reads of the sector until it gets a good read from the filesystem. You might want to look at the SMART info on the drive or drives in the RAID set to see what the error counts suggest about this. You may also be looking at a drive that's about to fail. Bill Sharer Hello Bill, thank you for reading and answering my eMail :) As I wrote, I have already checked the disks via "smartctl" - osd.4: http://slexy.org/view/s2LR5ncr8G - osd.16: http://slexy.org/view/s2LH6FBcYP - osd.28: http://slexy.org/view/s21Yod9dUw Now there is running a long test " smartctl --test long /dev/DISK " on all disks - to be really on the safe side. This will take a while. There is no RAID used for the OSDs! I have forgot to mention that for a test I had removed (completely) "osd.4" from the Cluster and did run " ceph pg deep-scrub 0.223 " again with the same result (nearly all of my VMs stop working for a while). - Mehmet On 07/28/2016 08:46 AM, c wrote: Hello Ceph alikes :) i have a strange issue with one PG (0.223) combined with "deep-scrub". Always when ceph - or I manually - run a " ceph pg deep-scrub 0.223 ", this leads to many "slow/block requests" so that nearly all of my VMs stop working for a while. This happens only to this one PG 0.223 and in combination with deep-scrub (!). All other Placement Groups where a deep-scrub occurs are fine. The mentioned PG also works fine when a "normal scrub" occurs. These OSDs are involved: #> ceph pg map 0.223 osdmap e7047 pg 0.223 (0.223) -> up [4,16,28] acting [4,16,28] *The LogFiles* "deep-scrub" starts @ 2016-07-28 12:44:00.588542 and takes approximately 12 Minutes (End: 2016-07-28 12:56:31.891165) - ceph.log: http://pastebin.com/FSY45VtM I have done " ceph tell osd injectargs '--debug-osd = 5/5' " for the related OSDs 4,16 and 28 LogFile - osd.4 - ceph-osd.4.log: http://slexy.org/view/s20zzAfxFH LogFile - osd.16 - ceph-osd.16.log: http://slexy.org/view/s25H3Zvkb0 LogFile - osd.28 - ceph-osd.28.log: http://slexy.org/view/s21Ecpwd70 I have checked the disks 4,16 and 28 with smartctl and could not any issues - also there are no odd "dmesg" messages. *ceph -s* cluster 98a410bf-b823-47e4-ad17-4543afa24992 health HEALTH_OK monmap e2: 3 mons at {monitor1=172.16.0.2:6789/0,monitor3=172.16.0.4:6789/0,monitor2=172.16.0.3:6789/0} election epoch 38, quorum 0,1,2 monitor1,monitor2,monitor3 osdmap e7047: 30 osds: 30 up, 30 in flags sortbitwise pgmap v3253519: 1024 pgs, 1 pools, 2858 GB data, 692 kobjects 8577 GB used, 96256 GB / 102 TB avail 1024 active+clean client io 396 kB/s rd, 3141 kB/s wr, 55 op/s rd, 269 op/s wr This is my Setup: *Software/OS* - Jewel #> ceph tell osd.* version | grep version | uniq "version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)" #> ceph tell mon.* version [...] 
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) - Ubuntu 16.04 LTS on all OSD and MON Server #> uname -a Linux galawyn 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux *Server* 3x OSD Server, each with - 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 Cores, no Hyper-Threading - 64GB RAM - 10x 4TB HGST 7K4000 SAS2 (6GB/s) Disks as OSDs - 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as Journaling Device for 10-12 Disks - 1x Samsung SSD 840/850 Pro only for the OS 3x MON Server - Two of them with 1x Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz (4 Cores, 8 Threads) - The third one has 2x Intel(R) Xeon(R) CPU L5430 @ 2.66GHz ==> 8 Cores, no Hyper-Threading - 32 GB RAM - 1x Raid 10 (4 Disks) *Network* - Each Server and Client has an active connection @ 1x 10GB; A second connection is also connected via 10GB but provides only a Backup connection when the active Switch fails - no LACP possible. - We do not use Jumbo Frames yet.. - Public and Cluster-Network related Ceph traffic is going through this one active 10GB Interface on each Server. Any i
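To be specific about what to watch in those outputs (sdX standing in for the disks behind osd.16 and osd.28): on SATA drives the attribute table is the place to look, while on SAS drives like these HGSTs the grown defect list and the error counter logs are the equivalents.

  smartctl -A /dev/sdX        # SATA attribute table: Raw_Read_Error_Rate, Reallocated_Sector_Ct,
                              # Current_Pending_Sector, Offline_Uncorrectable, UDMA_CRC_Error_Count
  smartctl -a /dev/sdX        # SAS: shows the grown defect list and read/write/verify error counters
  smartctl -l error /dev/sdX  # the drive's own error log, often more telling than the overall health flag
  smartctl -t long /dev/sdX   # kick off the long self-test (same as the --test=long form)

A rising pending-sector or grown-defect count with an otherwise clean health status is the classic marginal-sector signature.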
Re: [ceph-users] what happen to the OSDs if the OS disk dies?
If all the system disk does is handle the o/s (ie osd journals are on dedicated or osd drives as well), no problem. Just rebuild the system and copy the ceph.conf back in when you re-install ceph. Keep a spare copy of your original fstab to keep your osd filesystem mounts straight. Just keep in mind that you are down 11 osds while that system drive gets rebuilt though. It's safer to do 10 osds and then have a mirror set for the system disk. Bill Sharer On 08/12/2016 03:33 PM, Ronny Aasen wrote: On 12.08.2016 13:41, Félix Barbeira wrote: Hi, I'm planning to make a ceph cluster but I have a serious doubt. At this moment we have ~10 servers DELL R730xd with 12x4TB SATA disks. The official ceph docs says: "We recommend using a dedicated drive for the operating system and software, and one drive for each Ceph OSD Daemon you run on the host." I could use for example 1 disk for the OS and 11 for OSD data. In the operating system I would run 11 daemons to control the OSDs. But...what happen to the cluster if the disk with the OS fails?? maybe the cluster thinks that 11 OSD failed and try to replicate all that data over the cluster...that sounds no good. Should I use 2 disks for the OS making a RAID1? in this case I'm "wasting" 8TB only for ~10GB that the OS needs. In all the docs that i've been reading says ceph has no unique single point of failure, so I think that this scenario must have a optimal solution, maybe somebody could help me. Thanks in advance. -- Félix Barbeira. if you do not have dedicated slots on the back for OS disks, then i would recomend using SATADOM flash modules directly into a SATA port internal in the machine. Saves you 2 slots for osd's and they are quite reliable. you could even use 2 sd cards if your machine have the internal SD slot http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
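A sketch of the rebuild path for the "plain OS disk, no journals on it" case, since it comes up a lot. Device names, mountpoints and the init command are placeholders; the key points are noout, ceph.conf and the saved fstab entries mentioned above.

  ceph osd set noout                  # from another node, before the host goes down, so nothing rebalances
  # ...reinstall the OS and the ceph packages on the new disk...
  # restore /etc/ceph/ceph.conf (and keyrings) from another node or a backup
  mount /dev/sdb1 /var/lib/ceph/osd/ceph-11    # remount each OSD data filesystem per the saved fstab
  /etc/init.d/ceph start                       # start the osd daemons again
  ceph osd unset noout                         # once they're back up and in

With noout set, the cluster just sees those OSDs as down for the duration instead of kicking off a full re-replication of 11 OSDs' worth of data.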
[ceph-users] assertion error trying to start mds server
I've been in the process of updating my gentoo based cluster both with new hardware and a somewhat postponed update. This includes some major stuff including the switch from gcc 4.x to 5.4.0 on existing hardware and using gcc 6.4.0 to make better use of AMD Ryzen on the new hardware. The existing cluster was on 10.2.2, but I was going to 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin transitioning to bluestore on the osd's. The Ryzen units are slated to be bluestore based OSD servers if and when I get to that point. Up until the mds failure, they were simply cephfs clients. I had three OSD servers updated to 10.2.7-r1 (one is also a MON) and had two servers left to update. Both of these are also MONs and were acting as a pair of dual active MDS servers running 10.2.2. Monday morning I found out the hard way that an UPS one of them was on has a dead battery. After I fsck'd and came back up, I saw the following assertion error when it was trying to start it's mds.B server: mdsbeacon(64162/B up:replay seq 3 v4699) v7 126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d 8d0 0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::r eplay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972 mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv) ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55f93d64a122] 2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce] 3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34] 4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d] 5: (()+0x74a4) [0x7f6fd009b4a4] 6: (clone()+0x6d) [0x7f6fce5a598d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 newstore 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 kinetic 1/ 5 fuse -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mds.B.log When I was googling around, I ran into this Cern presentation and tried out the offline backware scrubbing commands on slide 25 first: https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf Both ran without any messages, so I'm assuming I have sane contents in the cephfs_data and cephfs_metadata pools. Still no luck getting things restarted, so I tried the cephfs-journal-tool journal reset on slide 23. That didn't work either. Just for giggles, I tried setting up the two Ryzen boxes as new mds.C and mds.D servers which would run on 10.2.7-r1 instead of using mds.A and mds.B (10.2.2). 
The D server fails with the same assert as follows: === 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con 0x7fffe0013310 0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fffd99f5700 time 2017-10-09 13:01:31.570608 mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv) ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55b7ebc8] 2: (EImportStart::replay(MDSRank*)+0x9ea) [0x55a5674a] 3: (MDLog::_replay_thread()+0xe51) [0x559cef21] 4: (MDLog::ReplayThread::entry()+0xd) [0x557778cd] 5: (()+0x7364) [0x77bc5364] 6: (clone()+0x6d) [0x76051ccd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
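In case it saves someone else some googling, the usual offline journal-recovery order (from the same Cern slides and the upstream disaster recovery docs) is roughly the following. This is a sketch, not a claim that it resolves this particular sessionmap assert:

  cephfs-journal-tool journal export backup.bin          # keep a copy before touching anything
  cephfs-journal-tool event recover_dentries summary     # replay whatever metadata can be salvaged from the journal
  cephfs-journal-tool journal reset
  cephfs-table-tool all reset session                    # throw away stale client sessions

The session table reset is the step most directly related to a sessionmap version mismatch, though whether it applies to this crash is another question.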
Re: [ceph-users] assertion error trying to start mds server
I was wondering if I can't get the second mds back up That offline backward scrub check sounds like it should be able to also salvage what it can of the two pools to a normal filesystem. Is there an option for that or has someone written some form of salvage tool? On 10/11/2017 07:07 AM, John Spray wrote: > On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer wrote: >> I've been in the process of updating my gentoo based cluster both with >> new hardware and a somewhat postponed update. This includes some major >> stuff including the switch from gcc 4.x to 5.4.0 on existing hardware >> and using gcc 6.4.0 to make better use of AMD Ryzen on the new >> hardware. The existing cluster was on 10.2.2, but I was going to >> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin >> transitioning to bluestore on the osd's. >> >> The Ryzen units are slated to be bluestore based OSD servers if and when >> I get to that point. Up until the mds failure, they were simply cephfs >> clients. I had three OSD servers updated to 10.2.7-r1 (one is also a >> MON) and had two servers left to update. Both of these are also MONs >> and were acting as a pair of dual active MDS servers running 10.2.2. >> Monday morning I found out the hard way that an UPS one of them was on >> has a dead battery. After I fsck'd and came back up, I saw the >> following assertion error when it was trying to start it's mds.B server: >> >> >> mdsbeacon(64162/B up:replay seq 3 v4699) v7 126+0+0 (709014160 >> 0 0) 0x7f6fb4001bc0 con 0x55f94779d >> 8d0 >> 0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In >> function 'virtual void EImportStart::r >> eplay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972 >> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv) >> >> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x82) [0x55f93d64a122] >> 2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce] >> 3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34] >> 4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d] >> 5: (()+0x74a4) [0x7f6fd009b4a4] >> 6: (clone()+0x6d) [0x7f6fce5a598d] >> NOTE: a copy of the executable, or `objdump -rdS ` is >> needed to interpret this. 
>> >> --- logging levels --- >>0/ 5 none >>0/ 1 lockdep >>0/ 1 context >>1/ 1 crush >>1/ 5 mds >>1/ 5 mds_balancer >>1/ 5 mds_locker >>1/ 5 mds_log >>1/ 5 mds_log_expire >>1/ 5 mds_migrator >>0/ 1 buffer >>0/ 1 timer >>0/ 1 filer >>0/ 1 striper >>0/ 1 objecter >>0/ 5 rados >>0/ 5 rbd >>0/ 5 rbd_mirror >>0/ 5 rbd_replay >>0/ 5 journaler >>0/ 5 objectcacher >>0/ 5 client >>0/ 5 osd >>0/ 5 optracker >>0/ 5 objclass >>1/ 3 filestore >>1/ 3 journal >>0/ 5 ms >>1/ 5 mon >>0/10 monc >>1/ 5 paxos >>0/ 5 tp >>1/ 5 auth >>1/ 5 crypto >>1/ 1 finisher >>1/ 5 heartbeatmap >>1/ 5 perfcounter >>1/ 5 rgw >>1/10 civetweb >>1/ 5 javaclient >>1/ 5 asok >>1/ 1 throttle >>0/ 0 refs >>1/ 5 xio >>1/ 5 compressor >>1/ 5 newstore >>1/ 5 bluestore >>1/ 5 bluefs >>1/ 3 bdev >>1/ 5 kstore >>4/ 5 rocksdb >>4/ 5 leveldb >>1/ 5 kinetic >>1/ 5 fuse >> -2/-2 (syslog threshold) >> -1/-1 (stderr threshold) >> max_recent 1 >> max_new 1000 >> log_file /var/log/ceph/ceph-mds.B.log >> >> >> >> When I was googling around, I ran into this Cern presentation and tried >> out the offline backware scrubbing commands on slide 25 first: >> >> https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf >> >> >> Both ran without any messages, so I'm assuming I have sane contents in >> the cephfs_data and cephfs_metadata pools. Still no luck getting things >> restarted, so I tried the cephfs-journal-tool journal reset on slide >> 23. That didn't work either. Just for giggles, I tried setting up the >> two Ryzen boxes as new mds.C and mds.D servers which would run on >> 10.2.7-r1 instead of using mds.A and mds.B (10.2.2). The D server fails >> with the
Re: [ceph-users] assertion error trying to start mds server
After your comment about the dual mds servers I decided to just give up trying to get the second restarted. After eyeballing what I had on one of the new Ryzen boxes for drive space, I decided to just dump the filesystem. That will also make things go faster if and when I flip everything over to bluestore. So far so good... I just took a peek and saw the files being owned by Mr root though. Is there going to be an ownership reset at some point or will I have to resolve that by hand? On 10/12/2017 06:09 AM, John Spray wrote: > On Thu, Oct 12, 2017 at 12:23 AM, Bill Sharer wrote: >> I was wondering if I can't get the second mds back up That offline >> backward scrub check sounds like it should be able to also salvage what >> it can of the two pools to a normal filesystem. Is there an option for >> that or has someone written some form of salvage tool? > Yep, cephfs-data-scan can do that. > > To scrape the files out of a CephFS data pool to a local filesystem, do this: > cephfs-data-scan scan_extents # this is discovering > all the file sizes > cephfs-data-scan scan_inodes --output-dir /tmp/my_output > > The time taken by both these commands scales linearly with the number > of objects in your data pool. > > This tool may not see the correct filename for recently created files > (any file whose metadata is in the journal but not flushed), these > files will go into a lost+found directory, named after their inode > number. > > John > >> On 10/11/2017 07:07 AM, John Spray wrote: >>> On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer wrote: >>>> I've been in the process of updating my gentoo based cluster both with >>>> new hardware and a somewhat postponed update. This includes some major >>>> stuff including the switch from gcc 4.x to 5.4.0 on existing hardware >>>> and using gcc 6.4.0 to make better use of AMD Ryzen on the new >>>> hardware. The existing cluster was on 10.2.2, but I was going to >>>> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin >>>> transitioning to bluestore on the osd's. >>>> >>>> The Ryzen units are slated to be bluestore based OSD servers if and when >>>> I get to that point. Up until the mds failure, they were simply cephfs >>>> clients. I had three OSD servers updated to 10.2.7-r1 (one is also a >>>> MON) and had two servers left to update. Both of these are also MONs >>>> and were acting as a pair of dual active MDS servers running 10.2.2. >>>> Monday morning I found out the hard way that an UPS one of them was on >>>> has a dead battery. 
>>>> After I fsck'd and came back up, I saw the >>>> following assertion error when it was trying to start its mds.B server: >>>> >>>> >>>> mdsbeacon(64162/B up:replay seq 3 v4699) v7 126+0+0 (709014160 >>>> 0 0) 0x7f6fb4001bc0 con 0x55f94779d >>>> 8d0 >>>> 0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In >>>> function 'virtual void EImportStart::r >>>> eplay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972 >>>> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv) >>>> >>>> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) >>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>>> const*)+0x82) [0x55f93d64a122] >>>> 2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce] >>>> 3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34] >>>> 4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d] >>>> 5: (()+0x74a4) [0x7f6fd009b4a4] >>>> 6: (clone()+0x6d) [0x7f6fce5a598d] >>>> NOTE: a copy of the executable, or `objdump -rdS ` is >>>> needed to interpret this. >>>> >>>> --- logging levels --- >>>>0/ 5 none >>>>0/ 1 lockdep >>>>0/ 1 context >>>>1/ 1 crush >>>>1/ 5 mds >>>>1/ 5 mds_balancer >>>>1/ 5 mds_locker >>>>1/ 5 mds_log >>>>1/ 5 mds_log_expire >>>>1/ 5 mds_migrator >>>>0/ 1 buffer >>>>0/ 1 timer >>>>0/ 1 filer >>>>0/ 1 striper >>>>0/ 1 objecter >>>>0/ 5 rados >>>>0/ 5 rbd >>>>0/ 5 rbd_mirror >>>>0/ 5 rbd_replay >>>>0/ 5 journaler >>>>0/ 5 objectcacher >>>>0/ 5 client >>>>0/
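For the record, the two scan commands John lists take the data pool name as a trailing argument; the list archive appears to have eaten it along with the angle brackets. A minimal sketch, assuming the pool is the cephfs_data pool mentioned earlier in the thread and that /tmp/my_output has enough free space:

# pass 1: walk the data pool objects to discover file sizes and layouts
cephfs-data-scan scan_extents cephfs_data
# pass 2: rebuild the files, writing them out to a local directory instead of the metadata pool
cephfs-data-scan scan_inodes --output-dir /tmp/my_output cephfs_data

As John notes, both passes touch every object in the data pool, so the run time scales with the object count.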
[ceph-users] Infernalis OSD errored out on journal permissions without mentioning anything in its log
This took a little head-scratching until I figured out why my osd daemons were not restarting under Infernalis on Gentoo. I had just upgraded from Hammer to Infernalis and had reset ownership from root:root to ceph:ceph on the files of each OSD in /var/lib/ceph/osd/ceph-n. However, I forgot to take into account the ownership of the journals, which I have set up as raw partitions. Under Gentoo, I needed to put the ceph user into the "disk" group to allow it to have write access to the device files. The osd startup init script reported an OK status, but the actual osd process would exit without writing anything to its /var/log/ceph/ceph-osd.n.log. I would have thought there might have been some sort of permission error logged, but nope :-) Bill Sharer ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
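A minimal sketch of the two usual fixes, assuming the journal lives on a raw partition such as /dev/sda6 (the device name is just an example):

# option 1: the quick fix described above, put the ceph user in Gentoo's disk group
usermod -a -G disk ceph
# option 2: hand only the journal partition to ceph:ceph with a udev rule,
# e.g. in /etc/udev/rules.d/70-ceph-journal.rules (filename is arbitrary):
#   KERNEL=="sda6", OWNER="ceph", GROUP="ceph", MODE="0660"
# then reload and re-trigger udev before restarting the osd
udevadm control --reload && udevadm trigger

The group route is the one-liner; the udev rule is narrower because it avoids giving the ceph user write access to every block device on the box.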
Re: [ceph-users] clock skew
If you are just syncing to the outside pool, the three hosts may end up latching on to different outside servers as their definitive sources. You might want to make one of the three a higher-priority source for the other two, and possibly have only that one sync against the outside servers. Also, for hardware newer than about five years old, you might want to look at enabling the NIC hardware clocks with LinuxPTP to keep clock jitter down inside your LAN. I wrote this article on the Gentoo wiki on enabling PTP in chrony: https://wiki.gentoo.org/wiki/Chrony_with_hardware_timestamping Bill Sharer On 4/25/19 6:33 AM, mj wrote: Hi all, On our three-node cluster, we have set up chrony for time sync, and even though chrony reports that it is synced to ntp time, at the same time ceph occasionally reports time skews that can last several hours. See for example: root@ceph2:~# ceph -v ceph version 12.2.10 (fc2b1783e3727b66315cc667af9d663d30fe7ed4) luminous (stable) root@ceph2:~# ceph health detail HEALTH_WARN clock skew detected on mon.1 MON_CLOCK_SKEW clock skew detected on mon.1 mon.1 addr 10.10.89.2:6789/0 clock skew 0.506374s > max 0.5s (latency 0.000591877s) root@ceph2:~# chronyc tracking Reference ID : 7F7F0101 () Stratum : 10 Ref time (UTC) : Wed Apr 24 19:05:28 2019 System time : 0.00133 seconds slow of NTP time Last offset : -0.00524 seconds RMS offset : 0.00524 seconds Frequency : 12.641 ppm slow Residual freq : +0.000 ppm Skew : 0.000 ppm Root delay : 0.00 seconds Root dispersion : 0.00 seconds Update interval : 1.4 seconds Leap status : Normal root@ceph2:~# For the record: mon.1 = ceph2 = 10.10.89.2, and time is synced similarly with NTP on the two other nodes. We don't understand this... I have now injected mon_clock_drift_allowed 0.7, so at least we have HEALTH_OK again. (to stop upsetting my monitoring system) But two questions: - can anyone explain why this is happening? It looks as if ceph and NTP/chrony disagree on just how time-synced the servers are... - how to determine the current clock skew from ceph's perspective? Because "ceph health detail" in case of HEALTH_OK does not show it. (I want to start monitoring it continuously, to see if I can find some sort of pattern) Thanks! MJ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
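A sketch of the chrony layout Bill suggests, assuming ceph1 (10.10.89.1) is chosen as the preferred internal source; the address, subnet and pool hostname below are guesses based on the 10.10.89.x addresses in the thread:

# /etc/chrony/chrony.conf on ceph1: follow the outside pool, serve time to the LAN
pool 2.pool.ntp.org iburst
allow 10.10.89.0/24
# hwtimestamp needs a NIC with hardware timestamping support
hwtimestamp *

# /etc/chrony/chrony.conf on ceph2 and ceph3: prefer ceph1 so the monitors agree with each other first
server 10.10.89.1 iburst prefer
pool 2.pool.ntp.org iburst
hwtimestamp *

Whether ceph2 and ceph3 keep the outside pool as a fallback is a judgment call; the point is that all three monitors chase the same primary source, which is what keeps the mon clock-skew check quiet.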
Re: [ceph-users] Ceph + SAMBA (vfs_ceph)
Your Windows client is failing to authenticate when it tries to mount the share. That could be a simple fix or hideously complicated depending on what type of Windows network you are running in. Is this lab environment using a Windows server running as an Active Directory Domain Controller, or have you just been working with standalone installs of Linux and Windows in your lab? Are your Windows installs simply based on a retail version of Windows Home, or do you have the Pro or Enterprise versions licensed? If you are stuck with a Home-only version or simply want to do ad-hoc stuff without much further ado (probably why you have the SECURITY=USER stanza in your conf), then just look at using smbpasswd to create the password hashes necessary for SMB mounting. This is necessary because Windows and Unix/Linux have different hashing schemes. This Samba wiki link will probably be a good starting point for you: https://wiki.samba.org/index.php/Setting_up_Samba_as_a_Standalone_Server If you are on an Active Directory network, you will end up mucking around in a lot more config files in order to get your Linux boxes to join the directory as members and then authenticate against the domain controllers. That can also be a somewhat simple thing, but it can get hairy if your organization has infosec in mind and has applied hardening procedures. That's when you might be breaking out Wireshark and analyzing the exchanges between Linux and the DC to figure out what sort of insanity is going on in your IT department. If you aren't the domain admin or aren't good friends with one who also knows Unix/Linux, you may never get anywhere. Bill Sharer On 8/28/19 2:32 PM, Salsa wrote: This is the result: # testparm -s Load smb config files from /etc/samba/smb.conf rlimit_max: increasing rlimit_max (1024) to minimum Windows limit (16384) Processing section "[homes]" Processing section "[cephfs]" Processing section "[printers]" Processing section "[print$]" Loaded services file OK. Server role: ROLE_STANDALONE # Global parameters [global] load printers = No netbios name = SAMBA-CEPH printcap name = cups security = USER workgroup = CEPH smbd: backgroundqueue = no idmap config * : backend = tdb cups options = raw valid users = samba ... [cephfs] create mask = 0777 directory mask = 0777 guest ok = Yes guest only = Yes kernel share modes = No path = / read only = No vfs objects = ceph ceph: user_id = samba ceph:config_file = /etc/ceph/ceph.conf I cut off some parts I thought were not relevant. -- Salsa Sent with ProtonMail <https://protonmail.com> Secure Email. ‐‐‐ Original Message ‐‐‐ On Wednesday, August 28, 2019 3:09 AM, Konstantin Shalygin wrote: I'm running a ceph installation in a lab to evaluate for production and I have a cluster running, but I need to mount on different windows servers and desktops. I created an NFS share and was able to mount it on my Linux desktop, but not a Win 10 desktop. Since it seems that Windows Server 2016 is required to mount the NFS share I quit that route and decided to try samba. I compiled a version of Samba that has this vfs_ceph module, but I can't set it up correctly. It seems I'm missing some user configuration as I've hit this error: " ~$ smbclient -U samba.gw //10.17.6.68/cephfs_a WARNING: The "syslog" option is deprecated Enter WORKGROUP\samba.gw's password: session setup failed: NT_STATUS_LOGON_FAILURE " Does anyone know of any good setup tutorial to follow? 
This is my smb config so far: # Global parameters [global] load printers = No netbios name = SAMBA-CEPH printcap name = cups security = USER workgroup = CEPH smbd: backgroundqueue = no idmap config * : backend = tdb cups options = raw valid users = samba [cephfs] create mask = 0777 directory mask = 0777 guest ok = Yes guest only = Yes kernel share modes = No path = / read only = No vfs objects = ceph ceph: user_id = samba ceph:config_file = /etc/ceph/ceph.conf Thanks Your configuration seems correct, but the conf may or may not have special characters such as stray spaces or lower-cased options. First, run `testparm -s` and paste the output here. k ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
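To flesh out the smbpasswd route Bill describes, a minimal sketch on the Samba host, assuming the account is the 'samba' user from the valid users line (the nologin path varies by distro):

# the SMB account needs a matching Unix account first; no login shell required
useradd -M -s /usr/sbin/nologin samba
# create and enable the separate NT password hash that Windows authenticates against
smbpasswd -a samba
smbpasswd -e samba
# quick check from the Linux side before trying the Windows client
smbclient -U samba //10.17.6.68/cephfs

Also note that the failing mount used the user samba.gw and the share cephfs_a, neither of which shows up in the testparm output above, so it is worth double-checking those against what the server actually exports (unless they were in the parts cut from the paste).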