Re: [ceph-users] Help! OSD host failure - recovery without rebuilding OSDs
Hi,

Someone here will probably lay out a detailed answer, but to get you started: all the details for the OSDs are in the XFS partitions. Image a new USB key, change the IP, etc., and you should be able to recover. If the journal is linked to a /dev/sdX device, make sure it's in the same spot as it was before.

All the best of luck,
/Josef

On 25 Dec 2015 05:39, "deeepdish" wrote:
> Hello,
>
> Had an interesting issue today.
>
> My OSD hosts are booting off a USB key which, you guessed it, has the
> root partition on there. All OSDs are mounted. My USB key failed on one
> of my OSD hosts, leaving the data on its OSDs inaccessible to the rest
> of my cluster. I have multiple monitors running, and other OSD hosts
> where data can be recovered to. However, I'm wondering if there's a way
> to "restore" / "rebuild" the Ceph install that was on this host without
> having all its OSDs resync again.
>
> Lesson learned = don't use USB boot/root drives. However, now just
> looking at what needs to be done once the OS and Ceph packages are
> reinstalled.
>
> Thank you.
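For the archive, a minimal sketch of the recovery Josef describes, assuming hammer-era ceph-disk tooling and the stock /var/lib/ceph layout (the exact commands are illustrative, not from the original thread; adjust to your install):

    # Keep the cluster from rebalancing while the host is rebuilt
    # (run from any node with an admin keyring):
    ceph osd set noout

    # After reinstalling the OS and the same Ceph release, copy
    # /etc/ceph/ceph.conf and the host's keyrings back from another
    # node, then let ceph-disk find and mount the intact XFS data
    # partitions:
    ceph-disk activate-all

    # Verify each OSD's journal symlink still points at the same
    # device it did before the failure:
    ls -l /var/lib/ceph/osd/ceph-*/journal

    # Once the OSDs are back up and in:
    ceph osd unset noout

Because the OSD data partitions were never lost, the OSDs should rejoin with only a log-based recovery of writes they missed, rather than a full resync.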
Re: [ceph-users] why not add (offset,len) to pglog
Thank you for your reply. I am looking forward to Sage's opinion too @sage. I'll also keep up with BlueStore's and KStore's progress.

Regards

2015-12-25 14:48 GMT+08:00 Ning Yao :
> Hi, Dong Wu,
>
> 1. As I am currently working on other things, this proposal has been
> abandoned for a long time.
> 2. This is a complicated task, as we need to consider a lot (not just
> writeOp, but also truncate, delete) and also the different effects on
> different backends (Replicated, EC).
> 3. I don't think it is a good time to redo this patch now, since
> BlueStore and KStore are in progress, and I'm afraid of introducing
> side-effects. We may prepare and propose the whole design at the next
> CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttling the max recovery ops, setting the priority for recovery,
> and so on). So this kind of patch may not solve a critical problem,
> just make things better, and I am not quite sure that it will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disks (say HDDs), and also for short-time
> maintenance. Otherwise, it will trigger the backfill process. So wait
> for Sage's opinion @sage.
>
> If you are interested in this, we may cooperate to do it.
>
> Regards
> Ning Yao
>
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
>> Thanks, from this pull request I learned that this work is not
>> complete; is there any new progress on it?
>>
>> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
>>> Yeah, this is a good idea for recovery, but not for backfill.
>>> @YaoNing opened a pull request about this earlier this year:
>>> https://github.com/ceph/ceph/pull/3837
>>>
>>> 2015-12-25 11:16 GMT+08:00 Dong Wu :
>>>> Hi,
>>>> I have a doubt about the pglog. The pglog contains
>>>> (op, object, version) etc. When peering, the pglog is used to build
>>>> the missing list, and then each object in the missing list is
>>>> recovered whole, even if the data that differs among replicas is
>>>> much less than a whole object (e.g., 4MB).
>>>> Why not add (offset, len) to the pglog? If so, the missing list
>>>> could contain (object, offset, len), and we could reduce the amount
>>>> of data recovered.
>>>
>>>
>>> --
>>> Regards,
>>> Xinze Chi
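As a concrete illustration of the recovery throttles Ning Yao mentions in point 4 above, a sketch using hammer-era option names (the values are examples only; defaults differ by release):

    # Slow recovery/backfill down so client I/O keeps priority;
    # injected at runtime on all OSDs:
    ceph tell osd.* injectargs \
        '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

These reduce the impact of recovery rather than the amount of data recovered, which is exactly why the (offset, len) idea is attractive.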
Re: [ceph-users] why not add (offset,len) to pglog
On Fri, 25 Dec 2015, Ning Yao wrote:
> Hi, Dong Wu,
>
> 1. As I am currently working on other things, this proposal has been
> abandoned for a long time.
> 2. This is a complicated task, as we need to consider a lot (not just
> writeOp, but also truncate, delete) and also the different effects on
> different backends (Replicated, EC).
> 3. I don't think it is a good time to redo this patch now, since
> BlueStore and KStore are in progress, and I'm afraid of introducing
> side-effects. We may prepare and propose the whole design at the next
> CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttling the max recovery ops, setting the priority for recovery,
> and so on). So this kind of patch may not solve a critical problem,
> just make things better, and I am not quite sure that it will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disks (say HDDs), and also for short-time
> maintenance. Otherwise, it will trigger the backfill process. So wait
> for Sage's opinion @sage.
>
> If you are interested in this, we may cooperate to do it.

I think it's a great idea. We didn't do it before only because it is complicated. The good news is that if we can't conclusively infer exactly which parts of the object need to be recovered from the log entry, we can always just fall back to recovering the whole thing.

Also, the place where this is currently most visible is RBD small writes:

 - osd goes down
 - client sends a 4k overwrite and modifies an object
 - osd comes back up
 - client sends another 4k overwrite
 - client io blocks while osd recovers 4mb

So even if we initially ignore truncate and omap and EC and clones and anything else complicated, I suspect we'll get a nice benefit.

I haven't thought about this too much, but my guess is that the hard part is making the primary's missing set representation include a partial delta (say, an interval_set<> indicating which ranges of the file have changed) in a way that gracefully degrades to recovering the whole object if we're not sure.

In any case, we should definitely have the design conversation!

sage

>
> Regards
> Ning Yao
>
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
> > Thanks, from this pull request I learned that this work is not
> > complete; is there any new progress on it?
> >
> > 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
> >> Yeah, this is a good idea for recovery, but not for backfill.
> >> @YaoNing opened a pull request about this earlier this year:
> >> https://github.com/ceph/ceph/pull/3837
> >>
> >> 2015-12-25 11:16 GMT+08:00 Dong Wu :
> >>> Hi,
> >>> I have a doubt about the pglog. The pglog contains
> >>> (op, object, version) etc. When peering, the pglog is used to build
> >>> the missing list, and then each object in the missing list is
> >>> recovered whole, even if the data that differs among replicas is
> >>> much less than a whole object (e.g., 4MB).
> >>> Why not add (offset, len) to the pglog? If so, the missing list
> >>> could contain (object, offset, len), and we could reduce the amount
> >>> of data recovered.
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Xinze Chi
Re: [ceph-users] Tuning ZFS + QEMU/KVM + Ceph RBD’s
Due to the nature of distributed storage, and of a filesystem built to distribute itself across sequential devices, you're always going to have poor performance. Are you unable to use XFS inside the VM?

----- Original Message -----
From: "J David"
To: ceph-users@lists.ceph.com
Sent: Thursday, December 24, 2015 1:10:36 PM
Subject: [ceph-users] Tuning ZFS + QEMU/KVM + Ceph RBD’s

For a variety of reasons, a ZFS pool in a QEMU/KVM virtual machine backed by a Ceph RBD doesn’t perform very well. Does anyone have any tuning tips (on either side) for this workload?

A fair amount of the problem is probably related to two factors.

First, ZFS always assumes it is talking to bare-metal drives. This assumption is baked into it at a very fundamental level, but Ceph is pretty much the polar opposite of that. For one thing, this makes any type of write caching moderately terrifying from a data-loss standpoint. Although, to be fair, we run some non-critical KVM VMs with ZFS filesystems and cache=writeback with no observed ill effects. From the available information, it *seems* safe to do that, but it’s not certain whether, under enough stress and the wrong crash at the wrong moment, a lost/corrupted pool would be the result. ZFS is notorious for exploding if the underlying subsystem lies to it about whether data has been permanently written to disk (that bare-metal assumption again); it’s not an area that encourages pressing one’s luck.

The second issue is that ZFS likes a huge recordsize. It uses small blocks for small files, but as soon as a file grows a little bit, it is happy to use 128KiB blocks (again assuming it’s talking to a physical disk that can do a sequential read of a whole block with minimal added overhead, because the head was already there for the first byte, and what’s a little wasted bandwidth on a 6Gbps SAS bus that has nothing else to do). Ceph, on the other hand, *always* has something else to do, so a 128K read-modify-write cycle to change one byte in the middle of a file winds up being punishingly wasteful.

The RBD striping explanation (on http://docs.ceph.com/docs/hammer/man/8/rbd/ ) seems to suggest that the default object size is 4M, so at least a single 128K read/write should only hit one or (at most) two objects. Whether it’s one or two seems to depend on whether ZFS has a useful interpretation of track size, which it may not. One such virtual machine reports, for a 1TB Ceph image, 62 sectors of 512 bytes per track (31KiB tracks). That could lead to a fair number of object-straddling reads and writes at a 128K record size.

So the main impact of that is massive write amplification; writing one byte can turn into reading and writing 128K from/to 2-6 different OSDs. All of which winds up passing over the storage LAN, introducing tons of latency compared to that hypothetical 6Gbps SAS read that ZFS is designed to expect.
If it helps establish a baseline, the reason this subject comes up is that currently ZFS filesystems on RBD-backed QEMU VMs do stuff like this (iostat -x at 10-second intervals):

    Device: rrqm/s wrqm/s    r/s    w/s  rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
    vdc       0.00   0.00  41.00  31.30  86.35   4006.40   113.22     2.07  27.18   24.10   31.22 13.82 99.92
    vdc       0.00   0.00 146.30  38.10 414.95   4876.80    57.39     2.46  13.64   10.36   26.25  5.42 99.96
    vdc       0.00   0.00 127.30 102.20 256.40  13081.60   116.24     2.07   9.19    8.57    9.97  4.35 99.88
    vdc       0.00   0.00 160.80 160.70 297.30  10592.80    67.75     1.21   3.76    1.73    5.78  2.91 93.68

That’s… not great… for a low-load 10G-LAN Ceph cluster with 60 Intel DC S37x0 SSDs.

Is there some tuning that could be done (on any side: ZFS, QEMU, or Ceph) to optimize performance? Are there any metrics we could collect to gain more insight into what and where the bottlenecks are?

Some combination of changing the ZFS max recordsize, the QEMU virtual disk geometry, and Ceph backend settings seems like it might make a big difference, but there are many combinations, and it feels like guesswork with the available information. So it seems worthwhile to ask if anyone has been down this road, and if so what they found, before spending a week or two rediscovering the wheel.

Thanks for any advice!
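One hedged starting point, pulling together the knobs discussed in the thread. The pool/image names and sizes are placeholders, and the recordsize/--order values are illustrative, not tested recommendations:

    # Cap ZFS's largest block to something closer to the RBD write
    # path; affects newly written files only:
    zfs set recordsize=16K tank/vmdata

    # Hammer-era rbd lets you pick the object size at image creation;
    # --order 20 means 2^20 = 1M objects instead of the 4M default
    # (--size is in MB here, so 1048576 = 1TB):
    rbd create --size 1048576 --order 20 rbd/zfs-vol

    # RBD client-side writeback caching (ceph.conf on the hypervisor);
    # the cache honors guest flushes, but test your failure modes:
    #   [client]
    #   rbd cache = true
    #   rbd cache writethrough until flush = true

    # And in the libvirt disk definition:
    #   <driver name='qemu' type='raw' cache='writeback'/>

A smaller recordsize trades away ZFS's sequential-read win (which doesn't exist on RBD anyway) to shrink the read-modify-write unit, so it attacks the write amplification described above directly.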
Re: [ceph-users] nfs over rbd problem
I didn't read the whole thing, but if you're trying to do HA NFS, you need to run OCFS2 on your RBD and disable read/write caching on the RBD client.

----- Original Message -----
From: "Steve Anthony"
To: ceph-users@lists.ceph.com
Sent: Friday, December 25, 2015 12:39:01 AM
Subject: Re: [ceph-users] nfs over rbd problem

I've run into many problems trying to run RBD/NFS under Pacemaker like you describe on a two-node cluster. In my case, most of the problems were a result of a) no quorum and b) no STONITH.

If you're going to be running this setup in production, I *highly* recommend adding more nodes (if only to maintain quorum). Without quorum, you will run into split-brain situations where you'll see anything from both nodes starting the same services to configuration changes disappearing when the nodes can't agree on a recent revision.

In this case, it looks like a STONITH problem. Without a fencing mechanism, node2 cannot be 100% certain that node1 is dead. When you manually shut down corosync, I suspect that as part of the process, node1 alerts the cluster that it's leaving as part of a planned shutdown. When it just disappears, node2 can't determine what happened; it just knows node1 isn't answering anymore. Node1 is not necessarily down; there could just be a network problem between node1 and node2. In such a scenario, it would be very bad for node2 to map/mount the RBDs and start writing data while node1 is still providing the same service. Quorum can help here too, but for node2 to be certain node1 is down, it needs an out-of-band method to force power off node1; e.g., I use IPMI on a separate physical network.

So... add more nodes as quorum members. You can set weight restrictions on the resource groups to prevent them from running on these nodes if they are not as powerful as the nodes you're using now. Then add a STONITH mechanism on all the nodes, and verify it works (a sketch follows at the end of this message). Once you do that, you should see things act the way you expect.

Good luck!

-Steve

On 12/19/2015 03:46 AM, maoqi1982 wrote:

Hi list,

I have a test Ceph cluster of 3 nodes (node0: mon; node1: osd and nfs server1; node2: osd and nfs server2). OS: CentOS 6.6; kernel: 3.10.94-1.el6.elrepo.x86_64; ceph version 0.94.5. I followed the http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/ instructions to set up an active/standby NFS environment.

When using "service corosync stop" or "poweroff" on node1, the failover went fine (the NFS server was taken over by node2). But when I tested cutting off the power of node1, the switch failed.

1. Initial state:

    [root@node1 ~]# crm status
    Last updated: Fri Dec 18 17:14:19 2015
    Last change: Fri Dec 18 17:13:29 2015
    Stack: classic openais (with plugin)
    Current DC: node1 - partition with quorum
    Version: 1.1.11-97629de
    2 Nodes configured, 3 expected votes
    8 Resources configured

    Online: [ node1 node2 ]

    Resource Group: g_rbd_share_1
        p_rbd_map_1    (ocf::ceph:rbd.in):          Started node1
        p_fs_rbd_1     (ocf::heartbeat:Filesystem): Started node1
        p_export_rbd_1 (ocf::heartbeat:exportfs):   Started node1
        p_vip_1        (ocf::heartbeat:IPaddr):     Started node1
    Clone Set: clo_nfs [g_nfs]
        Started: [ node1 node2 ]

2. Stop corosync on node1:
    [root@node1 ~]# service corosync stop

    [root@node2 cluster]# crm status
    Last updated: Fri Dec 18 17:14:59 2015
    Last change: Fri Dec 18 17:13:29 2015
    Stack: classic openais (with plugin)
    Current DC: node2 - partition WITHOUT quorum
    Version: 1.1.11-97629de
    2 Nodes configured, 3 expected votes
    8 Resources configured

    Online: [ node2 ]
    OFFLINE: [ node1 ]

    Resource Group: g_rbd_share_1
        p_rbd_map_1    (ocf::ceph:rbd.in):          Started node2
        p_fs_rbd_1     (ocf::heartbeat:Filesystem): Started node2
        p_export_rbd_1 (ocf::heartbeat:exportfs):   Started node2
        p_vip_1        (ocf::heartbeat:IPaddr):     Started node2
    Clone Set: clo_nfs [g_nfs]
        Started: [ node2 ]
        Stopped: [ node1 ]

3. Cut off node1's power manually:

    [root@node2 cluster]# crm status
    Last updated: Fri Dec 18 17:23:06 2015
    Last change: Fri Dec 18 17:13:29 2015
    Stack: classic openais (with plugin)
    Current DC: node2 - partition WITHOUT quorum
    Version: 1.1.11-97629de
    2 Nodes configured, 3 expected votes
    8 Resources configured

    Online: [ node2 ]
    OFFLINE: [ node1 ]

    Clone Set: clo_nfs [g_nfs]
        Started: [ node2 ]
        Stopped: [ node1 ]

    Failed actions:
        p_rbd_map_1_start_0 on node2 'unknown error' (1): call=48,
        status=Timed Out, last-rc-change='Fri Dec 18 17:22:19 2015',
        queued=0ms, exec=20002ms

corosync.log:

    Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 668: memb=1, new=0, lost=1
    Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: memb: node2 1211279552
    Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: lost: node1 1194502336
    Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 668: memb=1, new=0, lost=0
    Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: MEMB: node2 1211279552
    Dec 18 17:22:19 corosync [pcmk ]
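The sketch Steve refers to: a rough outline of IPMI-based STONITH in crmsh syntax. The agent name, addresses, and credentials are placeholders; check which fencing agents your packages actually ship before copying any of this:

    # One fencing primitive per node, constrained so a node never
    # runs its own fencing device:
    crm configure primitive p_fence_node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=10.10.10.1 userid=admin \
               passwd=secret interface=lan \
        op monitor interval=60s
    crm configure location l_fence_node1 p_fence_node1 -inf: node1

    # (repeat for node2, then enable fencing)
    crm configure property stonith-enabled=true

    # With a third quorum node added, stop ignoring quorum loss:
    crm configure property no-quorum-policy=stop

With this in place, node2 power-cycles node1 via its BMC before taking over the RBD map, which is what was missing in the power-cut test above.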
Re: [ceph-users] How to configure if there are two network cards in Client
You would look to standard Linux routing to take care of this. If NIC1 is on the same subnet as the Ceph cluster, then it will automatically work. If the NIC is not on the Ceph subnet, then you would use a static route to route traffic for the Ceph network through NIC1. You may need a corresponding route on the router to send traffic to that NIC, but I assume that is already working. There is no Ceph configuration that you need to do.

I'm moving this to the ceph-users list, where it is more appropriate.

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Fri, Dec 25, 2015 at 7:04 AM, 蔡毅 wrote:
> Hi all,
> When we read the code, we couldn't find a way for the client to bind a
> specific IP. In Ceph's configuration, we could only find the parameter
> "public network", but it seems to act on the OSDs, not the client.
> There is a scenario in which the client has two network cards, NIC1 and
> NIC2. NIC1 is responsible for communicating with the cluster (monitors
> and RADOS), and NIC2 carries other services besides Ceph's client. So
> we need the client to bind a specific IP in order to separate the IP
> communicating with the cluster from the IP serving other applications.
> We want to know: is there any configuration in Ceph to achieve this? If
> there is, how can we configure the IP? If not, could this function be
> added to Ceph? Thank you so much.
> Best regards,
> Cai Yi
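A sketch of the static route Robert describes, with placeholder addresses and device names (the persistent-route syntax shown is for CentOS/RHEL, matching the era of this thread):

    # Send all traffic for the Ceph public network out NIC1,
    # sourced from NIC1's address:
    ip route add 192.168.10.0/24 dev eth0 src 192.168.10.5

    # Make it persistent across reboots on CentOS/RHEL:
    echo "192.168.10.0/24 dev eth0 src 192.168.10.5" \
        >> /etc/sysconfig/network-scripts/route-eth0

The kernel then picks NIC1's address as the source for all Ceph traffic, so no client-side bind option is needed.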