Re: [ceph-users] Help! OSD host failure - recovery without rebuilding OSDs
Hi,

Someone here will probably lay out a detailed answer, but to get you started: all the details for the OSDs are in the XFS partitions. Image a new USB key, change the IP, etc., and you should be able to recover. If the journal is linked to a /dev/sdX device, make sure it's in the same spot as it was before.

All the best of luck,
/Josef

On 25 Dec 2015 05:39, "deeepdish" wrote:
> Hello,
>
> Had an interesting issue today.
>
> My OSD hosts are booting off a USB key which, you guessed it, has the
> root partition on there. All OSDs are mounted. My USB key failed on one
> of my OSD hosts, leaving the data on its OSDs inaccessible to the rest
> of my cluster. I have multiple monitors running, and other OSD hosts
> where data can be recovered to. However, I'm wondering if there's a way
> to "restore" / "rebuild" the Ceph install that was on this host without
> having all its OSDs resync again.
>
> Lesson learned = don't use USB boot/root drives. However, now just
> looking at what needs to be done once the OS and Ceph packages are
> reinstalled.
>
> Thank you.
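For the archive, a minimal sketch of the recovery Josef describes, assuming hammer-era ceph-disk tooling and the stock /var/lib/ceph layout (the exact commands are illustrative, not from the original thread; adjust to your install):

    # Keep the cluster from rebalancing while the host is rebuilt
    # (run from any node with an admin keyring):
    ceph osd set noout

    # After reinstalling the OS and the same Ceph release, copy
    # /etc/ceph/ceph.conf and the host's keyrings back from another
    # node, then let ceph-disk find and mount the intact XFS data
    # partitions:
    ceph-disk activate-all

    # Verify each OSD's journal symlink still points at the same
    # device it did before the failure:
    ls -l /var/lib/ceph/osd/ceph-*/journal

    # Once the OSDs are back up and in:
    ceph osd unset noout

Because the OSD data partitions were never lost, the OSDs should rejoin with only a log-based recovery of writes they missed, rather than a full resync.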
Re: [ceph-users] why not add (offset,len) to pglog
Thank you for your reply. I am looking forward to Sage's opinion too @sage. I'll also keep up with BlueStore's and KStore's progress.

Regards

2015-12-25 14:48 GMT+08:00 Ning Yao :
> Hi, Dong Wu,
>
> 1. As I am currently working on other things, this proposal has been
> abandoned for a long time.
> 2. This is a complicated task, as we need to consider a lot (not just
> writeOp, but also truncate, delete) and also the different effects on
> different backends (Replicated, EC).
> 3. I don't think it is a good time to redo this patch now, since
> BlueStore and KStore are in progress, and I'm afraid of introducing
> side-effects. We may prepare and propose the whole design at the next
> CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttling the max recovery ops, setting the priority for recovery,
> and so on). So this kind of patch may not solve a critical problem,
> just make things better, and I am not quite sure that it will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disks (say HDDs), and also for short-time
> maintenance. Otherwise, it will trigger the backfill process. So wait
> for Sage's opinion @sage.
>
> If you are interested in this, we may cooperate to do it.
>
> Regards
> Ning Yao
>
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
>> Thanks, from this pull request I learned that this work is not
>> complete; is there any new progress on it?
>>
>> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
>>> Yeah, this is a good idea for recovery, but not for backfill.
>>> @YaoNing opened a pull request about this earlier this year:
>>> https://github.com/ceph/ceph/pull/3837
>>>
>>> 2015-12-25 11:16 GMT+08:00 Dong Wu :
>>>> Hi,
>>>> I have a doubt about the pglog. The pglog contains
>>>> (op, object, version) etc. When peering, the pglog is used to build
>>>> the missing list, and then each object in the missing list is
>>>> recovered whole, even if the data that differs among replicas is
>>>> much less than a whole object (e.g., 4MB).
>>>> Why not add (offset, len) to the pglog? If so, the missing list
>>>> could contain (object, offset, len), and we could reduce the amount
>>>> of data recovered.
>>>
>>>
>>> --
>>> Regards,
>>> Xinze Chi
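As a concrete illustration of the recovery throttles Ning Yao mentions in point 4 above, a sketch using hammer-era option names (the values are examples only; defaults differ by release):

    # Slow recovery/backfill down so client I/O keeps priority;
    # injected at runtime on all OSDs:
    ceph tell osd.* injectargs \
        '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

These reduce the impact of recovery rather than the amount of data recovered, which is exactly why the (offset, len) idea is attractive.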
Re: [ceph-users] why not add (offset,len) to pglog
On Fri, 25 Dec 2015, Ning Yao wrote:
> Hi, Dong Wu,
>
> 1. As I am currently working on other things, this proposal has been
> abandoned for a long time.
> 2. This is a complicated task, as we need to consider a lot (not just
> writeOp, but also truncate, delete) and also the different effects on
> different backends (Replicated, EC).
> 3. I don't think it is a good time to redo this patch now, since
> BlueStore and KStore are in progress, and I'm afraid of introducing
> side-effects. We may prepare and propose the whole design at the next
> CDS.
> 4. Currently, we already have some tricks to deal with recovery (like
> throttling the max recovery ops, setting the priority for recovery,
> and so on). So this kind of patch may not solve a critical problem,
> just make things better, and I am not quite sure that it will really
> bring a big improvement. Based on my previous test, it works
> excellently on slow disks (say HDDs), and also for short-time
> maintenance. Otherwise, it will trigger the backfill process. So wait
> for Sage's opinion @sage.
>
> If you are interested in this, we may cooperate to do it.

I think it's a great idea. We didn't do it before only because it is complicated. The good news is that if we can't conclusively infer exactly which parts of the object need to be recovered from the log entry, we can always just fall back to recovering the whole thing.

Also, the place where this is currently most visible is RBD small writes:

 - osd goes down
 - client sends a 4k overwrite and modifies an object
 - osd comes back up
 - client sends another 4k overwrite
 - client io blocks while osd recovers 4mb

So even if we initially ignore truncate and omap and EC and clones and anything else complicated, I suspect we'll get a nice benefit.

I haven't thought about this too much, but my guess is that the hard part is making the primary's missing set representation include a partial delta (say, an interval_set<> indicating which ranges of the file have changed) in a way that gracefully degrades to recovering the whole object if we're not sure.

In any case, we should definitely have the design conversation!

sage

>
> Regards
> Ning Yao
>
>
> 2015-12-25 14:23 GMT+08:00 Dong Wu :
> > Thanks, from this pull request I learned that this work is not
> > complete; is there any new progress on it?
> >
> > 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) :
> >> Yeah, this is a good idea for recovery, but not for backfill.
> >> @YaoNing opened a pull request about this earlier this year:
> >> https://github.com/ceph/ceph/pull/3837
> >>
> >> 2015-12-25 11:16 GMT+08:00 Dong Wu :
> >>> Hi,
> >>> I have a doubt about the pglog. The pglog contains
> >>> (op, object, version) etc. When peering, the pglog is used to build
> >>> the missing list, and then each object in the missing list is
> >>> recovered whole, even if the data that differs among replicas is
> >>> much less than a whole object (e.g., 4MB).
> >>> Why not add (offset, len) to the pglog? If so, the missing list
> >>> could contain (object, offset, len), and we could reduce the amount
> >>> of data recovered.
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Xinze Chi
Re: [ceph-users] Tuning ZFS + QEMU/KVM + Ceph RBD’s
Due to the nature of distributed storage, and of a filesystem built to distribute itself across sequential devices, you're always going to have poor performance. Are you unable to use XFS inside the VM?

----- Original Message -----
From: "J David"
To: ceph-users@lists.ceph.com
Sent: Thursday, December 24, 2015 1:10:36 PM
Subject: [ceph-users] Tuning ZFS + QEMU/KVM + Ceph RBD’s

For a variety of reasons, a ZFS pool in a QEMU/KVM virtual machine backed by a Ceph RBD doesn’t perform very well. Does anyone have any tuning tips (on either side) for this workload?

A fair amount of the problem is probably related to two factors.

First, ZFS always assumes it is talking to bare-metal drives. This assumption is baked into it at a very fundamental level, but Ceph is pretty much the polar opposite of that. For one thing, this makes any type of write caching moderately terrifying from a data-loss standpoint. Although, to be fair, we run some non-critical KVM VMs with ZFS filesystems and cache=writeback with no observed ill effects. From the available information, it *seems* safe to do that, but it’s not certain whether, under enough stress and the wrong crash at the wrong moment, a lost/corrupted pool would be the result. ZFS is notorious for exploding if the underlying subsystem lies to it about whether data has been permanently written to disk (that bare-metal assumption again); it’s not an area that encourages pressing one’s luck.

The second issue is that ZFS likes a huge recordsize. It uses small blocks for small files, but as soon as a file grows a little bit, it is happy to use 128KiB blocks (again assuming it’s talking to a physical disk that can do a sequential read of a whole block with minimal added overhead, because the head was already there for the first byte, and what’s a little wasted bandwidth on a 6Gbps SAS bus that has nothing else to do). Ceph, on the other hand, *always* has something else to do, so a 128K read-modify-write cycle to change one byte in the middle of a file winds up being punishingly wasteful.

The RBD striping explanation (on http://docs.ceph.com/docs/hammer/man/8/rbd/ ) seems to suggest that the default object size is 4M, so at least a single 128K read/write should only hit one or (at most) two objects. Whether it’s one or two seems to depend on whether ZFS has a useful interpretation of track size, which it may not. One such virtual machine reports, for a 1TB Ceph image, 62 sectors of 512 bytes per track (31KiB tracks). That could lead to a fair number of object-straddling reads and writes at a 128K record size.

So the main impact of that is massive write amplification; writing one byte can turn into reading and writing 128K from/to 2-6 different OSDs. All of which winds up passing over the storage LAN, introducing tons of latency compared to that hypothetical 6Gbps SAS read that ZFS is designed to expect.
If it helps establish a baseline, the reason this subject comes up is that currently ZFS filesystems on RBD-backed QEMU VMs do stuff like this (iostat -x at 10-second intervals):

    Device: rrqm/s wrqm/s    r/s    w/s  rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
    vdc       0.00   0.00  41.00  31.30  86.35   4006.40   113.22     2.07  27.18   24.10   31.22 13.82 99.92
    vdc       0.00   0.00 146.30  38.10 414.95   4876.80    57.39     2.46  13.64   10.36   26.25  5.42 99.96
    vdc       0.00   0.00 127.30 102.20 256.40  13081.60   116.24     2.07   9.19    8.57    9.97  4.35 99.88
    vdc       0.00   0.00 160.80 160.70 297.30  10592.80    67.75     1.21   3.76    1.73    5.78  2.91 93.68

That’s… not great… for a low-load 10G-LAN Ceph cluster with 60 Intel DC S37x0 SSDs.

Is there some tuning that could be done (on any side: ZFS, QEMU, or Ceph) to optimize performance? Are there any metrics we could collect to gain more insight into what and where the bottlenecks are?

Some combination of changing the ZFS max recordsize, the QEMU virtual disk geometry, and Ceph backend settings seems like it might make a big difference, but there are many combinations, and it feels like guesswork with the available information. So it seems worthwhile to ask if anyone has been down this road, and if so what they found, before spending a week or two rediscovering the wheel.

Thanks for any advice!
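One hedged starting point, pulling together the knobs discussed in the thread. The pool/image names and sizes are placeholders, and the recordsize/--order values are illustrative, not tested recommendations:

    # Cap ZFS's largest block to something closer to the RBD write
    # path; affects newly written files only:
    zfs set recordsize=16K tank/vmdata

    # Hammer-era rbd lets you pick the object size at image creation;
    # --order 20 means 2^20 = 1M objects instead of the 4M default
    # (--size is in MB here, so 1048576 = 1TB):
    rbd create --size 1048576 --order 20 rbd/zfs-vol

    # RBD client-side writeback caching (ceph.conf on the hypervisor);
    # the cache honors guest flushes, but test your failure modes:
    #   [client]
    #   rbd cache = true
    #   rbd cache writethrough until flush = true

    # And in the libvirt disk definition:
    #   <driver name='qemu' type='raw' cache='writeback'/>

A smaller recordsize trades away ZFS's sequential-read win (which doesn't exist on RBD anyway) to shrink the read-modify-write unit, so it attacks the write amplification described above directly.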
Re: [ceph-users] nfs over rbd problem
I didn't read the whole thing, but if you're trying to do HA NFS, you need to run OCFS2 on your RBD and disable read/write caching on the RBD client.

----- Original Message -----
From: "Steve Anthony"
To: ceph-users@lists.ceph.com
Sent: Friday, December 25, 2015 12:39:01 AM
Subject: Re: [ceph-users] nfs over rbd problem

I've run into many problems trying to run RBD/NFS under Pacemaker like you describe on a two-node cluster. In my case, most of the problems were a result of a) no quorum and b) no STONITH.

If you're going to be running this setup in production, I *highly* recommend adding more nodes (if only to maintain quorum). Without quorum, you will run into split-brain situations where you'll see anything from both nodes starting the same services to configuration changes disappearing when the nodes can't agree on a recent revision.

In this case, it looks like a STONITH problem. Without a fencing mechanism, node2 cannot be 100% certain that node1 is dead. When you manually shut down corosync, I suspect that as part of the process, node1 alerts the cluster that it's leaving as part of a planned shutdown. When it just disappears, node2 can't determine what happened; it just knows node1 isn't answering anymore. Node1 is not necessarily down; there could just be a network problem between node1 and node2. In such a scenario, it would be very bad for node2 to map/mount the RBDs and start writing data while node1 is still providing the same service. Quorum can help here too, but for node2 to be certain node1 is down, it needs an out-of-band method to force power off node1; e.g., I use IPMI on a separate physical network.

So... add more nodes as quorum members. You can set weight restrictions on the resource groups to prevent them from running on these nodes if they are not as powerful as the nodes you're using now. Then add a STONITH mechanism on all the nodes, and verify it works (a sketch follows at the end of this message). Once you do that, you should see things act the way you expect.

Good luck!

-Steve

On 12/19/2015 03:46 AM, maoqi1982 wrote:

Hi list,

I have a test Ceph cluster of 3 nodes (node0: mon; node1: osd and nfs server1; node2: osd and nfs server2). OS: CentOS 6.6; kernel: 3.10.94-1.el6.elrepo.x86_64; ceph version 0.94.5. I followed the http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/ instructions to set up an active/standby NFS environment.

When using "service corosync stop" or "poweroff" on node1, the failover went fine (the NFS server was taken over by node2). But when I tested cutting off the power of node1, the switch failed.

1. Initial state:

    [root@node1 ~]# crm status
    Last updated: Fri Dec 18 17:14:19 2015
    Last change: Fri Dec 18 17:13:29 2015
    Stack: classic openais (with plugin)
    Current DC: node1 - partition with quorum
    Version: 1.1.11-97629de
    2 Nodes configured, 3 expected votes
    8 Resources configured

    Online: [ node1 node2 ]

    Resource Group: g_rbd_share_1
        p_rbd_map_1    (ocf::ceph:rbd.in):          Started node1
        p_fs_rbd_1     (ocf::heartbeat:Filesystem): Started node1
        p_export_rbd_1 (ocf::heartbeat:exportfs):   Started node1
        p_vip_1        (ocf::heartbeat:IPaddr):     Started node1
    Clone Set: clo_nfs [g_nfs]
        Started: [ node1 node2 ]

2. Stop corosync on node1:
    [root@node1 ~]# service corosync stop

    [root@node2 cluster]# crm status
    Last updated: Fri Dec 18 17:14:59 2015
    Last change: Fri Dec 18 17:13:29 2015
    Stack: classic openais (with plugin)
    Current DC: node2 - partition WITHOUT quorum
    Version: 1.1.11-97629de
    2 Nodes configured, 3 expected votes
    8 Resources configured

    Online: [ node2 ]
    OFFLINE: [ node1 ]

    Resource Group: g_rbd_share_1
        p_rbd_map_1    (ocf::ceph:rbd.in):          Started node2
        p_fs_rbd_1     (ocf::heartbeat:Filesystem): Started node2
        p_export_rbd_1 (ocf::heartbeat:exportfs):   Started node2
        p_vip_1        (ocf::heartbeat:IPaddr):     Started node2
    Clone Set: clo_nfs [g_nfs]
        Started: [ node2 ]
        Stopped: [ node1 ]

3. Cut off node1's power manually:

    [root@node2 cluster]# crm status
    Last updated: Fri Dec 18 17:23:06 2015
    Last change: Fri Dec 18 17:13:29 2015
    Stack: classic openais (with plugin)
    Current DC: node2 - partition WITHOUT quorum
    Version: 1.1.11-97629de
    2 Nodes configured, 3 expected votes
    8 Resources configured

    Online: [ node2 ]
    OFFLINE: [ node1 ]

    Clone Set: clo_nfs [g_nfs]
        Started: [ node2 ]
        Stopped: [ node1 ]

    Failed actions:
        p_rbd_map_1_start_0 on node2 'unknown error' (1): call=48,
        status=Timed Out, last-rc-change='Fri Dec 18 17:22:19 2015',
        queued=0ms, exec=20002ms

corosync.log:

    Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 668: memb=1, new=0, lost=1
    Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: memb: node2 1211279552
    Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: lost: node1 1194502336
    Dec 18 17:22:19 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 668: memb=1, new=0, lost=0
    Dec 18 17:22:19 corosync [pcmk ] info: pcmk_peer_update: MEMB: node2 1211279552
    Dec 18 17:22:19 corosync [pcmk ]
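The sketch Steve refers to: a rough outline of IPMI-based STONITH in crmsh syntax. The agent name, addresses, and credentials are placeholders; check which fencing agents your packages actually ship before copying any of this:

    # One fencing primitive per node, constrained so a node never
    # runs its own fencing device:
    crm configure primitive p_fence_node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=10.10.10.1 userid=admin \
               passwd=secret interface=lan \
        op monitor interval=60s
    crm configure location l_fence_node1 p_fence_node1 -inf: node1

    # (repeat for node2, then enable fencing)
    crm configure property stonith-enabled=true

    # With a third quorum node added, stop ignoring quorum loss:
    crm configure property no-quorum-policy=stop

With this in place, node2 power-cycles node1 via its BMC before taking over the RBD map, which is what was missing in the power-cut test above.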
Re: [ceph-users] How to configure if there are two network cards in Client
You would look to standard Linux routing to take care of this. If NIC1 is on the same subnet as the Ceph cluster, then it will automatically work. If the NIC is not on the Ceph subnet, then you would use a static route to route traffic for the Ceph network through NIC1. You may need a corresponding route on the router to send traffic to that NIC, but I assume that is already working. There is no Ceph configuration that you need to do.

I'm moving this to the ceph-users list, where it is more appropriate.

- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Fri, Dec 25, 2015 at 7:04 AM, 蔡毅 wrote:
> Hi all,
> When we read the code, we couldn't find a way for the client to bind a
> specific IP. In Ceph's configuration, we could only find the parameter
> "public network", but it seems to act on the OSDs, not the client.
> There is a scenario in which the client has two network cards, NIC1 and
> NIC2. NIC1 is responsible for communicating with the cluster (monitors
> and RADOS), and NIC2 carries other services besides Ceph's client. So
> we need the client to bind a specific IP in order to separate the IP
> communicating with the cluster from the IP serving other applications.
> We want to know: is there any configuration in Ceph to achieve this? If
> there is, how can we configure the IP? If not, could this function be
> added to Ceph? Thank you so much.
> Best regards,
> Cai Yi
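A sketch of the static route Robert describes, with placeholder addresses and device names (the persistent-route syntax shown is for CentOS/RHEL, matching the era of this thread):

    # Send all traffic for the Ceph public network out NIC1,
    # sourced from NIC1's address:
    ip route add 192.168.10.0/24 dev eth0 src 192.168.10.5

    # Make it persistent across reboots on CentOS/RHEL:
    echo "192.168.10.0/24 dev eth0 src 192.168.10.5" \
        >> /etc/sysconfig/network-scripts/route-eth0

The kernel then picks NIC1's address as the source for all Ceph traffic, so no client-side bind option is needed.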