Mike, I am using 1.5.0: 

QEMU emulator version 1.5.0 (Debian 1.5.0+dfsg-3ubuntu5~cloud0), Copyright (c) 
2003-2008 Fabrice Bellard 

which is installed from the Ubuntu cloud Havana ppa 
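
In case it helps anyone else on the thread, here is a quick sketch for comparing an installed QEMU version against the 1.4.2 threshold Mike mentions below. The binary name varies by distro (qemu-system-x86_64, kvm, ...), so the version string is hard-coded here purely for illustration:

```shell
# version_ge A B: succeeds if version A >= version B, using GNU
# coreutils' version-aware sort (-V). The smaller of the two versions
# sorts first, so A >= B exactly when B is the first line of the sort.
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Illustrative: in practice you would extract the version from the
# binary's output, e.g.
#   ver=$(qemu-system-x86_64 --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
ver="1.5.0"

if version_ge "$ver" "1.4.2"; then
    echo "qemu $ver is at or above 1.4.2"
else
    echo "qemu $ver predates 1.4.2 - consider upgrading"
fi
```

`sort -V` handles multi-digit components correctly (e.g. 1.10.0 > 1.9.0), which a plain lexical comparison would get wrong.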

Thanks 


-- 
Andrei Mikhailovsky 
Director 
Arhont Information Security 

Web: http://www.arhont.com 
http://www.wi-foo.com 
Tel: +44 (0)870 4431337 
Fax: +44 (0)208 429 3111 
PGP: Key ID - 0x2B3438DE 
PGP: Server - keyserver.pgp.com 

DISCLAIMER 

The information contained in this email is intended only for the use of the 
person(s) to whom it is addressed and may be confidential or contain legally 
privileged information. If you are not the intended recipient you are hereby 
notified that any perusal, use, distribution, copying or disclosure is strictly 
prohibited. If you have received this email in error please immediately advise 
us by return email at and...@arhont.com and delete and purge the email and any 
attachments without making a copy. 


----- Original Message -----

From: "Mike Dawson" <mike.daw...@cloudapt.com> 
To: "Andrei Mikhailovsky" <and...@arhont.com>, "Uwe Grohnwaldt" 
<u...@grohnwaldt.eu> 
Cc: ceph-users@lists.ceph.com 
Sent: Sunday, 29 December, 2013 8:58:29 AM 
Subject: Re: [ceph-users] Pause i/o from time to time 

What version of qemu do you have? 

The issues I had were fixed once I upgraded qemu to >=1.4.2, which 
includes a critical RBD patch for asynchronous I/O from Josh Durgin. 

Cheers, 
Mike 

On 12/28/2013 4:09 PM, Andrei Mikhailovsky wrote: 
> 
> Hi guys, 
> 
> Did anyone figure out what could be causing this problem, and is there a workaround? 
> 
> I've noticed a very annoying behaviour in my VMs. It happens 
> randomly about 5-10 times a day, and the pauses last between 2 and 10 
> minutes. It happens across all VMs on all host servers in my cluster. I 
> am running 0.67.4 on Ubuntu 12.04 with the 3.11 kernel from backports. 
> 
> Initially I thought that these pauses were caused by the scrubbing issue 
> reported by Mike; however, I've also noticed the stalls when the cluster 
> is not scrubbing. Both of my OSD servers are pretty idle (load around 1 
> to 2), with the OSDs less than 10% utilised. 
> 
> Unlike Uwe's case, I am not using iSCSI, but plain RBD with qemu, and I 
> do not see any I/O errors in dmesg or kernel panics. The VMs just freeze 
> and become unresponsive, so I can't ssh into them or run simple commands 
> like ls. The VMs do respond to pings, though. 
> 
> Thanks 
> 
> Andrei 
> 
> ------------------------------------------------------------------------ 
> *From: *"Uwe Grohnwaldt" <u...@grohnwaldt.eu> 
> *To: *ceph-users@lists.ceph.com 
> *Sent: *Thursday, 24 October, 2013 8:31:42 AM 
> *Subject: *Re: [ceph-users] Pause i/o from time to time 
> 
> Hello ceph-users, 
> 
> we hit a similar problem last Thursday and again today. We have a 
> cluster consisting of 6 storage nodes containing 70 OSDs (JBOD 
> configuration). We created several RBD devices, mapped them on a 
> dedicated server, and exported them via targetcli. These iSCSI targets 
> are connected to Citrix XenServer 6.1 (with HF30) and XenServer 6.2 (HF4). 
> 
> Recently, some disks died. After this, some errors occurred on the 
> dedicated iSCSI target: 
> Oct 23 15:19:42 targetcli01 kernel: [673836.709887] end_request: I/O 
> error, dev rbd4, sector 2034037064 
> Oct 23 15:19:42 targetcli01 kernel: [673836.713596] 
> test_bit(BIO_UPTODATE) failed for bio: ffff880127546c00, err: -6 
> Oct 23 15:19:43 targetcli01 kernel: [673837.497382] end_request: I/O 
> error, dev rbd4, sector 2034037064 
> Oct 23 15:19:43 targetcli01 kernel: [673837.501323] 
> test_bit(BIO_UPTODATE) failed for bio: ffff880124d933c0, err: -6 
> 
> These errors propagate up to the virtual machines and lead to read-only 
> filesystems. We could trigger this behaviour by setting one disk to out. 
> 
> We are using Ubuntu 13.04 with the latest stable Ceph (ceph version 0.67.4 
> (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)). 
> 
> Our ceph.conf is like this: 
> 
> [global] 
> filestore_xattr_use_omap = true 
> mon_host = 10.200.20.1,10.200.20.2,10.200.20.3 
> osd_journal_size = 1024 
> public_network = 10.200.40.0/16 
> mon_initial_members = ceph-mon01, ceph-mon02, ceph-mon03 
> cluster_network = 10.210.40.0/16 
> auth_supported = none 
> fsid = 9283e647-2b57-4077-b427-0d3d656233b3 
> 
> [osd] 
> osd_max_backfills = 4 
> osd_recovery_max_active = 1 
> 
> [osd.0] 
> public_addr = 10.200.40.1 
> cluster_addr = 10.210.40.1 
> .... 
> .... 
> 
> After the first outage we set osd_max_backfills to 8, and after the second 
> one to 4, but it didn't help. It seems like the bug mentioned at 
> http://tracker.ceph.com/issues/6278 . The problem is that this is a 
> production environment, and the problems began after we moved several VMs 
> to it. In our test environment we can't reproduce it, but we are working 
> on a larger test installation. 
> 
> Does anybody have an idea how to investigate further without destroying 
> virtual machines? ;) 
> 
> Sometimes these I/O errors lead to kernel panics on the iSCSI target 
> machine. The targetcli/lio config is a simple default config without any 
> tuning. 
> 
> 
> Mit freundlichen Grüßen / Best Regards, 
> Uwe Grohnwaldt 
> 
> ----- Original Message ----- 
> > From: "Timofey" <timo...@koolin.ru> 
> > To: "Mike Dawson" <mike.daw...@cloudapt.com> 
> > Cc: ceph-users@lists.ceph.com 
> > Sent: Dienstag, 17. September 2013 22:37:44 
> > Subject: Re: [ceph-users] Pause i/o from time to time 
> > 
> > I have examined the logs. 
> > Yes, the first time it could have been scrubbing; it repaired itself. 
> > 
> > I had 2 servers before the first problem: one dedicated to an OSD (osd.0), 
> > and a second with an OSD and websites (osd.1). 
> > After the problem I added a third server dedicated to an OSD (osd.2) and 
> > ran ceph osd out osd.1 to migrate the data. 
> > 
> > In ceph -s I saw a normal rebalancing process, and everything worked 
> > well for about 5-7 hours. 
> > Then I got many misdirected records (a few hundred per second): 
> > osd.0 [WRN] client.359671 misdirected client.359671.1:220843 pg 
> > 2.3ae744c0 to osd.0 not [2,0] in e1040/1040 
> > and errors in I/O operations. 
> > 
> > Now I have about 20 GB of Ceph logs with these errors. (I don't work with 
> > the cluster now - I have copied all the data out to an HDD and work from 
> > there.) 
> > 
> > Is there any way to have a local software RAID1 of a Ceph RBD and a local 
> > image (to keep working when Ceph fails or is slow for any reason)? 
> > I tried mdadm, but it worked badly - the server hung every few hours. 
> > 
> > > You could be suffering from a known but unfixed issue [1] where 
> > > spindle contention from scrub and deep-scrub causes periodic stalls 
> > > in RBD. You can try to disable scrub and deep-scrub with: 
> > > 
> > > # ceph osd set noscrub 
> > > # ceph osd set nodeep-scrub 
> > > 
> > > If your problem stops, Issue #6278 is likely the cause. To 
> > > re-enable scrub and deep-scrub: 
> > > 
> > > # ceph osd unset noscrub 
> > > # ceph osd unset nodeep-scrub 
> > > 
> > > Because you seem to only have two OSDs, you may also be saturating 
> > > your disks even without scrub or deep-scrub. 
> > > 
> > > http://tracker.ceph.com/issues/6278 
> > > 
> > > Cheers, 
> > > Mike Dawson 
> > > 
> > > 
> > > On 9/16/2013 12:30 PM, Timofey wrote: 
> > >> I use Ceph for an HA cluster. 
> > >> Sometimes Ceph RBD pauses its work (I/O operations stop). 
> > >> Sometimes it happens when one of the OSDs responds slowly to requests. 
> > >> Sometimes it is my own mistake (xfs_freeze -f on one of the 
> > >> OSD drives). 
> > >> I have 2 storage servers with one OSD on each. These pauses can last 
> > >> a few minutes. 
> > >> 
> > >> 1. Is there any setting to quickly change the primary OSD if the 
> > >> current OSD works badly (slow, not responding)? 
> > >> 2. Can I use a Ceph RBD in a software RAID array with a local drive, 
> > >> to use the local drive instead of Ceph if the Ceph cluster fails? 
> > >> _______________________________________________ 
> > >> ceph-users mailing list 
> > >> ceph-users@lists.ceph.com 
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > >> 
> > 
> > 
> 
> 
> 
> 

