Thanks for the info. I'll try that for now and see if it helps.

On Fri Feb 13 2015 at 09:00:59, Alexandre DERUMIER <aderum...@odiso.com> wrote:
> > Can this timeout be increased in some way? I've searched around and
> > found the /sys/block/sdx/device/timeout knob, which in my case is set
> > to 30s.
>
> yes, sure
>
> echo 60 > /sys/block/sdx/device/timeout
>
> for 60s, for example
>
> ----- Original Mail -----
> From: "Krzysztof Nowicki" <krzysztof.a.nowi...@gmail.com>
> To: "Andrey Korolyov" <and...@xdel.ru>, "aderumier" <aderum...@odiso.com>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Friday, 13 February 2015 08:18:26
> Subject: Re: [ceph-users] OSD slow requests causing disk aborts in KVM
>
> On Thu Feb 12 2015 at 16:23:38, Andrey Korolyov <and...@xdel.ru> wrote:
>
> > On Fri, Feb 6, 2015 at 12:16 PM, Krzysztof Nowicki
> > <krzysztof.a.nowi...@gmail.com> wrote:
> > > Hi all,
> > >
> > > I'm running a small Ceph cluster with 4 OSD nodes, which serves as a
> > > storage backend for a set of KVM virtual machines. The VMs use RBD
> > > for disk storage. On the VM side I'm using virtio-scsi instead of
> > > virtio-blk in order to gain DISCARD support.
> > >
> > > Each OSD node runs on a separate machine, using a 3TB WD Black drive
> > > plus a Samsung SSD for the journal. The machines used for the OSD
> > > nodes are not equal in spec: three of them are small servers, while
> > > one is a desktop PC. The last node is the one causing trouble.
> > > During high load caused by remapping due to one of the other nodes
> > > going down, I experienced some slow requests. To my surprise,
> > > however, these slow requests caused aborts from the block device on
> > > the VM side, which ended up corrupting files.
> > >
> > > What I wonder is whether such behaviour (aborts) is normal when slow
> > > requests pile up. I always thought that these requests would be
> > > delayed but eventually they'd be handled. Are there any tunables
> > > that would help me avoid such situations? I would really like to
> > > avoid VM outages caused by such corruption issues.
> > >
> > > I can attach some logs if needed.
> > > Best regards
> > > Chris
> >
> > Hi, this is an inevitable payoff for using the SCSI backend on storage
> > that is capable of operations slow enough. There were some
> > argonaut/bobtail-era discussions on the Ceph ML; those readings may be
> > interesting for you. AFAIR the SCSI disk would abort after 70s of not
> > receiving an ack for a pending operation.
>
> Can this timeout be increased in some way? I've searched around and
> found the /sys/block/sdx/device/timeout knob, which in my case is set
> to 30s.
>
> As for the versions: I'm running all Ceph nodes on Gentoo with Ceph
> version 0.80.5. The VM guest in question is running Ubuntu 12.04 LTS
> with kernel 3.13. The guest filesystem is BTRFS.
>
> I'm thinking that the corruption may be a BTRFS bug.
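For reference, the one-off `echo 60 > /sys/block/sdx/device/timeout` above can be sketched as a small script that applies a longer timeout to every SCSI disk in the guest and also prints a udev rule so the setting survives reboots. This is only a sketch: the 120s value, the `sd[a-z]` match, and the suggested rules-file name are my assumptions, not anything from the thread; adjust them for your environment.

```shell
#!/bin/sh
# Sketch: raise the kernel SCSI command timeout (default 30s) so transient
# Ceph slow requests don't escalate into aborts in the guest.
TIMEOUT_SECS=120   # assumed value; pick something above your worst-case stall

# Apply immediately to every current sd* device (requires root; devices
# where /sys is not writable are skipped silently).
for f in /sys/block/sd*/device/timeout; do
    [ -w "$f" ] && echo "$TIMEOUT_SECS" > "$f"
done

# Emit a udev rule so the timeout is reapplied on boot/hotplug.
# Save the output as e.g. /etc/udev/rules.d/60-scsi-timeout.rules
# (file name is an assumption).
RULE="ACTION==\"add\", SUBSYSTEM==\"block\", KERNEL==\"sd[a-z]\", ATTR{device/timeout}=\"$TIMEOUT_SECS\""
echo "$RULE"
```

Note that raising the timeout only papers over the stall; the slow requests on the weak OSD node still need to be addressed for the VMs to behave well during recovery.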
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com