On 12/30/2014 09:40 AM, Chen, Xiaoxi wrote:
> Hi,
> First of all, the data is safe since it's persistent in journal, if error
> occurs on OSD data partition, replay the journal will get the data back.

Agreed. Data is safe in the journal. But when the journal is flushed, data is moved to the filestore and is not flushed to disk immediately.

> And, there is a wbthrottle there, you can config how much data(ios,
> bytes, inodes) you want to remain in memory. A background thread will start
> to flush data into disk when any of the value exceeds
> "filestore_wbthrottle_[xfs,btrfs]_[bytes,ios,inodes]_start_flusher", and
> will block the filestore op thread when hard limit exceeds. You could set
> these value to something smaller if you still not feeling comfortable:)

I assume that you are talking about WBThrottle::entry() (src/os/WBThrottle.cc). There is an fsync/fdatasync call there, but its return value isn't checked at all. So if you call *write*, the data sits in a dirty buffer. Then you flush that data to disk by calling *fsync* without checking its return value. If there was an IO error, *fsync* will return -1, meaning data has been lost. The OSD will not be aware of this.
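To make the concern concrete, here is a minimal sketch of the pattern I would expect: check the result of *write* and also the result of *fsync*. This is not the Ceph code. The helper names (write_all, write_and_sync, write_file_checked) are hypothetical and exist only for illustration. It assumes a POSIX system.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: write the whole buffer, retrying on short writes
// and EINTR. Returns 0 on success or a negative errno value.
int write_all(int fd, const char* buf, size_t len) {
  assert(buf != nullptr || len == 0);
  while (len > 0) {
    ssize_t n = ::write(fd, buf, len);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, try again
      return -errno;
    }
    buf += n;
    len -= static_cast<size_t>(n);
  }
  return 0;
}

// Hypothetical helper: a successful write() only means the data reached
// the page cache, so the fsync() result has to be checked as well.
int write_and_sync(int fd, const char* buf, size_t len) {
  int r = write_all(fd, buf, len);
  if (r < 0) return r;
  if (::fsync(fd) < 0) return -errno;  // e.g. -EIO: data may not be on disk
  return 0;
}

// Hypothetical convenience wrapper used for the usage example below.
// Syncing the parent directory is omitted to keep the sketch short.
int write_file_checked(const char* path, const char* buf, size_t len) {
  int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) return -errno;
  int r = write_and_sync(fd, buf, len);
  if (::close(fd) < 0 && r == 0) r = -errno;
  return r;
}
```

A caller would treat any negative return from write_and_sync() as a possible data loss (for an OSD that would probably mean failing the operation or aborting), instead of assuming the data is on disk once *write* has returned.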
> Xiaoxi
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Paweł Sadowski
> Sent: Tuesday, December 30, 2014 4:10 PM
> To: ceph-users
> Subject: [ceph-users] Ceph data consistency
>
> Hi,
>
> On our Ceph cluster from time to time we have some inconsistent PGs (after
> deep-scrub). We have some issues with disk/sata cables/lsi controller causing
> IO errors from time to time (but that's not the point in this case).
>
> When IO error occurs on OSD journal partition everything works as it should
> -> OSD is crashed and that's ok - Ceph will handle that.
>
> But when IO error occurs on OSD data partition during journal flush OSD
> continue to work. After calling *writev* (in buffer::list::write_fd) OSD does
> check return code from this call but does NOT verify if write has been
> successful to disk (data are still only in memory and there is no fsync).
> That way OSD thinks that data has been stored on disk but it might be
> discarded (during sync dirty page will be reclaimed and you'll see "lost page
> write due to I/O error" in dmesg).
>
> Since there is no checksumming of data I just wanted to make sure that this
> is by design. Maybe there is a way to tell OSD to call fsync after write and
> have data consistent?
>

--
PS