Re: [ceph-users] Ceph data consistency

Paweł Sadowski Tue, 30 Dec 2014 01:15:08 -0800

On 12/30/2014 09:40 AM, Chen, Xiaoxi wrote:
> Hi,
>    First of all, the data is safe since it is persisted in the journal; if an 
> error occurs on the OSD data partition, replaying the journal will get the 
> data back.
Agreed. Data is safe in the journal. But when the journal is flushed, the
data is moved to the filestore and is not flushed to disk immediately.
>    Also, there is a wbthrottle there; you can configure how much data (ios, 
> bytes, inodes) you want to remain in memory. A background thread will start 
> to flush data to disk when any of the values exceeds 
> "filestore_wbthrottle_[xfs,btrfs]_[bytes,ios,inodes]_start_flusher", and 
> will block the filestore op thread when the hard limit is exceeded. You could 
> set these values to something smaller if you still don't feel comfortable :)
I assume that you are talking about WBThrottle::entry()
(src/os/WBThrottle.cc). There is an fsync/fdatasync call there, but its
return value isn't checked at all. So if you call *write*, the data sits in
a dirty buffer. Then you flush that data to disk by calling *fsync*
without checking its return value. If there was an IO error, *fsync*
will return -1, meaning data may have been lost. The OSD will not be aware
of this.
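
To make the point concrete, here is a minimal sketch in plain C of the
pattern I would expect: every write is checked, and so is the fsync that
follows it. This is not Ceph code. write_and_sync is a hypothetical helper
name I made up for illustration, and what the caller does with the error
(crash the OSD, mark the object as bad, ...) is left open.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Ceph): write len bytes from buf to fd
 * and force them to stable storage.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_and_sync(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    /* write() may be short or interrupted, so loop until all bytes are queued. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing anything: retry */
            return -errno;         /* write itself failed */
        }
        p += n;
        len -= (size_t)n;
    }

    /*
     * At this point the data may only be in the page cache. An IO error
     * during writeback is reported here, so the return value must be checked.
     */
    if (fsync(fd) < 0)
        return -errno;

    return 0;
}
```

A caller would treat any non-zero result as "this data is not known to be
on disk" instead of assuming the write succeeded.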



>                                                               Xiaoxi
>     
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Paweł Sadowski
> Sent: Tuesday, December 30, 2014 4:10 PM
> To: ceph-users
> Subject: [ceph-users] Ceph data consistency
>
> Hi,
>
> On our Ceph cluster we get some inconsistent PGs from time to time (after 
> deep-scrub). We have some issues with disks/SATA cables/LSI controller causing 
> IO errors from time to time (but that's not the point in this case).
>
> When an IO error occurs on the OSD journal partition everything works as it 
> should -> the OSD crashes and that's ok - Ceph will handle that.
>
> But when an IO error occurs on the OSD data partition during journal flush, 
> the OSD continues to work. After calling *writev* (in buffer::list::write_fd) 
> the OSD does check the return code from this call but does NOT verify that 
> the write reached the disk (data is still only in memory and there is no 
> fsync). That way the OSD thinks that the data has been stored on disk but it 
> might be discarded (during sync the dirty page will be reclaimed and you'll 
> see "lost page write due to I/O error" in dmesg).
>
> Since there is no checksumming of data I just wanted to make sure that this 
> is by design. Maybe there is a way to tell the OSD to call fsync after a 
> write and keep the data consistent?
>
-- 
PS

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
