Hi,
We use Ceph RBDs for our users' storage and need to keep the data synchronized 
across data centres. For this we rely on 'rbd export-diff / import-diff'. 
Lately we have been noticing cases in which the file system on the 'destination 
RBD' is corrupt. We have been trying to isolate the issue, which may or may not 
be due to Ceph. We suspect the problem could be in 'rbd export-diff / 
import-diff' and are wondering if people have been seeing issues with these 
tools. Let me explain our use case and issue in more detail.
We have a number of data centres each with a Ceph cluster storing tens of 
thousands of RBDs. We maintain extra copies of each RBD in other data centres. 
After we are 'done' using an RBD, we create a snapshot and use 'rbd export-diff' 
to create a diff between that snapshot and the most recent snapshot 'common' 
with the other data centre. We send the diff over the network and apply it with 
'rbd import-diff' at the destination. When we apply a diff to a destination RBD 
we can guarantee its 'HEAD' is clean, and of course we guarantee that an RBD is 
only in use in one data centre at a time.
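In practice the pipeline looks roughly like this (pool, image and snapshot names 
are placeholders for illustration; our tooling wraps these commands, so the 
exact invocation may differ slightly):

    # snapA already exists in both data centres; snapB is the new snapshot
    $ rbd snap create pool/image@snapB
    $ rbd export-diff --from-snap snapA pool/image@snapB - \
        | ssh dest-host 'rbd import-diff - pool/image'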
We noticed the corruption on the destination RBD through fsck failures; further 
investigation showed that checksums on the RBD mismatch as well. Somehow the 
data is sometimes getting corrupted, either by our software or by 'rbd 
export-diff / import-diff'. Our investigation suggests that the problem is in 
'rbd export-diff / import-diff'. The main evidence for this is that we 
occasionally sync an RBD to multiple data centres, each sync being a separate 
job with its own 'rbd export-diff' run, and we have seen cases where both 
destinations end up with the same corruption (and the same checksum) while the 
source is healthy.
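For concreteness, a checksum mismatch of the kind we mean can be seen by 
exporting the image at a common snapshot on each side and hashing the stream 
(placeholder names again; shown with plain md5sum rather than our actual 
tooling):

    $ rbd export pool/image@snapB - | md5sum    # on the source cluster
    $ rbd export pool/image@snapB - | md5sum    # on a destination cluster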
In addition to this, we are seeing a similar type of corruption in another use 
case, where we migrate RBDs and their snapshots across pools. In this case we 
clone an older version of an RBD (e.g. HEAD-3) into a new pool and rely on 'rbd 
export-diff / import-diff' to restore the last three snapshots on top. Here too 
we see cases of fsck and RBD checksum failures.
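That flow looks roughly as follows (all names are placeholders; snap1 is the 
clone point, i.e. HEAD-3, and snap2..snap4 are the newer snapshots). The extra 
'rbd snap create' on the clone is how I'd sketch it, since 'rbd import-diff' 
expects the diff's start snapshot to already exist on the destination image:

    $ rbd snap protect srcpool/image@snap1
    $ rbd clone srcpool/image@snap1 dstpool/image
    $ rbd snap create dstpool/image@snap1   # clone content matches srcpool/image@snap1
    $ rbd export-diff --from-snap snap1 srcpool/image@snap2 - | rbd import-diff - dstpool/image
    $ rbd export-diff --from-snap snap2 srcpool/image@snap3 - | rbd import-diff - dstpool/image
    $ rbd export-diff --from-snap snap3 srcpool/image@snap4 - | rbd import-diff - dstpool/image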
We maintain various metrics and logs. Looking back at our data, we have seen the 
issue at a small scale for a while on Jewel, but the frequency has increased 
recently. The increase roughly coincided with our move to Luminous, but that may 
be a coincidence. We are currently on Ceph 12.2.5.
We are wondering whether others are experiencing similar issues with 'rbd 
export-diff / import-diff'. I'm sure many people use these tools to keep backups 
in sync, but since backups are rarely read back, corruption there may go 
unnoticed. In our use case we rely on this mechanism to keep data in sync and 
actually need the data at the other location often, so it's quite possible that 
many people have this issue but simply don't realize it. We are also likely 
hitting it much more frequently due to the scale of our operation (tens of 
thousands of syncs a day).