Hello,

I have set up two separate Ceph clusters, each with an RGW instance, and am trying to 
achieve multisite data synchronization. The primary runs 13.2.5 and the slave runs 14.2.2 
(I upgraded the slave side from 14.2.1 because of the known data corruption during 
transfer caused by curl errors). I emptied the slave zone and allowed the sync to run 
from beginning to end. I then recalculated MD5 hashes over the original data and over 
the data in the slave zone and found that in some cases they do not match. Data 
corruption is evident.
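
The comparison boils down to something like the following rough boto3 sketch
(the endpoints, bucket name and credentials here are placeholders, not my real
setup): stream each object from both zones, compute MD5, and report mismatches.

import hashlib
import boto3

def md5_of_object(client, bucket, key):
    # Stream the object body and return its hex MD5 digest.
    h = hashlib.md5()
    body = client.get_object(Bucket=bucket, Key=key)["Body"]
    for chunk in iter(lambda: body.read(1024 * 1024), b""):
        h.update(chunk)
    return h.hexdigest()

# Placeholder endpoints and credentials - substitute the real zone endpoints.
primary = boto3.client("s3", endpoint_url="http://primary-rgw:8080",
                       aws_access_key_id="ACCESS", aws_secret_access_key="SECRET")
secondary = boto3.client("s3", endpoint_url="http://secondary-rgw:8080",
                         aws_access_key_id="ACCESS", aws_secret_access_key="SECRET")

bucket = "test-bucket"
for page in primary.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    for entry in page.get("Contents", []):
        key = entry["Key"]
        if md5_of_object(primary, bucket, key) != md5_of_object(secondary, bucket, key):
            print("MISMATCH:", key)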

A byte-for-byte comparison shows that some parts of the data have simply moved 
around (for example, a file has the correct bytes from 0 to 513k and then exactly the 
same bytes repeated from 513k up to 1026k - it feels like some sort of buffer issue). 
The file size is correct. I have read the RGW sources and could not find anything that 
would cause this kind of behavior; however, I also failed to find one piece of code 
which I would consider critical: during FetchRemote, RGW obtains the object's ETag 
from the remote RGW, but apparently nothing recalculates the MD5 from the actual data 
and compares it to the received ETag to ensure that the data was transferred 
correctly. Even broken data is then stored to the local cluster and into the bucket 
index, and there is nothing further to prevent the wrong data from reaching an end 
user downloading from the slave zone.
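
To make the suggestion concrete, here is a rough Python sketch (not RGW code) of 
the check I have in mind; fetch_remote_object and commit_to_bucket_index are purely 
hypothetical placeholders for whatever the sync path actually does:

import hashlib

class SyncRetryLater(Exception):
    # Raised so the sync machinery would retry this object later.
    pass

def verified_store(fetch_remote_object, commit_to_bucket_index, key):
    # fetch_remote_object is assumed to return the raw bytes plus the ETag
    # reported by the remote RGW.
    data, remote_etag = fetch_remote_object(key)
    etag = remote_etag.strip('"')
    # A plain (non-multipart) ETag is the MD5 of the object body; multipart
    # ETags carry a "-<parts>" suffix and would need per-part handling.
    if "-" not in etag:
        local_md5 = hashlib.md5(data).hexdigest()
        if local_md5 != etag:
            # Fail the sync instead of committing corrupt data; a later
            # retry can fetch the object again.
            raise SyncRetryLater(f"{key}: MD5 {local_md5} does not match ETag {etag}")
    commit_to_bucket_index(key, data)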

Such a check should happen regardless of the sync backend in use: objects with broken 
data should not be stored in the bucket index at all. No one needs broken data, and it 
is better simply to fail the object sync so it can be retried later than to store 
faulty data and then allow end users to download it.

Is it just me failing to see such a check, or is it actually missing?! If it is not 
there at all, I think it should be quite high on the TODO list.

Regards,
Vladimir