On Thu, Sep 13, 2018 at 1:54 PM <patrick.mcl...@sony.com> wrote:
>
> On 2018-09-12 19:49:16-07:00 Jason Dillaman wrote:
>
> On Wed, Sep 12, 2018 at 10:15 PM <patrick.mcl...@sony.com> wrote:
> >
> > On 2018-09-12 17:35:16-07:00 Jason Dillaman wrote:
> >
> > Any chance you know the LBA or byte offset of the corruption so I can
> > compare it against the log?
> >
> > The LBAs of the corruption are 0xA74F000 through 175435776
>
> Are you saying the corruption starts at byte offset 175435776 from the
> start of the RBD image? If so, that would correspond to object 0x29:
>
> Yes, that is where we are seeing the corruption. We have also noticed that
> different runs of export-diff seem to corrupt the data in different ways.
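For reference, the offset-to-object arithmetic is easy to double-check. The
sketch below assumes the image uses the default 4 MiB (order 22) object size;
note that 0xA74F000 is simply 175435776 written in hex:

# Map a logical image offset to an RBD object, assuming default 4 MiB objects.
OBJECT_SIZE = 4 * 1024 * 1024

lba = 0xA74F000                  # reported start of the corruption
print(lba)                       # 175435776 -- the same value in decimal
print(hex(lba // OBJECT_SIZE))   # 0x29 -> rbd_data.4b383f1e836edc.0000000000000029
print(lba % OBJECT_SIZE)         # 3469312 -> offset of the extent within object 0x29

The in-object offset 3469312 lines up with the "overlap extent 3469312~4096
logical 175435776~4096" entry flagged in the log below.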
If you can repeat it on this image by re-running the export, can you please
collect two "rbd export-diff" log outputs with the "--debug-rbd=20" and
"--debug-rados=20" options? I've opened a tracker ticket [1] where you can
paste any additional logs.

> 2018-09-12 21:22:17.117246 7f268928f0c0 20 librbd::DiffIterate: object rbd_data.4b383f1e836edc.0000000000000029: list_snaps complete
> 2018-09-12 21:22:17.117249 7f268928f0c0 20 librbd::DiffIterate: diff [499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096] end_exists=1
> 2018-09-12 21:22:17.117251 7f268928f0c0 20 librbd::DiffIterate: diff_iterate object rbd_data.4b383f1e836edc.0000000000000029 extent 0~4194304 from [0,4194304]
> 2018-09-12 21:22:17.117268 7f268928f0c0 20 librbd::DiffIterate: opos 0 buf 0~4194304 overlap [499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096]
> 2018-09-12 21:22:17.117270 7f268928f0c0 20 librbd::DiffIterate: overlap extent 499712~4096 logical 172466176~4096
> 2018-09-12 21:22:17.117271 7f268928f0c0 20 librbd::DiffIterate: overlap extent 552960~4096 logical 172519424~4096
> 2018-09-12 21:22:17.117272 7f268928f0c0 20 librbd::DiffIterate: overlap extent 589824~4096 logical 172556288~4096
> 2018-09-12 21:22:17.117273 7f268928f0c0 20 librbd::DiffIterate: overlap extent 3338240~4096 logical 175304704~4096
> 2018-09-12 21:22:17.117274 7f268928f0c0 20 librbd::DiffIterate: overlap extent 3371008~4096 logical 175337472~4096
> 2018-09-12 21:22:17.117275 7f268928f0c0 20 librbd::DiffIterate: overlap extent 3469312~4096 logical 175435776~4096  <-------
> 2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate: overlap extent 3502080~4096 logical 175468544~4096
> 2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate: overlap extent 3534848~4096 logical 175501312~4096
> 2018-09-12 21:22:17.117277 7f268928f0c0 20 librbd::DiffIterate: overlap extent 3633152~4096 logical 175599616~4096
>
> ... and I can see it being imported ...
>
> 2018-09-12 22:07:38.698380 7f23ab2ec0c0 20 librbd::io::ObjectRequest: 0x5615cb507da0 send: write rbd_data.38abe96b8b4567.0000000000000029 3469312~4096
>
> Therefore, I don't see anything structurally wrong w/ the
> export/import behavior. Just to be clear, did you freeze/coalesce the
> filesystem before you took the snapshot?
>
> The filesystem was unmounted at the time of the export; our system is
> designed to only work on unmounted filesystems.
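If it helps with cross-checking the --debug-rbd=20 output, the same
changed-extent listing can be reproduced from the python-rbd bindings. This is
only a rough sketch: the pool, image and snapshot names are placeholders, and
it again assumes the default 4 MiB object size.

# Sketch: list the extents that differ between two snapshots of an image,
# i.e. the data librbd::DiffIterate logs at --debug-rbd=20.
import rados
import rbd

OBJECT_SIZE = 4 * 1024 * 1024    # assumes default 4 MiB (order 22) objects

def dump_extent(offset, length, exists):
    # Called once per changed extent; offset/length are logical image bytes.
    print("logical %d~%d -> object 0x%x offset %d exists=%d"
          % (offset, length, offset // OBJECT_SIZE, offset % OBJECT_SIZE, exists))

cluster = rados.Rados(conffile='')        # search the default ceph.conf locations
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')     # placeholder pool name
    try:
        # Open the image at the "to" snapshot and diff against the "from" snapshot.
        with rbd.Image(ioctx, 'myimage', snapshot='snap2') as image:
            image.diff_iterate(0, image.size(), 'snap1', dump_extent)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

Each callback invocation should correspond to one of the "overlap extent ...
logical ..." lines in the export-diff log above.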
> > On Wed, Sep 12, 2018 at 8:32 PM <patrick.mcl...@sony.com> wrote:
> > >
> > > Hi Jason,
> > >
> > > On 2018-09-10 11:15:45-07:00 ceph-users wrote:
> > >
> > > On 2018-09-10 11:04:20-07:00 Jason Dillaman wrote:
> > >
> > > > In addition to this, we are seeing a similar type of corruption in
> > > > another use case when we migrate RBDs and snapshots across pools. In
> > > > this case we clone a version of an RBD (e.g. HEAD-3) to a new pool and
> > > > rely on 'rbd export-diff/import-diff' to restore the last 3 snapshots
> > > > on top. Here too we see cases of fsck and RBD checksum failures.
> > > >
> > > > We maintain various metrics and logs. Looking back at our data we have
> > > > seen the issue at a small scale for a while on Jewel, but the frequency
> > > > increased recently. The timing may have coincided with a move to
> > > > Luminous, but this may be coincidence. We are currently on Ceph 12.2.5.
> > > >
> > > > We are wondering if people are experiencing similar issues with
> > > > 'rbd export-diff / import-diff'. I'm sure many people use it to keep
> > > > backups in sync. Since these are backups, many people may not inspect
> > > > the data often. In our use case, we use this mechanism to keep data in
> > > > sync and actually need the data in the other location often. We are
> > > > wondering if anyone else has encountered any issues; it's quite
> > > > possible that many people have this issue but simply don't realize it.
> > > > We are likely hitting it much more frequently due to the scale of our
> > > > operation (tens of thousands of syncs a day).
> > >
> > > If you are able to recreate this reliably without tiering, it would
> > > assist in debugging if you could capture RBD debug logs during the
> > > export along w/ the LBA of the filesystem corruption to compare
> > > against.
> > >
> > > We haven't been able to reproduce this reliably as of yet; we haven't
> > > actually figured out the exact conditions that cause this to happen, we
> > > have just been seeing it happen on some percentage of export/import-diff
> > > operations.
> > >
> > > Logs from both export-diff and import-diff in a case where the result
> > > gets corrupted are attached. Please let me know if you need any more
> > > information.

[1] http://tracker.ceph.com/issues/35974

--
Jason
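For anyone who wants to catch this kind of mismatch before fsck does, here is
a rough sketch of the checksum comparison described above, using the
python-rbd bindings. Pool, image and snapshot names are placeholders; it
simply hashes the full logical content of the same snapshot on both sides.

# Sketch: compare an image snapshot in the source pool against the copy
# reconstructed by export-diff/import-diff in the destination pool.
import hashlib

import rados
import rbd

CHUNK = 4 * 1024 * 1024          # read in 4 MiB chunks

def snap_sha256(ioctx, image_name, snap_name):
    # Hash the full logical content of image@snap; unwritten extents read as zeros.
    h = hashlib.sha256()
    with rbd.Image(ioctx, image_name, snapshot=snap_name, read_only=True) as image:
        size, offset = image.size(), 0
        while offset < size:
            length = min(CHUNK, size - offset)
            h.update(image.read(offset, length))
            offset += length
    return h.hexdigest()

cluster = rados.Rados(conffile='')
cluster.connect()
try:
    src = cluster.open_ioctx('src-pool')  # placeholder pool names
    dst = cluster.open_ioctx('dst-pool')
    try:
        a = snap_sha256(src, 'myimage', 'snap3')
        b = snap_sha256(dst, 'myimage', 'snap3')
        print('match' if a == b else 'MISMATCH %s != %s' % (a, b))
    finally:
        src.close()
        dst.close()
finally:
    cluster.shutdown()

A mismatch here would flag the same corruption the fsck and RBD checksum
failures above are detecting, without having to mount the filesystem.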