On Thu, Sep 13, 2018 at 1:54 PM <patrick.mcl...@sony.com> wrote:
>
> On 2018-09-12 19:49:16-07:00 Jason Dillaman wrote:
>
>
> On Wed, Sep 12, 2018 at 10:15 PM <patrick.mcl...@sony.com> wrote:
> >
> > On 2018-09-12 17:35:16-07:00 Jason Dillaman wrote:
> >
> >
> > Any chance you know the LBA or byte offset of the corruption so I can
> > compare it against the log?
> >
> > The LBAs of the corruption are 0xA74F000 through 175435776
>
> Are you saying the corruption starts at byte offset 175435776 from the
> start of the RBD image? If so, that would correspond to object 0x29:
>
>
> Yes, that is where we are seeing the corruption. We have also noticed that 
> different runs of export-diff seem to corrupt the data in different ways.
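For reference, with the default 4 MiB (4194304-byte) object size, byte
offset 175435776 lands in object 41 (0x29) at in-object offset 3469312,
which matches the flagged extent in the logs below. A quick shell
sanity check:

  $ echo $((175435776 / 4194304))   # object index
  41                                # = 0x29
  $ echo $((175435776 % 4194304))   # offset within that object
  3469312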

If you can repeat it on this image by re-running the export, can you
please collect two "rbd export-diff" log outputs with the
"--debug-rbd=20" and "--debug-rados=20" options? I've opened a
tracker ticket [1] where you can paste any additional logs.
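
As a rough sketch of such an invocation (the image/snapshot names and
the log path here are placeholders, not your actual ones):

  rbd export-diff --from-snap snap1 rbd/image@snap2 image.diff \
      --debug-rbd=20 --debug-rados=20 --log-file=/tmp/rbd-export-diff.log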

> 2018-09-12 21:22:17.117246 7f268928f0c0 20 librbd::DiffIterate: object
> rbd_data.4b383f1e836edc.0000000000000029: list_snaps complete
> 2018-09-12 21:22:17.117249 7f268928f0c0 20 librbd::DiffIterate:   diff
> [499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096]
> end_exists=1
> 2018-09-12 21:22:17.117251 7f268928f0c0 20 librbd::DiffIterate:
> diff_iterate object rbd_data.4b383f1e836edc.0000000000000029 extent
> 0~4194304 from [0,4194304]
> 2018-09-12 21:22:17.117268 7f268928f0c0 20 librbd::DiffIterate:  opos
> 0 buf 0~4194304 overlap
> [499712~4096,552960~4096,589824~4096,3338240~4096,3371008~4096,3469312~4096,3502080~4096,3534848~4096,3633152~4096]
> 2018-09-12 21:22:17.117270 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 499712~4096 logical 172466176~4096
> 2018-09-12 21:22:17.117271 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 552960~4096 logical 172519424~4096
> 2018-09-12 21:22:17.117272 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 589824~4096 logical 172556288~4096
> 2018-09-12 21:22:17.117273 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 3338240~4096 logical 175304704~4096
> 2018-09-12 21:22:17.117274 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 3371008~4096 logical 175337472~4096
> 2018-09-12 21:22:17.117275 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 3469312~4096 logical 175435776~4096  <-------
> 2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 3502080~4096 logical 175468544~4096
> 2018-09-12 21:22:17.117276 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 3534848~4096 logical 175501312~4096
> 2018-09-12 21:22:17.117277 7f268928f0c0 20 librbd::DiffIterate:
> overlap extent 3633152~4096 logical 175599616~4096
>
> ... and I can see it being imported ...
>
> 2018-09-12 22:07:38.698380 7f23ab2ec0c0 20 librbd::io::ObjectRequest:
> 0x5615cb507da0 send: write rbd_data.38abe96b8b4567.0000000000000029
> 3469312~4096
>
> Therefore, I don't see anything structurally wrong w/ the
> export/import behavior. Just to be clear, did you freeze/coalesce the
> filesystem before you took the snapshot?
>
>
> The filesystem was unmounted at the time of the export, our system is 
> designed to only work on unmounted filesystems.
>
> > On Wed, Sep 12, 2018 at 8:32 PM <patrick.mcl...@sony.com> wrote:
> > >
> > > Hi Jason,
> > >
> > > On 2018-09-10 11:15:45-07:00 ceph-users wrote:
> > >
> > > On 2018-09-10 11:04:20-07:00 Jason Dillaman wrote:
> > >
> > >
> > > > In addition to this, we are seeing a similar type of corruption
> > > > in another use case when we migrate RBDs and snapshots across
> > > > pools. In this case we clone a version of an RBD (e.g. HEAD-3)
> > > > to a new pool and rely on 'rbd export-diff/import-diff' to
> > > > restore the last 3 snapshots on top. Here too we see cases of
> > > > fsck and RBD checksum failures.
> > > > We maintain various metrics and logs. Looking back at our data
> > > > we have seen the issue at a small scale for a while on Jewel,
> > > > but the frequency increased recently. The timing may have
> > > > coincided with a move to Luminous, but this may be coincidence.
> > > > We are currently on Ceph 12.2.5.
> > > > We are wondering if people are experiencing similar issues with
> > > > 'rbd export-diff / import-diff'. I'm sure many people use it to
> > > > keep backups in sync. Since it is backups, many people may not
> > > > inspect the data often. In our use case, we use this mechanism
> > > > to keep data in sync and actually need the data in the other
> > > > location often. We are wondering if anyone else has encountered
> > > > any issues; it's quite possible that many people have this issue
> > > > but simply don't realize it. We are likely hitting it much more
> > > > frequently due to the scale of our operation (tens of thousands
> > > > of syncs a day).
> > >
> > > If you are able to recreate this reliably without tiering, it
> > > would assist in debugging if you could capture RBD debug logs
> > > during the export along w/ the LBA of the filesystem corruption
> > > to compare against.
> > >
> > > We haven't been able to reproduce this reliably as of yet, as we
> > > haven't actually figured out the exact conditions that cause this
> > > to happen; we have just been seeing it happen on some percentage
> > > of export/import-diff operations.
> > >
> > >
> > > Logs from both export-diff and import-diff in a case where the
> > > result gets corrupted are attached. Please let me know if you
> > > need any more information.
> > >
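
For context, the cross-pool clone-and-replay workflow described above
boils down to something like the following rough sketch (pool, image,
and snapshot names are placeholders, and the source snapshot must be
protected before cloning):

  rbd snap protect pool1/image@snap1
  rbd clone pool1/image@snap1 pool2/image
  # the clone starts at snap1's content, so recreate that snapshot
  # on the destination before replaying newer diffs on top of it
  rbd snap create pool2/image@snap1
  rbd export-diff --from-snap snap1 pool1/image@snap2 - \
      | rbd import-diff - pool2/image
  rbd export-diff --from-snap snap2 pool1/image@snap3 - \
      | rbd import-diff - pool2/image
  # a full-image checksum of both sides can then flag corruption
  rbd export pool1/image@snap3 - | md5sum
  rbd export pool2/image@snap3 - | md5sum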

[1] http://tracker.ceph.com/issues/35974

-- 
Jason