Help, my Ceph cluster is losing data slowly over time. I keep finding files that are still the correct length, but whose content has been lost and replaced entirely by NUL bytes.
Here is an example (I have the original file from a backup):

[root@blotter docker]# ls -lart /backup/space/docker/ceph-monitor/ceph-w-monitor.py /space/docker/ceph-monitor/ceph-w-monitor.py
-rwxrwxrwx 1 root root 7237 Mar 12 07:34 /backup/space/docker/ceph-monitor/ceph-w-monitor.py
-rwxrwxrwx 1 root root 7237 Mar 12 07:34 /space/docker/ceph-monitor/ceph-w-monitor.py
[root@blotter docker]# sum /backup/space/docker/ceph-monitor/ceph-w-monitor.py
19803     8
[root@blotter docker]# sum /space/docker/ceph-monitor/ceph-w-monitor.py
00000     8

If I had to _guess_ I would blame a recent change to the writeback cache tier. I turned it off and flushed it last weekend, about the same time I started to notice this data loss. I disabled it using the instructions from here:

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

Basically, I set the cache-mode to "forward" and then flushed it:

ceph osd tier cache-mode ssd_cache forward
rados -p ssd_cache cache-flush-evict-all

After that I tried to remove the cache tier's overlay from the backing pool, but that failed (and still fails) with:

$ ceph osd tier remove-overlay cephfs_data
Error EBUSY: pool 'cephfs_data' is in use by CephFS via its tier

(The full documented teardown order is sketched in a postscript at the end of this mail.)

At that point I thought that, because I had set the cache-mode to "forward", it would be safe to just leave it as is until I had time to debug further.

I should mention that after the cluster settled down and did some scrubbing, there was one inconsistent PG. I ran a "ceph pg repair xxx" to resolve that and the health was good again.

I can do some experimenting this weekend if somebody wants to help me through it. Otherwise I'll probably try to put the cache tier back into "writeback" to see if that helps. If not, I'll recreate the entire Ceph cluster.

Thanks,
Blade.

P.S. My cluster is a mix of ARM and x86_64 nodes:

$ ceph version
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
# ceph version
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
etc...

PPS:

$ ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    2456G     1492G         839G         34.16
POOLS:
    NAME                ID     USED      %USED     MAX AVAIL     OBJECTS
    rbd                 0        139G     5.66          185G       36499
    cephfs_data         1        235G     9.59          185G      102883
    cephfs_metadata     2      33642k        0          185G        5530
    ssd_cache           4           0        0          370G           0
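PPPS: In case anyone wants to look for other affected files, here is a rough sketch of the kind of scan I mean (bash; the mount point is just an example from my layout, adjust for yours):

#!/bin/bash
# Rough sketch: walk a CephFS mount and report regular, non-empty files
# whose content is nothing but NUL bytes.  /space is an example mount
# point from my setup -- change it to match yours.
MOUNT=/space

find "$MOUNT" -type f -size +0c -print0 |
while IFS= read -r -d '' f; do
    # Strip every NUL byte; if nothing is left, the file is all zeroes.
    if [ "$(tr -d '\0' < "$f" | wc -c)" -eq 0 ]; then
        echo "all-NUL: $f"
    fi
done

For files that also exist under /backup, something like cmp "/backup$f" "$f" would then confirm whether the live copy really diverged from the backup.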
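PPPPS: For reference, this is the cache-tier teardown order as I read the cache-tiering doc linked above, with my pool names substituted in; take it as a sketch of what I was attempting rather than a verified procedure. The remove-overlay step is where I get the EBUSY, so the final command has never run on my cluster:

ceph osd tier cache-mode ssd_cache forward
rados -p ssd_cache cache-flush-evict-all
ceph osd tier remove-overlay cephfs_data      # <-- this is the step that fails with EBUSY for me
ceph osd tier remove cephfs_data ssd_cache    # never got this far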