Dear Ceph team,

On July 13th at 4:55 AM, our data center suffered a significant power outage, causing a large number of OSDs in our Ceph cluster to power off and restart (total: 1172, down: 821). Approximately two hours later, all OSDs had started successfully and the cluster resumed service. However, around 6 PM, the business department reported that some files which had previously been written successfully (via the RGW service) were failing to download, and the number of such files was quite significant. Consequently, we began a series of investigations:


1. The incident occurred at 04:55. At 05:01, we set the noout, nobackfill, and 
norecover flags. At 06:22, we executed `ceph osd pause`. By 07:23, all OSDs were 
up and in, and we then executed `ceph osd unpause`.


2. We randomly selected a problematic file and attempted to download it via the 
S3 API. The RGW returned "No such key".


3. The RGW download logs showed op status=-2, http_status=200. We also checked 
the upload logs, which showed 2024-07-13 04:19:20.052, op status=0, 
http_status=200, i.e. the upload itself had succeeded.
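The mismatch between op status and HTTP status was itself a useful signal. A small sketch of how such lines can be separated programmatically (the log fragments below are hypothetical, with field names assumed from the lines quoted above):

```python
import re

# Hypothetical log fragments, modeled on the RGW lines quoted above.
lines = [
    "2024-07-13 04:19:20.052 ... op status=0, http_status=200",   # upload succeeded
    "2024-07-13 18:05:11.000 ... op status=-2, http status=200",  # download: ENOENT
]

def parse_status(line):
    """Extract op status and HTTP status, tolerating both the
    'http status' and 'http_status' spellings seen in the logs."""
    op = re.search(r"op status=(-?\d+)", line)
    http = re.search(r"http[_ ]status=(\d+)", line)
    return int(op.group(1)), int(http.group(1))

for line in lines:
    op, http = parse_status(line)
    # op status -2 is -ENOENT: RGW could not read a rados object backing
    # the file, even though the request was logged with HTTP status 200.
    print(op, http)
```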


4. We set debug_rgw=20 and attempted to download the file again. We found that 
a single 4 MB chunk of this 64 MB file failed to be read.
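For context on the chunking: RGW stripes large objects into rados objects of rgw_obj_stripe_size (4 MiB by default), so one missing stripe breaks the entire download. A rough sketch of the arithmetic, assuming the default stripe size (the actual head/tail rados object naming differs):

```python
MiB = 1 << 20

def stripe_offsets(obj_size, stripe_size=4 * MiB):
    """Byte offsets of the stripes RGW splits an object into,
    assuming the default rgw_obj_stripe_size of 4 MiB."""
    return list(range(0, obj_size, stripe_size))

offsets = stripe_offsets(64 * MiB)
# A 64 MiB object maps to 16 rados objects; losing any one of them
# makes the whole S3 GET fail partway through.
print(len(offsets))  # 16
```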


5. `rados get` on this chunk returned: "No such file or directory".


6. Setting debug_osd=20, we observed `get_object_context: obc NOT found in cache`.


7. Setting debug_bluestore=20, we saw `get_onode oid xxx, key xxx != 
'0xfffffffffffffffeffffffffffffffff'o'`.


8. We stopped the primary OSD and tried to get the file again, but the result 
was the same. The object’s corresponding PG state was 
active+recovery_wait+degraded.


9. Using `ceph-objectstore-tool --op list` and `--op log`, we could not find 
the object's information. `ceph-kvstore-tool rocksdb` commands also did not 
reveal anything new.


10. If an OSD had lost data, we would expect the PG state to show unfound 
objects or be marked inconsistent.


11. We began reanalyzing the startup logs of the OSDs related to the PG. The 
pool uses erasure coding (k=6, m=3) across 9 OSDs. Six of these nine OSDs had 
restarted, and after peering the PG state became active.
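Our rough understanding of the availability math for this pool (a simplified sketch, not Ceph's actual peering logic, which also compares PG logs across shards):

```python
def pg_readable(k, m, surviving_shards):
    """An EC k+m PG can reconstruct data as long as at least k of its
    k+m shards survive (simplified; real peering also picks an
    authoritative PG log, which decides which object versions exist)."""
    assert surviving_shards <= k + m
    return surviving_shards >= k

k, m = 6, 3
print(pg_readable(k, m, 9))  # True: all shards back up, PG went active
print(pg_readable(k, m, 6))  # True: exactly k shards is still enough
print(pg_readable(k, m, 5))  # False: below k, data is unreadable
```

This is why the active PG state was so confusing to us: by this arithmetic the data should have been fully recoverable, yet individual rados objects were simply gone.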


12. We went through the lost files: all of them had been uploaded before the 
failure occurred. The earliest upload was around 1 AM, and the successful 
upload records can be found in the RGW log.
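The timeline split above can be sketched as follows (the object names and all timestamps except 04:55 are hypothetical illustrations):

```python
from datetime import datetime

OUTAGE = datetime(2024, 7, 13, 4, 55)  # time of the power failure

# Hypothetical upload records: (object name, upload completion time).
uploads = [
    ("obj-a", datetime(2024, 7, 13, 1, 0, 0)),
    ("obj-b", datetime(2024, 7, 13, 4, 19, 20)),
    ("obj-c", datetime(2024, 7, 13, 7, 30, 0)),
]

# Every lost file fell on the "before the outage" side of this split,
# which is what makes the loss surprising: the writes had been acked.
before = [name for name, t in uploads if t < OUTAGE]
print(before)  # ['obj-a', 'obj-b']
```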


13. We have submitted an issue on the Ceph issue 
tracker: https://tracker.ceph.com/issues/66942, which includes the original 
logs needed for troubleshooting. However, four days have passed without any 
response. In desperation, we are sending this email, hoping that someone from 
the Ceph team can guide us as soon as possible.


We are currently in a difficult situation and hope you can provide guidance. 
Thank you.



Best regards.





wu_chu...@qq.com
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
