Starting in stable release Octopus 15.2.0 and continuing through Octopus 
15.2.6, there is a bug in RGW that could result in data loss. An immediate 
configuration work-around is available, and a fix is intended for Octopus 
15.2.7. [Note: the bug was first merged in a pre-stable release, Octopus 
15.1.0.]

This bug is triggered when a read of a large RGW object (i.e., one with at 
least one tail segment) takes longer than half the time specified in the 
configuration option rgw_gc_obj_min_wait (specified in seconds; 7200, i.e. 
2 hours, by default). With the default value, for example, any such read 
that takes longer than 1 hour triggers the bug. The bug causes the tail 
segments of the object that was read to be added to the RGW garbage 
collection queue, which will in turn cause them to be deleted after a 
period of time.
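
For reference, a minimal way to check the value currently in effect on a 
running RGW daemon is via its admin socket; the instance name below is 
only a placeholder for your own:

    # Query the running daemon's effective value (in seconds).
    # "client.rgw.gateway1" is a hypothetical instance name.
    ceph daemon client.rgw.gateway1 config get rgw_gc_obj_min_wait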

The configuration work-around is to set rgw_gc_obj_min_wait to a value 
(specified in seconds) large enough to exceed twice the duration of the 
longest read you expect. The downside of this configuration change is that 
it will delay GC of deleted objects and will tend to cause the GC queue to 
become longer. Given that there’s a finite amount of space for that queue, 
if it ever becomes full then tail segments will be deleted in-line with 
object removal operations, and that could degrade performance slightly.
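
As a concrete sketch, the option can be raised through the centralized 
config mechanism; the 86400-second value (24 hours) and the instance name 
are only examples, so choose a value suited to your longest expected reads:

    # Example only: raise the minimum GC wait to 24 hours (86400 seconds)
    # for a hypothetical RGW instance named client.rgw.gateway1.
    ceph config set client.rgw.gateway1 rgw_gc_obj_min_wait 86400

The same option can instead be set in ceph.conf under the appropriate 
[client.rgw.*] section; depending on your setup, a restart of the RGW 
daemons may be needed for the new value to take effect.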

The Octopus backport tracker is: https://tracker.ceph.com/issues/48331
The Octopus backport PR is: https://github.com/ceph/ceph/pull/38249
The master branch tracker, which has the history of the bug, is: 
https://tracker.ceph.com/issues/47866

Tracking down this bug was a group effort and many people participated. See the 
master branch tracker for that history. Thanks to everyone who helped out.

Eric

--
J. Eric Ivancich
he / him / his
Red Hat Storage
Ann Arbor, Michigan, USA