Hi all,

Recently I had to replace an entire cluster that served as the secondary zone 
of the RGW realm and zonegroup "main-store". I removed the original zone and 
added the new one. Sync started (very slowly at first) and roughly 85% of the 
data has now been replicated to the new cluster.
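
For context, the zone swap was roughly along these lines (endpoints and keys 
redacted; this is a sketch of the procedure, not the literal command history):

# on the primary/master zone: remove the old secondary and commit the period
radosgw-admin zonegroup remove --rgw-zonegroup main-store --rgw-zone <old-secondary>
radosgw-admin zone delete --rgw-zone <old-secondary>
radosgw-admin period update --commit

# on the new cluster: pull the realm/period and register the new secondary zone
radosgw-admin realm pull --url <master-endpoint> --access-key <system-key> --secret <system-secret>
radosgw-admin zone create --rgw-zonegroup main-store --rgw-zone backup-store-ams \
    --endpoints <new-endpoints> --access-key <system-key> --secret <system-secret>
radosgw-admin period update --commit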

Data sync now seems to have stalled (based on the primary/master's logs, the 
secondary appears to request the same objects over and over), and the RGW sync 
instance is spitting out these errors:

debug 2025-11-05T15:01:57.165+0000 7fd9d6929640 -1 rgw data changes log: int 
rgw::cls::fifo::FIFO::push(const DoutPrefixProvider*, const 
std::vector<ceph::buffer::v15_2_0::list>&, optional_yield):1446 canceled too 
many times, giving up: tid=88894
debug 2025-11-05T15:01:57.165+0000 7fd9d6929640 -1 rgw data changes log: 
virtual int RGWDataChangesFIFO::push(const DoutPrefixProvider*, int, 
RGWDataChangesBE::entries&&, optional_yield):
 unable to push to FIFO: data_log.36: (125) Operation canceled
debug 2025-11-05T15:01:57.165+0000 7fd9d6929640 -1 rgw data changes log: ERROR: 
svc.cls->timelog.add() returned -125
debug 2025-11-05T15:01:57.165+0000 7fd9d6929640  0 rgw data changes log: ERROR: 
RGWDataChangesLog::renew_entries returned error r=-125

and

debug 2025-11-05T15:01:42.638+0000 7fd9e5146640 -1 rgw async rados processor: 
virtual int RGWDataChangesFIFO::push(const DoutPrefixProvider*, int, 
ceph::real_time, const string&, ceph::buffer::v15_2_0::list&&, optional_yield): 
unable to push to FIFO: data_log.36: (125) Operation canceled
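
For reference, I'm judging the stall from the sync status on the secondary, 
with commands along these lines (<master-zone> being the primary zone's name; 
output omitted here):

radosgw-admin sync status --rgw-realm main-store
radosgw-admin data sync status --source-zone <master-zone>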

I figured out this has to do with datalog shard 36, and indeed I cannot get 
its status:

bash-5.1$ radosgw-admin datalog status --rgw-realm main-store --shard-id 36
[
    {
        "marker": "",
        "last_update": "0.000000"
    }
]
2025-11-05T15:04:10.622+0000 7f3bbf941900 -1 int 
rgw::cls::fifo::{anonymous}::get_part_info(const DoutPrefixProvider*, 
librados::v14_2_0::IoCtx&, const string&, rados::cls::fifo::part_header*, 
uint64_t, optional_yield):328 fifo::op::GET_PART_INFO failed r=-2 tid=2
2025-11-05T15:04:10.622+0000 7f3bbf941900 -1 int 
rgw::cls::fifo::FIFO::get_part_info(const DoutPrefixProvider*, int64_t, 
rados::cls::fifo::part_header*, optional_yield):2023 get_part_info failed: r=-2 
tid=2
2025-11-05T15:04:10.622+0000 7f3bbf941900 -1 virtual int 
RGWDataChangesFIFO::get_info(const DoutPrefixProvider*, int, 
RGWDataChangesLogInfo*, optional_yield): unable to get part info: 
data_log.36/6: (2) No such file or directory

When checking the rados pool I do see a "data_log.36" object:

bash-5.1$ rados -p backup-store-ams.rgw.log ls | grep -E 'data_log.36'
data_log.36

And its size is identical to a neighboring shard object that works fine:
bash-5.1$ rados -p backup-store-ams.rgw.log stat data_log.36 # Bad
backup-store-ams.rgw.log/data_log.36 mtime 2025-10-25T23:47:04.000000+0000, 
size 156
bash-5.1$ rados -p backup-store-ams.rgw.log stat data_log.35 # Good
backup-store-ams.rgw.log/data_log.35 mtime 2025-11-01T12:27:43.000000+0000, 
size 156
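
If I'm reading the FIFO code correctly, data_log.36 itself is only the FIFO 
meta/head object and the actual log entries live in separate part objects 
(named data_log.<shard>.<part>, if I understand the naming right), so the 
ENOENT above should mean part 6 of shard 36 is gone. A way to compare a 
healthy shard against the broken one would be:

rados -p backup-store-ams.rgw.log ls | grep -E '^data_log\.35\.'   # good shard
rados -p backup-store-ams.rgw.log ls | grep -E '^data_log\.36\.'   # bad shard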

I should add that the cluster currently has a "13 large omap objects" warning 
active. I had assumed this was just part of the full resync, with (re)sharding 
only happening after it completes, but it's unclear whether this error is 
related to that warning at all.
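
For completeness, the objects behind that warning should be identifiable via 
health detail plus the deep-scrub reports in the cluster log (the log path 
depends on the deployment):

ceph health detail
grep 'Large omap object found' /var/log/ceph/ceph.log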

I've been going through the Ceph FIFO source but haven't been able to pinpoint 
the problem. The primary cluster is still running the OMAP datalog type.

Any help/insights would be greatly appreciated!

Daan
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
