Thanks, Pierre and Thierry.

After quite some time poring over these debug logs, I've found some 
anomalies that seem consistent with the idea that the affected replica 
isn't updating its own RUV correctly.

The logs show a change being made, and they list the CSN of the change. The 
first anomalies are here, though they're probably not terribly significant. 
The CSN includes a timestamp, and the timestamp on this CSN is 11 hours in 
the future relative to when the change was made and logged. Also, the next 
field of the CSN is supposed to be a sequence number that disambiguates 
changes made within the same second. In the case I was looking at, that 
serial was 0xb231, and I'm certain this replica didn't record another 
45,000-odd changes in that second.
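
(For anyone following along, here's the quick Python sketch I've been using 
to split CSNs into their fields. It assumes the usual 20-hex-digit layout: 8 
hex digits of Unix timestamp, then 4 each for sequence number, replica ID, 
and subsequence number. The sample CSN at the end is made up.)

    from datetime import datetime, timezone

    def decode_csn(csn):
        ts = int(csn[0:8], 16)        # seconds since the Unix epoch
        seq = int(csn[8:12], 16)      # per-second sequence ("serial") number
        rid = int(csn[12:16], 16)     # replica ID
        subseq = int(csn[16:20], 16)  # subsequence number
        return datetime.fromtimestamp(ts, tz=timezone.utc), seq, rid, subseq

    # A made-up CSN carrying the 0xb231 serial; seq decodes to 45617.
    print(decode_csn("66a0b1c2b23100040000"))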

Then the logs show the server committing the change to the changelog. It 
logs "processing data" for over 16,000 other CSNs, taking about 25 seconds 
to complete.

It then starts a replication session with the peer and prints the peer's 
(consumer's) RUV followed by its own (supplier's) RUV. The RUV it prints for 
itself shows its own maxCSN with a timestamp from almost 4 months ago. That 
maxCSN is still slightly greater than its maxCSN in the consumer's RUV, 
however. (The replicagenerations are equal.)
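
(To compare the RUVs I've been decoding their CSN timestamps with something 
like the sketch below. The element format is assumed to be 
"{replica <id> <url>} <minCSN> <maxCSN>", and the sample value is made up.)

    import re
    from datetime import datetime, timezone

    def csn_time(csn):
        return datetime.fromtimestamp(int(csn[0:8], 16), tz=timezone.utc)

    def parse_ruv_element(value):
        m = re.match(r"\{replica (\d+) (\S+)\}(?:\s+(\w+))?(?:\s+(\w+))?",
                     value)
        if not m:
            return None  # e.g. the {replicageneration} element
        rid, url, mincsn, maxcsn = m.groups()
        info = {"rid": int(rid), "url": url}
        if mincsn:
            info["minCSN"] = (mincsn, csn_time(mincsn))
        if maxcsn:
            info["maxCSN"] = (maxcsn, csn_time(maxcsn))
        return info

    # Made-up RUV element for illustration:
    print(parse_ruv_element(
        "{replica 4 ldap://replica1.example.com:389} "
        "4a65b7c3000000040000 4a65b7e4000000040000"))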

It then claims to send 7 changes, all of which are skipped as "empty", 
reports "No more updates to send", releases the consumer, and eventually 
closes the connection.

I like the idea that there's a list of pending operations blocking RUV 
updates. Is there any way for me to examine that list? That said, I do think 
the replica advanced its own maxCSN in its own RUV by a few hours at some 
point. The peer I'm looking at does seem to reflect that increased maxCSN 
for the bad replica in the RUV I can see in the "mapping tree". I've tried 
to reproduce that small update, but haven't managed to yet.
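
(For what it's worth, this is roughly how I've been pulling the RUV off each 
server to compare, by reading the RUV tombstone entry under the replicated 
suffix. It needs python-ldap, and the host, suffix, and credentials here are 
placeholders for your own.)

    import ldap

    SUFFIX = "dc=example,dc=com"  # placeholder suffix
    conn = ldap.initialize("ldap://replica1.example.com:389")
    conn.simple_bind_s("cn=Directory Manager", "password")
    results = conn.search_s(
        SUFFIX,
        ldap.SCOPE_SUBTREE,
        "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)"
        "(objectClass=nsTombstone))",
        ["nsds50ruv"],
    )
    for dn, attrs in results:
        for val in attrs.get("nsds50ruv", []):
            print(val.decode())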

I also have another replica that seems to be experiencing the same problem, 
and I've restarted it with no improvement in symptoms. Its case might be 
different, though: it doesn't look like it discarded its changelog.

I definitely don't relish reinitializing from this bad replica, though. I'd 
have to perform a rolling reinitialization across our whole environment, 
which takes ages and a lot of effort.

-- 
William Faulk