On 2/29/24 21:31, William Faulk wrote:
Thanks, Pierre and Thierry.

After quite some time of poring over these debug logs, I've found some 
anomalies and they seem like they're matching up with the idea that the 
affected replica isn't updating its own RUV correctly.

The logs show a change being made, and it lists the CSN of the change. The 
first anomalies are here, but they probably aren't terribly significant. The 
CSN includes a timestamp, and the timestamp on this CSN is 11 hours into the 
future from when the change was made and logged. Also, the next part of the CSN 
is supposed to be a serial number for when there are changes made during the 
same second of the timestamp. In the case I was looking at, that serial was 
0xb231. I'm certain that this replica didn't record another 45000 changes in 
that second.
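
To make the fields discussed above concrete, here is a minimal sketch that splits a CSN string into its parts. It assumes the usual 20-hex-digit layout (8 digits of timestamp, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number). The sample CSN is hypothetical: only the 0xb231 sequence number comes from this thread, and the names `CSN` and `parse_csn` are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CSN:
    timestamp: int   # seconds since the Unix epoch
    seqnum: int      # counter for changes stamped with the same second
    replica_id: int  # ID of the replica that generated the change
    subseqnum: int   # sub-sequence number

    @property
    def time(self) -> datetime:
        """The timestamp field as a timezone-aware UTC datetime."""
        return datetime.fromtimestamp(self.timestamp, tz=timezone.utc)


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(text)}")
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


# Hypothetical CSN; only the b231 sequence number is taken from the thread.
csn = parse_csn("65e0f5a0b23100040000")
print(csn.time.isoformat(), csn.seqnum, csn.replica_id)
```

Comparing `csn.time` with the time the change was logged shows how far ahead the timestamp is, and `csn.seqnum` gives the serial in decimal.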

Hi William,

Are you running DS on a VM, in a container, or on hardware?
The fact that the CSN timestamp is some time in the future is not frequent, but it can happen. Generated CSNs should always be increasing, so the CSN generator adjusts its timestamp to the CSNs it receives. What looks weird is the size of the sequence number. Do you have a full error log sample where we can see the sequence number moving to such a high value (0xb231)?
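
As a rough illustration of the adjustment described above, here is a toy model of a CSN generator that never goes backwards. It is a sketch under stated assumptions, not the actual 389 DS code: the class `CSNGenerator`, its methods, and the rollover rule are all hypothetical.

```python
import time


class CSNGenerator:
    """Toy model (not the server's code) of a monotonic CSN generator."""

    def __init__(self, replica_id: int, clock=time.time):
        self.replica_id = replica_id
        self.clock = clock      # injectable so the sketch can be tested
        self.last_ts = 0
        self.last_seq = 0

    def observe_remote(self, remote_ts: int, remote_seq: int) -> None:
        """Pull local state forward when a received CSN is ahead of it."""
        if (remote_ts, remote_seq) > (self.last_ts, self.last_seq):
            self.last_ts, self.last_seq = remote_ts, remote_seq

    def next_csn(self) -> str:
        now = int(self.clock())
        if now > self.last_ts:
            # The local clock has moved past the last stamp: start a new second.
            self.last_ts, self.last_seq = now, 0
        else:
            # The local clock is at or behind the last stamp: keep the
            # timestamp and bump the sequence number instead.
            self.last_seq += 1
            if self.last_seq > 0xFFFF:  # assumed rollover rule
                self.last_ts += 1
                self.last_seq = 0
        return f"{self.last_ts:08x}{self.last_seq:04x}{self.replica_id:04x}0000"


gen = CSNGenerator(replica_id=4, clock=lambda: 1000)
gen.observe_remote(remote_ts=5000, remote_seq=7)  # a peer is "in the future"
print(gen.next_csn(), gen.next_csn())
```

In this model, a generator that has adopted a future timestamp keeps incrementing the sequence number until its own clock catches up, so a large sequence number does not require that many changes within one wall-clock second. Whether the server behaves this way in your case is exactly what the error log sample would show.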



Then it shows the server committing the change to the changelog. It shows it 
"processing data" for over 16000 other CSNs, and it takes about 25 seconds to 
complete.

It then starts a replication session with the peer and prints out the peer's 
(consumer's) RUV and then its own (supplier's) RUV. The RUV it prints out for 
itself shows the maxCSN for itself with a timestamp from almost 4 months ago. 
It is greater than the maxCSN for itself in the consumer's RUV, though, by a 
little. (The replicagenerations are equal, though.)
IIUC, the consumer is currently catching up. Is the RUV on the consumer evolving?
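
One way to check whether the consumer is catching up is to compare the maxCSN per replica ID in the two RUVs. The sketch below assumes RUV elements of the form `{replica <id> <url>} <minCSN> <maxCSN>`, as they commonly appear in the `nsds50ruv` attribute; the exact format in the debug log may differ. The helper names and the sample values are hypothetical.

```python
import re

# Assumed element format: "{replica <id> <url>} <minCSN> <maxCSN>"
RUV_ELEMENT = re.compile(
    r"\{replica (?P<rid>\d+)(?: (?P<url>\S+))?\}"
    r"(?: (?P<min>[0-9a-fA-F]{20}) (?P<max>[0-9a-fA-F]{20}))?"
)


def max_csns(ruv_values):
    """Map replica ID -> maxCSN for every RUV element that carries CSNs."""
    result = {}
    for value in ruv_values:
        m = RUV_ELEMENT.fullmatch(value.strip())
        if m and m.group("max"):
            result[int(m.group("rid"))] = m.group("max").lower()
    return result


def lagging_replica_ids(supplier_ruv, consumer_ruv):
    """Replica IDs whose maxCSN is newer on the supplier than on the consumer."""
    sup, con = max_csns(supplier_ruv), max_csns(consumer_ruv)
    # Fixed-width lowercase hex strings order the same way as the CSNs do.
    return sorted(rid for rid, csn in sup.items() if csn > con.get(rid, ""))


# Hypothetical RUVs, not taken from the logs in this thread.
supplier = [
    "{replicageneration} 5f000000000000010000",
    "{replica 4 ldap://a.example.com:389} 5f000001000000040000 654a0010000200040000",
]
consumer = [
    "{replicageneration} 5f000000000000010000",
    "{replica 4 ldap://a.example.com:389} 5f000001000000040000 654a0000000000040000",
]
print(lagging_replica_ids(supplier, consumer))
```

Running the comparison against RUVs captured a few minutes apart shows whether the consumer's maxCSN for a given replica ID is moving.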

It then claims to send 7 changes, all of which are skipped because "empty". It then 
claims that there are "No more updates to send" and releases the consumer and eventually 
closes the connection.
Do you have fractional replication configured (i.e., some attributes are excluded from replication)?

I like the idea that there's a list of pending operations that's blocking RUV updates. Is 
there any way for me to examine this list? That said, I do think it updated its own 
maxCSN in its own RUV by a few hours. The peer I'm looking at does seem to reflect the 
increased maxCSN for the bad replica in the RUV I can see in the "mapping 
tree". I've tried to reproduce this small update, but haven't been able to yet.
Difficult to say. The pending list likely has a different meaning in my understanding.

I also have another replica that seems to be experiencing the same problem, and 
I've restarted it with no improvement in symptoms. It might be different, 
though. It doesn't look like it discarded its changelog.

I definitely don't relish reinitializing from this bad replica, though. I'd 
have to perform a rolling reinitialization throughout our whole environment, 
and it takes ages and a lot of effort.

