Hi,
Yes that's the mystery here. The replication time reported by Solr is very long 
but it doesn't look like it has done anything useful in that time. The slave is 
logging and comparing the index versions for all the cores (although the logs 
are kind of bad because it isn't clear which core is being compared) with one 
thread per Solr core doing such comparisons. The three cores in question look 
like this at the point where the Solr slave first notices something has 
changed. Remember the update on the master happened at 13:10:30 and hard commit 
time is configured to 20 seconds:

2024-06-20 13:10:55.999 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's generation: 15902449
2024-06-20 13:10:55.999 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's version: 1718881850712
2024-06-20 13:10:55.999 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's generation: 15902448
2024-06-20 13:10:55.999 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's version: 1718880403101
2024-06-20 13:10:55.999 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process
2024-06-20 13:10:56.059 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Number of files in latest index in leader: 1812
2024-06-20 13:10:56.749 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.

Second core below. Notice that the time where Solr realizes the index has 
changed is pretty close after the first core. This is right after the index 
updated has been downloaded for the first core

2024-06-20 13:10:56.948 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's generation: 51691
2024-06-20 13:10:56.949 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's version: 1718881853299
2024-06-20 13:10:56.949 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's generation: 51690
2024-06-20 13:10:56.949 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's version: 1718880406231
2024-06-20 13:10:56.949 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process
2024-06-20 13:10:56.964 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Number of files in latest index in leader: 561

Now the second IndexFetcher thread logs nothing. I would expect it to 
immediately start downloading the new parts of the index (just like it did with 
the first core), not be stuck doing seemingly nothing for 16 seconds. Next 
thing it logs is the one below and after that it proceeds ok and downloads the 
index update etc.

2024-06-20 13:11:12.364 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.

Third core takes longer to notice the index even has changed so here the 
problem may be that the master didn't actually flush the changes to disk in a 
quick enough manner. I cannot see from the master logs exactly when the hard 
commit/flush happens for each core. But here we see that 29 seconds after the 
Solr update request happened on the master the slave still thinks it is the old 
version on the master, but after 39 seconds the new version is available and it 
attempts to start replication.

2024-06-20 13:10:59.070 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's generation: 24269
2024-06-20 13:10:59.070 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's version: 1718880417250
2024-06-20 13:10:59.070 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's generation: 24269
2024-06-20 13:10:59.070 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's version: 1718880417250
2024-06-20 13:10:59.070 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower in sync with leader.
...
2024-06-20 13:11:09.073 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's generation: 24270
2024-06-20 13:11:09.073 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Leader's version: 1718881866063
2024-06-20 13:11:09.073 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's generation: 24269
2024-06-20 13:11:09.073 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Follower's version: 1718880417250
2024-06-20 13:11:09.073 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process
2024-06-20 13:11:09.083 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Number of files in latest index in leader: 270

Now the same thing as with the second core happens, namely that the thread 
seems to hang for a long time. In this case for 26 seconds before it proceeds 
to download the index update:

2024-06-20 13:11:35.465 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.

My current theory is that this issue is caused by poor mutex logic or maybe 
poor delete logic (or both). I believe this call in IndexFetcher hangs:

          // rollback and reopen index writer and wait until all unused files
          // are successfully deleted
          solrCore.getUpdateHandler().newIndexWriter(true);
          RefCounted<IndexWriter> writer = 
solrCore.getUpdateHandler().getSolrCoreState().getIndexWriter(null);

and that replication threads that operate on different cores block each other. 
I redid the same test with DEBUG log level on some key Java classes enabled and 
it resulted in this:

2024-06-23 23:15:55.999 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process
2024-06-23 23:15:56.957 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process
2024-06-23 23:16:00.660 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Number of files in latest index in leader: 561
2024-06-23 23:16:00.676 DEBUG (indexFetcher-89-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=aggasset
2024-06-23 23:16:00.676 DEBUG (indexFetcher-89-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2
2024-06-23 23:16:08.124 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Number of files in latest index in leader: 1812
2024-06-23 23:16:08.316 DEBUG (indexFetcher-81-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=xlcore
2024-06-23 23:16:08.316 DEBUG (indexFetcher-81-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2
2024-06-23 23:16:09.072 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process

Here you see all 3 threads have reached "Starting replication process" before 
any index files have been downloaded. Then one thread gets to download files 
and the other two have to wait.

2024-06-23 23:16:09.103 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.
2024-06-23 23:16:09.121 DEBUG (indexFetcher-81-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=xlcore
2024-06-23 23:16:09.121 DEBUG (indexFetcher-81-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2
...indexFetcher-81-thread-1 gets to download files. It is fairly quick though. 
Second and third thread are blocked from downloading until the first thread has 
reached "New IndexWriter is ready to be used" after downloading (takes one 
extra second for this core).

2024-06-23 23:16:09.470 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Number of files in latest index in leader: 270
2024-06-23 23:16:09.497 DEBUG (indexFetcher-91-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=parentcore
2024-06-23 23:16:09.498 DEBUG (indexFetcher-91-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2
2024-06-23 23:16:10.165 DEBUG (indexFetcher-81-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2
2024-06-23 23:16:10.165 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.

But then for unknown reasons it takes another 6 seconds before the next thread 
can start downloading:

2024-06-23 23:16:16.703 DEBUG (indexFetcher-89-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2
2024-06-23 23:16:16.703 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.
2024-06-23 23:16:16.707 DEBUG (indexFetcher-89-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=defcore
2024-06-23 23:16:16.707 DEBUG (indexFetcher-89-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2
...indexFetcher-89-thread-1 gets to download files. But here we have buggy 
behavior. The thread finishes at:
2024-06-23 23:16:16.752 DEBUG (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Moving file: segments_13vw

Then indexFetcher-89-thread-1 doesn't log anything for 15 seconds and then logs 
this:

2024-06-23 23:16:31.370 DEBUG (indexFetcher-89-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2
2024-06-23 23:16:31.370 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.

A wild guess here but maybe that was cache warming that took 15 seconds(??). 
But while doing that cache warming the third thread is blocked from replicating 
data. That definitely feels like a bug to me. A bit later the third thread 
finally gets to replicate data, but again an extra 5 seconds delay here that I 
cannot explain.

2024-06-23 23:16:36.100 DEBUG (indexFetcher-91-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2
2024-06-23 23:16:36.100 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.
2024-06-23 23:16:36.103 DEBUG (indexFetcher-91-thread-1) [   ] 
o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=parentcore
2024-06-23 23:16:36.103 DEBUG (indexFetcher-91-thread-1) [   ] 
o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2

Kind regards,

Marcus

________________________________
From: Marcus Bergner <marcus.berg...@vizrt.com>
Sent: Thursday, June 20, 2024 16:00
To: users@solr.apache.org <users@solr.apache.org>
Subject: Solr replication delays in IndexFetcher

Hi,
I'm using a traditional master/replica Solr (8.11) setup and I'm trying to tune 
Solr's autoCommitTimeout, autoSoftCommitTimeout on the Solr master and the 
pollInterval on the replicas to achieve an overall better indexing throughput 
while still maintaining an acceptably low indexing latency on the replicas. The 
indexing latencies on the replicas are much longer than I would expect and I 
don't understand why so I'm hoping someone here might have some insights on 
what the possible cause is and what can be done about it.

On a test environment with a large amount of test data already indexed and 
replicated I make one small update which cause a couple of documents in 3 Solr 
cores to be updated (one update request per core sent to Solr's API).
The Solr master log file shows all three /update requests coming in at 
13:10:30. The 3 indexing requests are all done WITHOUT explicitly specified 
"commit=true" or "softCommit=true". I.e. only the solrconfig.xml specified auto 
commit max times should affect when commits take place.
Currently the autoCommit maxTime is set to 20000 and the autoSoftCommit maxTime 
is 2000 but I have also tried higher autoCommit maxTime values with similarly 
confusing results.

I have a pollInterval of 00:00:10 on the replica. When making the above index 
updates and issuing search queries against the replica it takes several minutes 
before I get a corresponding search hit from the replica. In some cases 3-4 
minutes, sometimes a bit less.

I the following strange behavior in the logs of the Solr replica. Replica seems 
to notice something has changed after 25-26 seconds (ok assuming autoCommit 
maxTime is 20 seconds and pollInterval is 10 seconds)

2024-06-20 13:10:56.768 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting download (fullCopy=false) to 
NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)
... most files being skipped, "Fetched and wrote" 15 files
2024-06-20 13:10:56.841 INFO  (indexFetcher-81-thread-1) [   ] 
o.a.s.h.IndexFetcher Total time taken for download 
(fullCopy=false,bytesDownloaded=225681) : 0 secs (null bytes/sec) to 
NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)

So far so good, but this is only one of the three cores that was updated at 
13:10:30. The second core is processed much later:

2024-06-20 13:11:12.370 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting download (fullCopy=false) to 
NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
2024-06-20 13:11:12.409 INFO  (indexFetcher-89-thread-1) [   ] 
o.a.s.h.IndexFetcher Total time taken for download 
(fullCopy=false,bytesDownloaded=281548) : 15 secs (18769 bytes/sec) to 
NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)

and the third one even more later:

2024-06-20 13:11:35.468 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting download (fullCopy=false) to 
NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
2024-06-20 13:11:35.498 INFO  (indexFetcher-91-thread-1) [   ] 
o.a.s.h.IndexFetcher Total time taken for download 
(fullCopy=false,bytesDownloaded=221332) : 26 secs (8512 bytes/sec) to 
NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083
 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; 
maxCacheMB=48.0 maxMergeSizeMB=4.0)

How can I get all updated cores to be replicated within 1 autoCommit maxTime + 
1 pollInterval time frame, or at the very least 2 autoCommit maxTime + 1 
pollInterval? Right now it looks like only one core is being replicated, then 
there is 15-25 seconds of doing nothing, then replicating another core, 15-25 
seconds of doing nothing etc.

Kind regards,

Marcus

Reply via email to