Hi, Yes that's the mystery here. The replication time reported by Solr is very long but it doesn't look like it has done anything useful in that time. The slave is logging and comparing the index versions for all the cores (although the logs are kind of bad because it isn't clear which core is being compared) with one thread per Solr core doing such comparisons. The three cores in question look like this at the point where the Solr slave first notices something has changed. Remember the update on the master happened at 13:10:30 and hard commit time is configured to 20 seconds:
2024-06-20 13:10:55.999 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Leader's generation: 15902449 2024-06-20 13:10:55.999 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Leader's version: 1718881850712 2024-06-20 13:10:55.999 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Follower's generation: 15902448 2024-06-20 13:10:55.999 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Follower's version: 1718880403101 2024-06-20 13:10:55.999 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process 2024-06-20 13:10:56.059 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in leader: 1812 2024-06-20 13:10:56.749 INFO (indexFetcher-81-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. Second core below. Notice that the time where Solr realizes the index has changed is pretty close after the first core. This is right after the index updated has been downloaded for the first core 2024-06-20 13:10:56.948 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Leader's generation: 51691 2024-06-20 13:10:56.949 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Leader's version: 1718881853299 2024-06-20 13:10:56.949 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Follower's generation: 51690 2024-06-20 13:10:56.949 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Follower's version: 1718880406231 2024-06-20 13:10:56.949 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process 2024-06-20 13:10:56.964 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in leader: 561 Now the second IndexFetcher thread logs nothing. I would expect it to immediately start downloading the new parts of the index (just like it did with the first core), not be stuck doing seemingly nothing for 16 seconds. Next thing it logs is the one below and after that it proceeds ok and downloads the index update etc. 2024-06-20 13:11:12.364 INFO (indexFetcher-89-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. Third core takes longer to notice the index even has changed so here the problem may be that the master didn't actually flush the changes to disk in a quick enough manner. I cannot see from the master logs exactly when the hard commit/flush happens for each core. But here we see that 29 seconds after the Solr update request happened on the master the slave still thinks it is the old version on the master, but after 39 seconds the new version is available and it attempts to start replication. 2024-06-20 13:10:59.070 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Leader's generation: 24269 2024-06-20 13:10:59.070 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Leader's version: 1718880417250 2024-06-20 13:10:59.070 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Follower's generation: 24269 2024-06-20 13:10:59.070 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Follower's version: 1718880417250 2024-06-20 13:10:59.070 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Follower in sync with leader. ... 2024-06-20 13:11:09.073 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Leader's generation: 24270 2024-06-20 13:11:09.073 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Leader's version: 1718881866063 2024-06-20 13:11:09.073 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Follower's generation: 24269 2024-06-20 13:11:09.073 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Follower's version: 1718880417250 2024-06-20 13:11:09.073 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process 2024-06-20 13:11:09.083 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in leader: 270 Now the same thing as with the second core happens, namely that the thread seems to hang for a long time. In this case for 26 seconds before it proceeds to download the index update: 2024-06-20 13:11:35.465 INFO (indexFetcher-91-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. My current theory is that this issue is caused by poor mutex logic or maybe poor delete logic (or both). I believe this call in IndexFetcher hangs: // rollback and reopen index writer and wait until all unused files // are successfully deleted solrCore.getUpdateHandler().newIndexWriter(true); RefCounted<IndexWriter> writer = solrCore.getUpdateHandler().getSolrCoreState().getIndexWriter(null); and that replication threads that operate on different cores block each other. I redid the same test with DEBUG log level on some key Java classes enabled and it resulted in this: 2024-06-23 23:15:55.999 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process 2024-06-23 23:15:56.957 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process 2024-06-23 23:16:00.660 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in leader: 561 2024-06-23 23:16:00.676 DEBUG (indexFetcher-89-thread-1) [ ] o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=aggasset 2024-06-23 23:16:00.676 DEBUG (indexFetcher-89-thread-1) [ ] o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2 2024-06-23 23:16:08.124 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in leader: 1812 2024-06-23 23:16:08.316 DEBUG (indexFetcher-81-thread-1) [ ] o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=xlcore 2024-06-23 23:16:08.316 DEBUG (indexFetcher-81-thread-1) [ ] o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2 2024-06-23 23:16:09.072 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process Here you see all 3 threads have reached "Starting replication process" before any index files have been downloaded. Then one thread gets to download files and the other two have to wait. 2024-06-23 23:16:09.103 INFO (indexFetcher-81-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. 2024-06-23 23:16:09.121 DEBUG (indexFetcher-81-thread-1) [ ] o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=xlcore 2024-06-23 23:16:09.121 DEBUG (indexFetcher-81-thread-1) [ ] o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2 ...indexFetcher-81-thread-1 gets to download files. It is fairly quick though. Second and third thread are blocked from downloading until the first thread has reached "New IndexWriter is ready to be used" after downloading (takes one extra second for this core). 2024-06-23 23:16:09.470 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in leader: 270 2024-06-23 23:16:09.497 DEBUG (indexFetcher-91-thread-1) [ ] o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=parentcore 2024-06-23 23:16:09.498 DEBUG (indexFetcher-91-thread-1) [ ] o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2 2024-06-23 23:16:10.165 DEBUG (indexFetcher-81-thread-1) [ ] o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2 2024-06-23 23:16:10.165 INFO (indexFetcher-81-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. But then for unknown reasons it takes another 6 seconds before the next thread can start downloading: 2024-06-23 23:16:16.703 DEBUG (indexFetcher-89-thread-1) [ ] o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2 2024-06-23 23:16:16.703 INFO (indexFetcher-89-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. 2024-06-23 23:16:16.707 DEBUG (indexFetcher-89-thread-1) [ ] o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=defcore 2024-06-23 23:16:16.707 DEBUG (indexFetcher-89-thread-1) [ ] o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2 ...indexFetcher-89-thread-1 gets to download files. But here we have buggy behavior. The thread finishes at: 2024-06-23 23:16:16.752 DEBUG (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Moving file: segments_13vw Then indexFetcher-89-thread-1 doesn't log anything for 15 seconds and then logs this: 2024-06-23 23:16:31.370 DEBUG (indexFetcher-89-thread-1) [ ] o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2 2024-06-23 23:16:31.370 INFO (indexFetcher-89-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. A wild guess here but maybe that was cache warming that took 15 seconds(??). But while doing that cache warming the third thread is blocked from replicating data. That definitely feels like a bug to me. A bit later the third thread finally gets to replicate data, but again an extra 5 seconds delay here that I cannot explain. 2024-06-23 23:16:36.100 DEBUG (indexFetcher-91-thread-1) [ ] o.a.s.u.SolrIndexWriter Opened Writer DirectUpdateHandler2 2024-06-23 23:16:36.100 INFO (indexFetcher-91-thread-1) [ ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used. 2024-06-23 23:16:36.103 DEBUG (indexFetcher-91-thread-1) [ ] o.a.s.u.DefaultSolrCoreState Rollback old IndexWriter... core=parentcore 2024-06-23 23:16:36.103 DEBUG (indexFetcher-91-thread-1) [ ] o.a.s.u.SolrIndexWriter Rollback Writer DirectUpdateHandler2 Kind regards, Marcus ________________________________ From: Marcus Bergner <marcus.berg...@vizrt.com> Sent: Thursday, June 20, 2024 16:00 To: users@solr.apache.org <users@solr.apache.org> Subject: Solr replication delays in IndexFetcher Hi, I'm using a traditional master/replica Solr (8.11) setup and I'm trying to tune Solr's autoCommitTimeout, autoSoftCommitTimeout on the Solr master and the pollInterval on the replicas to achieve an overall better indexing throughput while still maintaining an acceptably low indexing latency on the replicas. The indexing latencies on the replicas are much longer than I would expect and I don't understand why so I'm hoping someone here might have some insights on what the possible cause is and what can be done about it. On a test environment with a large amount of test data already indexed and replicated I make one small update which cause a couple of documents in 3 Solr cores to be updated (one update request per core sent to Solr's API). The Solr master log file shows all three /update requests coming in at 13:10:30. The 3 indexing requests are all done WITHOUT explicitly specified "commit=true" or "softCommit=true". I.e. only the solrconfig.xml specified auto commit max times should affect when commits take place. Currently the autoCommit maxTime is set to 20000 and the autoSoftCommit maxTime is 2000 but I have also tried higher autoCommit maxTime values with similarly confusing results. I have a pollInterval of 00:00:10 on the replica. When making the above index updates and issuing search queries against the replica it takes several minutes before I get a corresponding search hit from the replica. In some cases 3-4 minutes, sometimes a bit less. I the following strange behavior in the logs of the Solr replica. Replica seems to notice something has changed after 25-26 seconds (ok assuming autoCommit maxTime is 20 seconds and pollInterval is 10 seconds) 2024-06-20 13:10:56.768 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0) ... most files being skipped, "Fetched and wrote" 15 files 2024-06-20 13:10:56.841 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=225681) : 0 secs (null bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0) So far so good, but this is only one of the three cores that was updated at 13:10:30. The second core is processed much later: 2024-06-20 13:11:12.370 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0) ... 2024-06-20 13:11:12.409 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=281548) : 15 secs (18769 bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0) and the third one even more later: 2024-06-20 13:11:35.468 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0) ... 2024-06-20 13:11:35.498 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=221332) : 26 secs (8512 bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0) How can I get all updated cores to be replicated within 1 autoCommit maxTime + 1 pollInterval time frame, or at the very least 2 autoCommit maxTime + 1 pollInterval? Right now it looks like only one core is being replicated, then there is 15-25 seconds of doing nothing, then replicating another core, 15-25 seconds of doing nothing etc. Kind regards, Marcus