Hi,

I'm using a traditional master/replica Solr (8.11) setup. I'm trying to tune the autoCommit maxTime and autoSoftCommit maxTime on the Solr master and the pollInterval on the replicas, to get better overall indexing throughput while still keeping indexing latency on the replicas acceptably low. The indexing latencies on the replicas are much longer than I would expect and I don't understand why, so I'm hoping someone here has some insight into the possible cause and what can be done about it.

On a test environment with a large amount of test data already indexed and replicated, I make one small update which causes a couple of documents in 3 Solr cores to be updated (one update request per core sent to Solr's API). The Solr master log file shows all three /update requests coming in at 13:10:30. The 3 indexing requests are all sent WITHOUT an explicit "commit=true" or "softCommit=true", i.e. only the auto commit max times specified in solrconfig.xml should affect when commits take place.

Currently the autoCommit maxTime is set to 20000 and the autoSoftCommit maxTime is 2000, but I have also tried higher autoCommit maxTime values with similarly confusing results. I have a pollInterval of 00:00:10 on the replica.

When I make the above index updates and issue search queries against the replica, it takes several minutes before I get a corresponding search hit from the replica. In some cases 3-4 minutes, sometimes a bit less.

I see the following strange behavior in the logs of the Solr replica. The replica seems to notice that something has changed after 25-26 seconds (which is fine, given an autoCommit maxTime of 20 seconds and a pollInterval of 10 seconds):

2024-06-20 13:10:56.768 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)
... most files being skipped, "Fetched and wrote" 15 files
2024-06-20 13:10:56.841 INFO (indexFetcher-81-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=225681) : 0 secs (null bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/xlcore/data/index.20240620131056059 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)

So far so good, but this is only one of the three cores that were updated at 13:10:30.

The second core is processed much later:

2024-06-20 13:11:12.370 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
2024-06-20 13:11:12.409 INFO (indexFetcher-89-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=281548) : 15 secs (18769 bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/defcore/data/index.20240620131056964 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)

and the third one even later:

2024-06-20 13:11:35.468 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
2024-06-20 13:11:35.498 INFO (indexFetcher-91-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false,bytesDownloaded=221332) : 26 secs (8512 bytes/sec) to NRTCachingDirectory(MMapDirectory@/data0/solr8/parentcore/data/index.20240620131109083 lockFactory=org.apache.lucene.store.NativeFSLockFactory@25fb1467; maxCacheMB=48.0 maxMergeSizeMB=4.0)

How can I get all updated cores to be replicated within a time frame of 1 autoCommit maxTime + 1 pollInterval, or at the very least 2 autoCommit maxTime + 1 pollInterval? Right now it looks like only one core is replicated, then nothing happens for 15-25 seconds, then another core is replicated, then nothing happens for another 15-25 seconds, and so on.

Kind regards,
Marcus
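
P.S. To make the timing arithmetic explicit, here is a small Python sketch that compares the "Starting download" timestamps from the replica log against the latency I was expecting (1 autoCommit maxTime + 1 pollInterval). The variable and function names are just illustrative, and the 13:10:30 start time is only accurate to the second because that is all I noted from the master log.

```python
from datetime import datetime

# Settings quoted in this post, converted to seconds.
AUTO_COMMIT_MAX_TIME_S = 20   # autoCommit maxTime = 20000 ms
POLL_INTERVAL_S = 10          # pollInterval = 00:00:10

# Time the three /update requests reached the master (second precision only).
update_time = datetime(2024, 6, 20, 13, 10, 30)

# "Starting download" timestamps per core, taken from the replica log above.
download_start = {
    "xlcore": datetime(2024, 6, 20, 13, 10, 56, 768000),
    "defcore": datetime(2024, 6, 20, 13, 11, 12, 370000),
    "parentcore": datetime(2024, 6, 20, 13, 11, 35, 468000),
}


def replication_delays(start, fetches):
    """Return seconds elapsed between the update and each core's fetch start."""
    return {core: (ts - start).total_seconds() for core, ts in fetches.items()}


# Latency I expected at worst: one hard commit interval plus one poll interval.
expected_worst_case = AUTO_COMMIT_MAX_TIME_S + POLL_INTERVAL_S
delays = replication_delays(update_time, download_start)

for core, delay in delays.items():
    status = "within" if delay <= expected_worst_case else "exceeds"
    print(f"{core}: {delay:.3f} s after the update "
          f"({status} the expected {expected_worst_case} s)")
```

Only xlcore starts replicating within the 30 seconds I expected; defcore and parentcore both start well after that, which is the gap I'm trying to understand.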