Sorry, incompletely edited. Should be “I use moderate sized batches…”
The two threads per CPU thing also works if you want to keep some CPU available for queries. So a 4 CPU machine being indexed with 4 threads will have roughly 2 CPUs available for queries. Very roughly.

wunder

> On Nov 26, 2024, at 8:06 AM, Walter Underwood <wun...@wunderwood.org> wrote:
>
> Use multiple threads to send batches. I use two moderate sized batches and
> two threads per CPU. You can tune it until you see near 100% CPU utilization.
>
> Why two client threads per CPU? Roughly, one batch being processed by the CPU
> and one batch in flight over the network, so it is ready to be processed.
>
> Indexing is CPU-intensive, so once it approaches 100% utilization, it is
> maxed out.
>
> Add more CPUs to go faster.
>
> I doubt that messing with commits will make a meaningful difference. Use auto
> commit so the indexing threads aren’t waiting.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Nov 26, 2024, at 5:46 AM, ufuk yılmaz <uyil...@vivaldi.net.invalid> wrote:
>>
>> Hello Noah
>>
>> I remember a trick but I didn’t try it myself before. Turn off all soft and
>> hard commits and do a singular manual commit at the end. I don’t know if
>> it can work for the whole 40 million documents but it might speed up
>> indexing when done in large chunks.
>>
>> —ufuk
>>
>> —
>>
>>> On Nov 26, 2024, at 22:05, Noah Torp-Smith <n...@dbc.dk.invalid> wrote:
>>>
>>> Hello,
>>>
>>> We have a setup where we periodically index a solr “offline” and then copy
>>> the data folder to a storage location. When we then deploy our solrs to
>>> production, the containers then download that data folder to the right
>>> place in the file system before the solr server is started. After the solr
>>> is started, it is never updated, we just tear it down and replace on the
>>> next cycle.
>>>
>>> This works ok, but I was wondering if there are any tweaks one could apply
>>> to make the indexing go faster, when we know that there will be no searches
>>> during the time we are indexing? The corpus we are indexing is around 40
>>> million documents, and most of the time is spent on waiting for commits. We
>>> commit every 5 million documents. Does that sound reasonable? Should we
>>> commit more often? Or should we just commit at the end?
>>>
>>> I am aware that there is a lot of context I have not provided here. I am
>>> just looking for any advice I can get for this kind of setup.
>>>
>>> Kind regards,
>>> /Noah
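
For anyone who wants to try the "two client threads per CPU" approach from this thread, here is a minimal Python sketch, not a tested production loader. The URL, the collection name `mycollection`, the default batch size of 1000, and the function names are all assumptions for illustration; "moderate sized" was not quantified above. It posts JSON batches to Solr's update handler without requesting a commit, on the assumption that autoCommit is configured on the server.

```python
import json
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical endpoint: substitute your own host and collection name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def batched(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(batch, url=SOLR_UPDATE_URL):
    """POST one batch as a JSON array. No commit parameter is sent,
    so commits are left to the server's autoCommit settings."""
    req = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return len(batch)


def index_all(docs, solr_cpus, batch_size=1000, send=post_batch):
    """Send batches from 2 * solr_cpus threads; return documents sent.

    `solr_cpus` is the CPU count of the Solr machine, not the client.
    Each thread pulls its next batch only when it is ready to send it,
    so the whole corpus is never held in memory at once.
    """
    batches = batched(docs, batch_size)
    lock = threading.Lock()  # generators are not thread-safe

    def worker():
        sent = 0
        while True:
            with lock:
                batch = next(batches, None)
            if batch is None:
                return sent
            sent += send(batch)

    workers = 2 * solr_cpus
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        return sum(f.result() for f in futures)
```

Calling `index_all(doc_iterator, solr_cpus=4)` would run 8 sender threads. Lowering the thread count leaves CPU free for queries, as described at the top of this message. Watch CPU utilization on the Solr host and adjust the batch size and thread count from there.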