Sorry, incompletely edited. Should be “I use moderate sized batches…” 

The two-threads-per-CPU rule also helps if you want to keep some CPU available 
for queries. A 4-CPU machine being indexed with 4 threads will have roughly 
2 CPUs available for queries. Very roughly.
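
A minimal Python sketch of the multi-threaded batch sender described in this thread, under stated assumptions rather than as a tested production client. The update URL, collection name, batch size, and the helper names (batches, post_batch, index_all) are all hypothetical. It assumes documents are posted as a JSON array to Solr's /update handler and that autoCommit is configured on the server, so the client never sends an explicit commit.

```python
import json
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical values; adjust for your own Solr host and collection.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def post_batch(batch, url=SOLR_UPDATE_URL):
    """POST one batch as a JSON array. No commit is requested here;
    the sketch assumes server-side autoCommit."""
    req = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return len(batch)


def index_all(docs, solr_cpus, threads_per_cpu=2, batch_size=1000,
              send=post_batch):
    """Send all docs using solr_cpus * threads_per_cpu client threads.
    Returns the number of documents sent."""
    batch_iter = batches(docs, batch_size)
    lock = threading.Lock()  # generators are not thread-safe

    def worker():
        sent = 0
        while True:
            with lock:
                batch = next(batch_iter, None)
            if batch is None:
                return sent
            sent += send(batch)

    n_threads = solr_cpus * threads_per_cpu
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(worker) for _ in range(n_threads)]
        return sum(f.result() for f in futures)
```

Each worker pulls the next batch only when it is ready to send, so at most one batch per thread is held in memory even for a very large corpus. To leave CPU free for queries, pass a smaller thread count, for example threads_per_cpu=1 on a 4-CPU machine.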

wunder

> On Nov 26, 2024, at 8:06 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> Use multiple threads to send batches. I use two moderate sized batches and 
> two threads per CPU. You can tune it until you see near 100% CPU utilization. 
> 
> Why two client threads per CPU? Roughly, one batch being processed by the CPU 
> and one batch in flight over the network, so it is ready to be processed.
> 
> Indexing is CPU-intensive, so once it approaches 100% utilization, it is 
> maxed out.
> 
> Add more CPUs to go faster. 
> 
> I doubt that messing with commits will make a meaningful difference. Use auto 
> commit so the indexing threads aren’t waiting.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 26, 2024, at 5:46 AM, ufuk yılmaz <uyil...@vivaldi.net.invalid> wrote:
>> 
>> Hello Noah
>> 
>> I remember a trick, though I haven’t tried it myself. Turn off all soft and 
>> hard commits and do a single manual commit at the end. I don’t know if 
>> it can work for the whole 40 million documents, but it might speed up 
>> indexing when done in large chunks. 
>> 
>> —ufuk
>> 
>>> On Nov 26, 2024, at 22:05, Noah Torp-Smith <n...@dbc.dk.invalid> wrote:
>>> 
>>> Hello,
>>> 
>>> We have a setup where we periodically index a solr “offline” and then copy 
>>> the data folder to a storage location. When we then deploy our solrs to 
>>> production, the containers then download that data folder to the right 
>>> place in the file system before the solr server is started. After the solr 
>>> is started, it is never updated; we just tear it down and replace it on the 
>>> next cycle.
>>> This works ok, but I was wondering if there are any tweaks one could apply 
>>> to make the indexing go faster, when we know that there will be no searches 
>>> during the time we are indexing? The corpus we are indexing is around 40 
>>> million documents, and most of the time is spent waiting for commits. We 
>>> commit every 5 million documents. Does that sound reasonable? Should we 
>>> commit more often? Or should we just commit at the end?
>>> 
>>> I am aware that there is a lot of context I have not provided here. I am 
>>> just looking for any advice I can get for this kind of setup.
>>> 
>>> Kind regards,
>>> /Noah
>> 
> 
