And make sure you can always reindex the entire data set at any given moment. 
Solr/search isn't meant to be a data store, and it shouldn't be treated as 
durable. It should be possible to destroy and recreate the index whenever needed. 
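The two ideas in this thread — do the work at index time (including storing each document's location in a field, so a search returns ids and URLs directly), and treat the index as disposable because it can always be rebuilt from the source of truth — can be sketched with a toy in-memory index. This is an illustration only, not Solr's or Lucene's actual API; the data store, document texts, and URLs below are made up for the example.

```python
from collections import defaultdict

# Hypothetical source-of-truth data store: doc id -> (text, url).
# In a real system this would be a database, not the search index.
DATASTORE = {
    1: ("solr indexes documents up front", "http://example.com/a"),
    2: ("queries against the index are fast", "http://example.com/b"),
    3: ("the index can be rebuilt whenever needed", "http://example.com/c"),
}

def build_index(store):
    """Index time: tokenize every document and store its URL in a field.

    All the expensive work happens here, once, so queries are cheap.
    """
    inverted = defaultdict(set)  # term -> set of doc ids
    stored_fields = {}           # doc id -> url (the "stored field")
    for doc_id, (text, url) in store.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)
        stored_fields[doc_id] = url
    return inverted, stored_fields

def search(inverted, stored_fields, term):
    """Query time: a cheap lookup returning (doc id, url) pairs."""
    return sorted((doc_id, stored_fields[doc_id])
                  for doc_id in inverted.get(term.lower(), ()))

# The index is disposable: build it, throw it away, build it again
# from the data store -- the store, not the index, is authoritative.
index, fields = build_index(DATASTORE)
print(search(index, fields, "index"))
# -> [(2, 'http://example.com/b'), (3, 'http://example.com/c')]
```

A real Solr deployment follows the same shape: documents (with a stored URL field) are indexed up front, queries return the stored fields for relevant docs, and the whole collection can be dropped and reindexed from the system of record.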

> On Jan 29, 2023, at 1:53 PM, marc nicole <mk1853...@gmail.com> wrote:
> 
> so to sum up, indexing happens at data-storage time, right?
> Much appreciated.
> 
>> Le dim. 29 janv. 2023 à 17:59, Gus Heck <gus.h...@gmail.com> a écrit :
>> 
>> Definitely all up front. The entire premise of search is that we do as much
>> work at index time as possible so that queries are fast. More importantly,
>> the whole point of the search is to discover what documents the user might
>> want. If you don't index everything from the start you would need a process
>> like:
>> 
>> 1. Determine which docs the user wants.
>> 2. Index them.
>> 3. Query the index.
>> 
>> But once you've done step 1 you can already just send those results to the
>> user and skip the rest! So with search you index everything you think any
>> user might want, storing the location of each document at the same time
>> (in a field). When you run your search, the result contains the ids of the
>> documents that seem relevant along with the location you stored at index
>> time (often a URL). Then you show that list of URLs to the user and they
>> click on one (the classic 10 blue links you see on Google). There are more
>> complicated scenarios, and ways to make the display more useful for the
>> user for sure, but that's the basic idea.
>> 
>> As for a size limit, it depends. Most of the limits derive from the
>> underlying hardware and depend on which metric you are measuring (doc count
>> or size on disk), how much hardware you can afford, and what type of
>> documents you are indexing. Lucene has a technical limit of MAX_INT
>> documents per physical index, but Solr allows you to query across multiple
>> physical Lucene indexes, so that's not a problem in practice. I had a
>> client working with very small documents that indexed 450 billion of them,
>> and another with full multi-page documents that had over a billion. If you
>> think you might reach anything like those levels, there's significant work
>> in setting up systems that large, and you may want to hire a consultant to
>> avoid painful and costly missteps. (Hardware on Amazon for systems of that
>> size costs many hundreds of thousands of dollars or more annually.)
>> 
>> -Gus
>> 
>>> On Sun, Jan 29, 2023 at 10:19 AM marc nicole <mk1853...@gmail.com> wrote:
>>> 
>>> Hello - I want to know whether it is common practice to index all the
>>> datasets from the start, or whether indexing should be performed when the
>>> data is being queried?
>>> Also, is there a size limit on the data to index into Solr?
>>> Thanks.
>>> 
>> 
>> 
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>> 