And make sure you can always reindex the entire data set at any given moment. Solr/search isn't meant to be a data store, nor is it meant to be your reliable system of record. It should be able to be destroyed and recreated whenever needed.
> On Jan 29, 2023, at 1:53 PM, marc nicole <mk1853...@gmail.com> wrote:
>
> So to sum up, it's indexing at data-storing time, right?
> Much appreciated.
>
>> On Sun, Jan 29, 2023 at 17:59, Gus Heck <gus.h...@gmail.com> wrote:
>>
>> Definitely all up front. The entire premise of search is that we do as much
>> work at index time as possible so that queries are fast. More importantly,
>> the whole point of search is to discover which documents the user might
>> want. If you didn't index everything from the start, you would need a
>> process like:
>>
>> 1. Determine which docs the user wants.
>> 2. Index them.
>> 3. Query the index.
>>
>> But once you've done step 1 you can already just send those results to the
>> user and skip the rest! So with search you index everything you think any
>> user might want, storing the location of each document (in a field) at the
>> same time. When you do your search, the result contains the ids of the
>> documents that seem relevant and the location you stored at index time
>> (often a URL). Then you show that list of URLs to the user and they click
>> on one (the classic ten blue links as you see on Google). There are more
>> complicated scenarios, and ways to make the display more useful for the
>> user, for sure, but that's the basic idea.
>>
>> As for a size limit, it depends. Most of the limits are derived from the
>> underlying hardware: what metric you are measuring (doc count or size on
>> disk), how much hardware you can afford, and what type of documents you
>> are indexing. Lucene has a technical limitation of MAX_INT documents per
>> physical index, but Solr allows you to query across multiple physical
>> Lucene indexes, so that's not a problem. I had a client working with very
>> small documents that indexed 450 billion of them, and another with full
>> multi-page documents that had over a billion.
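The index-up-front flow described above can be sketched with a toy in-memory inverted index (plain Python, not Solr; the `id` and `url` field names are illustrative). All of the tokenization work happens once at add time, and a query is just a dictionary lookup that returns the matching ids plus the stored URLs:

```python
from collections import defaultdict

class ToyIndex:
    """Minimal inverted index: heavy work at index time, cheap lookups at query time."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids containing it
        self.stored = {}                  # doc id -> stored fields (e.g. the URL)

    def add(self, doc_id, text, url):
        # Index time: tokenize/normalize once, and store the location field.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.stored[doc_id] = {"url": url}

    def search(self, query):
        # Query time: intersect the postings lists for each query term,
        # then return ids along with the URL stored at index time.
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [(doc_id, self.stored[doc_id]["url"]) for doc_id in sorted(ids)]

idx = ToyIndex()
idx.add(1, "Solr index time work", "http://example.com/a")
idx.add(2, "query time work", "http://example.com/b")
print(idx.search("time work"))  # both docs match both terms
print(idx.search("solr"))       # only doc 1 contains "solr"
```

A real Solr schema does the same thing conceptually: analyzed fields feed the inverted index, while `stored="true"` fields hold whatever you need to show the user (such as the URL).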
>> If you think you might have anything like those levels, there's some
>> significant work in setting up systems that large, and you may want to
>> hire a consultant to avoid painful and costly missteps. (Hardware on
>> Amazon for systems of that size costs many hundreds of thousands of
>> dollars or more annually.)
>>
>> -Gus
>>
>>> On Sun, Jan 29, 2023 at 10:19 AM marc nicole <mk1853...@gmail.com> wrote:
>>>
>>> Hello - I want to know whether it is common practice to index all the
>>> datasets from the start, or whether the indexing should be performed
>>> when the data is being queried?
>>> Also, is there a size limit on the data to index into Solr?
>>> Thanks.
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
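A quick back-of-the-envelope on the MAX_INT limit mentioned above: a single physical Lucene index tops out at 2^31 - 1 documents, so a corpus like the 450-billion-document one needs at least ~210 shards on document count alone. (Real deployments use more shards than this floor, for headroom and query performance; this is illustrative arithmetic only.)

```python
import math

LUCENE_MAX_DOCS = 2**31 - 1  # MAX_INT: hard per-index document ceiling in Lucene

def min_shards(total_docs):
    """Smallest number of physical Lucene indexes that can hold total_docs."""
    return math.ceil(total_docs / LUCENE_MAX_DOCS)

print(min_shards(450_000_000_000))  # 450 billion tiny docs -> at least 210 shards
print(min_shards(1_000_000_000))    # 1 billion docs fits in a single shard
```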