Definitely all up front. The entire premise of search is that we do as much work at index time as possible so that queries are fast. More importantly, the whole point of search is to discover which documents the user might want. If you don't index everything from the start, you would need a process like:
1. Determine which docs the user wants.
2. Index them.
3. Query the index.

But once you've done step 1 you can already just send those results to the user and skip the rest!

So with search you index everything you think any user might want, storing the location of each document at the same time (in a field). When you do your search, the result contains the IDs of the documents that seem relevant and the location you stored at index time (often a URL). Then you show that list of URLs to the user and they click on one (the classic 10 blue links as you see on Google). There are more complicated scenarios, and ways to make the display more useful for the user for sure, but that's the basic idea.

As for the size limit, it depends. Most of the limits come from the underlying hardware, and they depend on which metric you are measuring (doc count or size on disk), how much hardware you can afford, and what type of documents you are indexing. Lucene has a technical limitation of MAX_INT documents per physical index, but Solr allows you to query across multiple physical Lucene indexes, so that's not a problem. I had a client working with very small documents that indexed 450 billion of them, and another with full multi-page documents that had over a billion. If you think you might have anything like those levels, there's some significant work in setting up systems that large, and you may want to hire a consultant to avoid painful and costly missteps. (Hardware on Amazon for systems of that size costs many hundreds of thousands or more annually.)

-Gus

On Sun, Jan 29, 2023 at 10:19 AM marc nicole <mk1853...@gmail.com> wrote:

> Hello - I want to know whether it is common practice to index all the
> datasets from the start or the indexation should be performed when the data
> is being queried?
>
> Also, is there a size limit on the data to index into Solr?
>
> Thanks.

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
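
To make the index-everything-up-front idea from the reply concrete, here is a minimal toy sketch in Python. It is not Solr or Lucene code: the `TinyIndex` class, its method names, and the example.com URLs are all made up for illustration. It only shows the shape of the flow: the expensive work (tokenizing, building posting lists, storing each document's location) happens when a document is added, and a query just looks up terms and returns IDs plus the stored URL.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> IDs of docs containing it
        self.stored_url = {}              # doc ID -> location stored at index time

    def add(self, doc_id, url, text):
        # Index time: store the location and build the term -> doc mapping.
        self.stored_url[doc_id] = url
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query time: intersect the posting lists of all query terms.
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        # Return (id, stored location) pairs, the "blue links" to display.
        return [(doc_id, self.stored_url[doc_id]) for doc_id in sorted(hits)]


index = TinyIndex()
index.add(1, "https://example.com/a", "solr indexes documents up front")
index.add(2, "https://example.com/b", "lucene stores documents in segments")
results = index.search("documents solr")
```

Here `results` holds only document 1 with its stored URL, because it is the only document containing both query terms. A real engine adds analysis, relevance scoring, and on-disk structures, none of which this sketch attempts.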