Solr is not a "good" primary data store. Solr is built for finding your documents, not storing them. A good primary data store holds stuff indefinitely without adding weight and without changing regularly, Solr doesn't fit that description. One of the biggest reasons for this is that at some point you'll want to upgrade to the latest version, and we only support a single interim upgrade. So from 6 to 7 or 7 to 8 etc... multiple step upgrades 6 to 7 to 8 may fail. I seem to recall hearing that this was actually enforced by the code but I didn't find the check on a quick look through the code (doesn't mean it isn't there, just I didn't find it). In any case, multi-version upgrades are not generally supported, intentionally so that we can be free to make improvements without having to carry an ever growing weight of back compatibility. Typically if new index features are developed and you want to use them (like when doc values were introduced) you will need to re-index to use the new feature. Search engines precalculate and write typically denormalized or otherwise processed information into the index prioritizing speed of retrieval over space and long term storage. As others have mentioned, there is also the ever changing requirements problem. Typically someone in product management or if you are unlucky, the CEO hears of something cool someone did with solr and says: Hey, let's do that too! I bet it would really draw customers in!... 9 times out of 10 the new thing involves changing the way something is analyzed, or adding a new analysis of previously ingested data. If you can't reindex you have to be able to say no, not on old data" and possibly say "we'll need a separate collection for the new data and it will be difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
The ability to add fields to documents is much more for things like adding searchable geo-located GPS coordinates to documents that have a location, or metadata like you mention, than for storing the document content itself. It is *possible* to have self re-indexing documents that contain all the data needed to repeat indexing, but it takes a lot of space and slows down your index. Furthermore, it requires that all indexing enrichment/cleaning/etc. be baked inside Solr using updateProcessorFactories... which in turn makes all that indexing work compete more heavily with search queries... or alternately it requires that the data be queried out and inserted back in after external processing, which is also going to compete with user queries (so maybe one winds up fielding extra hardware, or even 2 clusters - twice as many machines - and swapping clusters back and forth periodically; now it's complex, expensive and has very high index latency instead of slow queries... no free lunch there). Trying to store the original data just complicates matters. Keeping it simple and using Solr to find things that are then served from a primary source is really the best place to start.

So yeah, you *could* use it as a primary store with work and acceptance of limitations, but you have to be aware of what you are doing, and have a decently working crystal ball. I never advise clients to do this because I prefer happy clients that say nice things about me :) So my advice to you is don't do it unless there is an extremely compelling reason.

Assuming you're not dealing with really massive amounts of data, just indexing some internal intranet (and it's not something the size of Apple or Google), then for your use case, crawling pages, I'd have the crawler drop anything it finds and considers worthy of indexing to a filesystem (maybe 2 files: the content, and a file with metadata like the link where it was found), have a separate indexing process scan the filesystem periodically, munge it for metadata or whatever other manipulations are useful, and then write the result to Solr. If the crawl store is designed so the same document always lands in the same location, you don't have to worry about growth other than the growth of the site(s) you are indexing. There are ways to improve on things from there, such as adding a Kafka instance with a topic that identifies newly fetched docs to prevent (or augment) the periodic scanning. Also, storing a hash of the content in a database lets the indexer ignore cases where the crawler simply downloaded the same bytes because nothing's changed... And you'll want to decide whether you want to remove references to pages that disappeared, or detect moves/renames vs. deletions, which is a whole thing of its own...

My side project JesterJ.org <https://www.JesterJ.org> provides a good deal of the indexer features I describe (but it still needs a Kafka connector, contributions welcome :) ). Some folks have used it profitably, but it's admittedly still rough, and the current master is much better than the now ancient, last released beta (which probably should have been an alpha, but oh well :)
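To make that a little more concrete, here is a very rough sketch of what the periodic scan-and-index side could look like in plain Java with SolrJ. Everything specific in it is made up for illustration - the directory layout (a .html file plus a .meta sidecar in Properties format), the "intranet" collection name, the dynamic field names and the properties-file hash store - it's not JesterJ or Fusion code, just the shape of the loop:

// Rough sketch only. Assumes SolrJ 8.x/9.x on the classpath, a collection named
// "intranet", and a crawl directory where each fetched page is stored as
// <name>.html plus a <name>.meta sidecar (java.util.Properties format) holding
// things like the source URL and fetch date. All of those names are placeholders.
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Properties;
import java.util.stream.Stream;

public class CrawlDirIndexer {

  public static void main(String[] args) throws Exception {
    Path crawlDir = Paths.get(args.length > 0 ? args[0] : "/data/crawl");
    Path hashStore = crawlDir.resolve("indexed-hashes.properties");

    // Hashes of content already sent to Solr; a real setup would likely keep
    // these in a small database rather than a flat file.
    Properties seen = new Properties();
    if (Files.exists(hashStore)) {
      try (InputStream in = Files.newInputStream(hashStore)) { seen.load(in); }
    }

    try (SolrClient solr = new Http2SolrClient.Builder("http://localhost:8983/solr").build();
         Stream<Path> files = Files.walk(crawlDir)) {

      for (Path content : (Iterable<Path>) files.filter(p -> p.toString().endsWith(".html"))::iterator) {
        byte[] bytes = Files.readAllBytes(content);
        String id = crawlDir.relativize(content).toString();   // stable location -> stable id
        String hash = hash(bytes);
        if (hash.equals(seen.getProperty(id))) {
          continue;                                             // same bytes as last pass, skip
        }

        // Sidecar metadata written by the crawler (url, fetch date, ...).
        Properties meta = new Properties();
        Path metaFile = Paths.get(content.toString().replace(".html", ".meta"));
        if (Files.exists(metaFile)) {
          try (InputStream in = Files.newInputStream(metaFile)) { meta.load(in); }
        }

        // Any munging/enrichment you want to keep out of Solr happens here.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("url_s", meta.getProperty("url", ""));
        doc.addField("fetched_s", meta.getProperty("fetched", ""));
        doc.addField("content_txt", new String(bytes, StandardCharsets.UTF_8));
        solr.add("intranet", doc);

        seen.setProperty(id, hash);
      }
      solr.commit("intranet");
    }

    try (OutputStream out = Files.newOutputStream(hashStore)) {
      seen.store(out, "content hashes of docs already sent to Solr");
    }
  }

  private static String hash(byte[] bytes) throws Exception {
    return Base64.getEncoder().encodeToString(MessageDigest.getInstance("SHA-256").digest(bytes));
  }
}

Run something like that from cron to start with, and only documents whose bytes actually changed get pushed; the Kafka topic and the deletion/rename handling mentioned above can be layered on later without changing the basic shape.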
-Gus

On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <dominique.bej...@eolya.fr> wrote:

> Hi,
>
> A best practice for performance and resource usage is to store and/or
> index and/or docValues only the data required for your search features.
> However, in order to implement or modify new or existing features in an
> index you will need to reindex all the data in this index.
>
> I propose 2 solutions:
>
> - The first one is to store the full original JSON data into the _src_
> field of the index.
>
> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
>
> - The second, and the best solution in my opinion, is to store the JSON
> data into an intermediate, feature-neutral data store such as a simple
> file system or, better, a MongoDB database. This will allow you to use
> your data in several indexes (one index for search, one index for
> suggesters, ...) without duplicating data into _src_ fields in each
> index. A uuid in each index will allow you to get the full JSON object
> in MongoDB.
>
> Obviously a key point is the backup strategy of your data store according
> to the solution you choose: either the Solr indexes, the file system, or
> the MongoDB database.
>
> Dominique
>
> On Mon, Apr 4, 2022 at 1:53 PM Srijan <shree...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am working on designing a Solr based enterprise search solution. One
> > requirement I have is to track crawled data from various different data
> > sources with metadata like crawled date, indexing status and so on. I am
> > looking into using Solr itself as my data store and not adding a separate
> > database to my stack. Has anyone used Solr as a dedicated data store? How
> > did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
> > DB - can someone here share some insight into how Fusion is using this
> > 'DB'? My store will need to track millions of objects and be able to
> > handle parallel adds/updates. Do you think Solr is a good tool for this
> > or am I better off depending on a database service?
> >
> > Thanks a bunch.

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)