This is one of the most interesting and articulate emails I’ve read about the fundamentals in a long time. Saving this one :)
> On Apr 7, 2022, at 9:32 PM, Gus Heck <gus.h...@gmail.com> wrote:
>
> Solr is not a "good" primary data store. Solr is built for finding your documents, not storing them. A good primary data store holds stuff indefinitely without adding weight and without changing regularly; Solr doesn't fit that description. One of the biggest reasons for this is that at some point you'll want to upgrade to the latest version, and we only support a single interim upgrade. So 6 to 7 or 7 to 8 works, but a multi-step upgrade like 6 to 7 to 8 may fail. I seem to recall hearing that this was actually enforced by the code, but I didn't find the check on a quick look through the code (doesn't mean it isn't there, just that I didn't find it). In any case, multi-version upgrades are not generally supported, intentionally, so that we can be free to make improvements without having to carry an ever-growing weight of back compatibility. Typically, if new index features are developed and you want to use them (as when doc values were introduced), you will need to re-index to use the new feature. Search engines precalculate and write typically denormalized or otherwise processed information into the index, prioritizing speed of retrieval over space and long-term storage. As others have mentioned, there is also the ever-changing-requirements problem. Typically someone in product management, or if you are unlucky the CEO, hears of something cool someone did with Solr and says: "Hey, let's do that too! I bet it would really draw customers in!"... 9 times out of 10 the new thing involves changing the way something is analyzed, or adding a new analysis of previously ingested data. If you can't reindex, you have to be able to say "no, not on old data" and possibly "we'll need a separate collection for the new data and it will be difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
>
> The ability to add fields to documents is much more for things like adding searchable geo-located GPS coordinates to documents that have a location, or metadata like you mention, than for storing the document content itself. It is *possible* to have self-re-indexing documents that contain all the data needed to repeat indexing, but it takes a lot of space and slows down your index. Furthermore, it requires that all indexing enrichment/cleaning/etc. be baked inside Solr using updateProcessorFactories... which in turn makes all that indexing work compete more heavily with search queries. Alternately, the data has to be queried out and inserted back in after external processing, which is also going to compete with user queries (so maybe one winds up fielding extra hardware or even two clusters - twice as many machines - and swapping clusters back and forth periodically; now it's complex, expensive, and has very high index latency instead of slow queries... no free lunch there). Trying to store the original data just complicates matters. Keeping it simple and using Solr to find things that are then served from a primary source is really the best place to start.
>
> So yeah, you *could* use it as a primary store with work and acceptance of limitations, but you have to be aware of what you are doing and have a decently working crystal ball. I never advise clients to do this because I prefer happy clients that say nice things about me :) So my advice to you is don't do it unless there is an extremely compelling reason.
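For anyone who wants to see what that "serve from a primary source, reindex into Solr when the analysis changes" pattern looks like in practice, here is a rough Python sketch. The collection name (pages), field names, store layout, and use of the requests library are assumptions made for illustration, not anything from Gus's setup:

# Rough sketch: a full reindex of Solr from a primary store kept outside it.
# Collection name, field names, and paths below are illustrative only.
import json
import pathlib

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/pages/update"  # assumed local collection
PRIMARY_STORE = pathlib.Path("/data/primary-store")          # assumed: one JSON file per document


def transform(raw: dict) -> dict:
    """Build the Solr document: only what search needs, not the whole source object."""
    doc = {
        "id": raw["url"],
        "title_txt": raw.get("title", ""),
        "body_txt": raw.get("body", ""),
    }
    if raw.get("fetched_at"):
        doc["fetched_dt"] = raw["fetched_at"]
    return doc


def post(docs: list) -> None:
    # Solr's /update handler accepts a JSON array of documents.
    requests.post(SOLR_UPDATE_URL, json=docs).raise_for_status()


def reindex(batch_size: int = 500) -> None:
    batch = []
    for path in PRIMARY_STORE.glob("*.json"):
        batch.append(transform(json.loads(path.read_text())))
        if len(batch) >= batch_size:
            post(batch)
            batch = []
    if batch:
        post(batch)
    # One explicit commit at the end of the run, via the JSON update command.
    requests.post(SOLR_UPDATE_URL, json={"commit": {}}).raise_for_status()


if __name__ == "__main__":
    reindex()

When the analysis changes, you rerun this against the primary store instead of trying to pull data back out of Solr.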
> Assuming you're not dealing with really massive amounts of data, just indexing some internal intranet (and it's not something the size of Apple or Google), then for your use case, crawling pages, I'd have the crawler drop anything it finds and considers worthy of indexing to a filesystem (maybe two files: the content, and a file with metadata like the link where it was found), have a separate indexing process scan the filesystem periodically, munge it for metadata or whatever other manipulations are useful, and then write the result to Solr. If the crawl store is designed so the same document always lands in the same location, you don't have to worry about growth other than the growth of the site(s) you are indexing. There are ways to improve on things from there, such as adding a Kafka instance with a topic that identifies newly fetched docs to prevent (or augment) the periodic scanning, or storing a hash of the content in a database to let the indexer ignore cases where the crawler simply downloaded the same bytes, because nothing's changed...
>
> And you'll want to decide if you want to remove references to pages that disappeared, or detect moves/renames vs. deletions, which is a whole thing of its own...
>
> My side project JesterJ.org <https://www.JesterJ.org> provides a good deal of the indexer features I describe (but it still needs a Kafka connector, contributions welcome :) ). Some folks have used it profitably, but it's admittedly still rough, and the current master is much better than the now ancient, last released beta (which probably should have been an alpha, but oh well :)
>
> -Gus
>
>> On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <dominique.bej...@eolya.fr> wrote:
>>
>> Hi,
>>
>> A best practice for performance and resource usage is to store and/or index and/or docValues only the data required for your search features. However, in order to implement or modify new or existing features in an index, you will need to reindex all the data in that index.
>>
>> I propose two solutions:
>>
>> - The first one is to store the full original JSON data into the _src_ field of the index:
>> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
>>
>> - The second, and the best solution in my opinion, is to store the JSON data into an intermediate, feature-neutral data store: a simple file system or, better, a MongoDB database. This way you can use your data in several indexes (one index for search, one index for suggesters, ...) without duplicating data into _src_ fields in each index. A UUID in each index will allow you to get the full JSON object back from MongoDB.
>>
>> Obviously a key point is the backup strategy of your data store according to the solution you choose: either the Solr indexes, the file system, or the MongoDB database.
>>
>> Dominique
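Dominique's second option - keep only the searchable fields plus an id in Solr, and hydrate the full JSON from a feature-neutral store at display time - might look roughly like this. A plain directory of JSON files stands in for MongoDB, and the collection name, field names, and paths are made up for the example:

# Sketch of the "intermediate store + id" pattern: the full JSON object lives
# outside Solr (a directory of files standing in for MongoDB here), and Solr
# indexes only the searchable fields plus the id used to look the object up again.
import json
import pathlib
import uuid

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/search/update"  # assumed collection
STORE = pathlib.Path("/data/json-store")                      # assumed feature-neutral store


def save_and_index(obj: dict) -> str:
    doc_id = str(obj.get("id") or uuid.uuid4())
    obj["id"] = doc_id

    # 1) The full JSON goes to the intermediate store (MongoDB, filesystem, ...).
    (STORE / f"{doc_id}.json").write_text(json.dumps(obj))

    # 2) Only the search fields (plus the id) go to Solr.
    solr_doc = {
        "id": doc_id,
        "title_txt": obj.get("title", ""),
        "body_txt": obj.get("body", ""),
    }
    requests.post(SOLR_UPDATE_URL, params={"commitWithin": 10000},
                  json=[solr_doc]).raise_for_status()
    return doc_id


def fetch_full(doc_id: str) -> dict:
    """At display time, hydrate search hits from the store, not from Solr."""
    return json.loads((STORE / f"{doc_id}.json").read_text())

The same store can then feed a second collection (suggesters, a differently analyzed index, a rebuild after an upgrade) without touching the first one.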
>>
>>> On Mon, Apr 4, 2022 at 1:53 PM, Srijan <shree...@gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> I am working on designing a Solr-based enterprise search solution. One requirement I have is to track crawled data from various different data sources, with metadata like crawled date, indexing status, and so on. I am looking into using Solr itself as my data store rather than adding a separate database to my stack. Has anyone used Solr as a dedicated data store? How did it compare to an RDBMS? I see Lucidworks Fusion has a notion of a Crawl DB - can someone here share some insight into how Fusion is using this 'DB'? My store will need to track millions of objects and be able to handle parallel adds/updates. Do you think Solr is a good tool for this, or am I better off depending on a database service?
>>>
>>> Thanks a bunch.
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
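And for completeness, the indexer half of the crawler -> filesystem -> Solr flow Gus describes above, with the content-hash check and the kind of crawl metadata (crawl date, indexing status) Srijan asks about tracking, could be roughed out like this. The paths, the sqlite "seen" table, the field names, and the collection name are all assumptions made for the sketch:

# Sketch of the indexer half of the crawler -> filesystem -> Solr flow: scan the
# crawl store, skip anything whose content hash hasn't changed since the last
# pass, and index the crawl metadata (fetch date, status) alongside the text.
import hashlib
import json
import pathlib
import sqlite3

import requests

CRAWL_STORE = pathlib.Path("/data/crawl-store")  # assumed: <name>.content + <name>.meta.json pairs
SOLR_UPDATE_URL = "http://localhost:8983/solr/intranet/update"

seen = sqlite3.connect("/data/indexer-seen.sqlite")
seen.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY, sha256 TEXT)")


def index_pass() -> None:
    for meta_path in CRAWL_STORE.glob("*.meta.json"):
        meta = json.loads(meta_path.read_text())
        content = meta_path.with_suffix("").with_suffix(".content").read_bytes()
        digest = hashlib.sha256(content).hexdigest()

        row = seen.execute("SELECT sha256 FROM seen WHERE id = ?", (meta["url"],)).fetchone()
        if row and row[0] == digest:
            continue  # crawler fetched the same bytes again; nothing to reindex

        doc = {
            "id": meta["url"],
            "body_txt": content.decode("utf-8", errors="replace"),
            "index_status_s": "indexed",          # crawl metadata travels with the doc
        }
        if meta.get("fetched_at"):
            doc["crawled_dt"] = meta["fetched_at"]

        requests.post(SOLR_UPDATE_URL, params={"commitWithin": 30000},
                      json=[doc]).raise_for_status()
        seen.execute("INSERT OR REPLACE INTO seen (id, sha256) VALUES (?, ?)",
                     (meta["url"], digest))
        seen.commit()

Whether that "seen" bookkeeping lives in sqlite, a proper RDBMS, or a Solr collection of its own is exactly the trade-off the thread is about.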