> The 'no' response is traditional and a bit dated.

Agreed, we have been using Solr as a main data store for many years for some use cases. But we only store either logs or data that we can reproduce or regenerate.
The original message was about storing a CrawlDB. In that case storing it in Solr is fine, since the data is easy to reproduce in case of disaster.

On Tue, 5 Apr 2022 at 15:26, James Greene <ja...@jamesaustingreene.com> wrote:

> The 'no' response is traditional and a bit dated. If you have proper
> backup/snapshots happening it is totally plausible to use solr (lucene) as
> a primary data store. If you need field/config changes you can import a
> collection from an existing collection doing the field transforms on the
> fly.
>
> There are a growing number of products built on lucene/elastic that act as
> a primary datastore. There is no reason solr can't be used as the same
> outside of the core devs' slow response to bugs/documentation but that's a
> topic for questioning using solr at all.
>
> Like all software solutions your system should be designed with redundancy
> and resiliency.
>
> Good Luck!
>
> On Tue, Apr 5, 2022, 12:44 AM Tim Casey <tca...@gmail.com> wrote:
>
> > Srijan,
> >
> > Comments off the top of my head, so buyer beware.
> >
> > Almost always you want to be able to reindex your data from a 'source'.
> > This makes things like indexes not good as a data store, or a source of
> > truth. The reasons for this vary. Indexes age out data because there is
> > frequently a weight towards more recent items, indexes need to be
> > reindexed for new info to index/issues during indexing/processing, and
> > the list would go on.
> >
> > I have built an index data POJO store in lucene a *long* time ago. It is
> > doable to hydrate a stored object into a language level object, such as a
> > java object instance. It is fairly straightforward to data model from a
> > 'common' type of data model into an index as a data model. But, it is not
> > quite the same query expectations and so on. It is not that far, but
> > again, this is not what the primary focus of an inverted index is. The
> > primary focus is to take unstructured language data and return results in
> > a hopefully well ordered list.
> >
> > So, the first thing you might do is treat the different sources of data
> > as different clusters with a different topology. You might stripe the
> > data less and have it be more nodes than you might otherwise because you
> > will do less indexing with it, than you might a normal index. Once you
> > make a decision to separate out the data, then you have to contend with
> > two different indexes having references to the same 'documents' with some
> > id to tie them together and you would lose the ability to do any form of
> > in-index join using document ids. If you keep all the data in the same
> > index, then you might be in a situation where the common answer is
> > reindex and you would not know what to do about the "metadata".
> >
> > I strongly suspect what you want is to have a way to maintain the
> > metadata within the index and use it simply as you would along with the
> > documents. As you spider, keep the info about the document with the
> > document contents. I cannot think of a reason to keep all of the data in
> > a kinda weird separate space. If you want to be more sophisticated, then
> > you can build an ETL which takes documents and forms indexable units, and
> > stores the indexable units for reindexing. This is usually pretty quick
> > and separates out the crawling, ETL and indexing/query pieces, for all
> > that means. This is more complicated, but would be a bit more standard in
> > how people think about it.
> >
> > tim
> >
> > On Mon, Apr 4, 2022 at 7:32 PM Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > > On 4/4/2022 5:52 AM, Srijan wrote:
> > > > I am working on designing a Solr based enterprise search solution.
> > > > One requirement I have is to track crawled data from various
> > > > different data sources with metadata like crawled date, indexing
> > > > status and so on. I am looking into using Solr itself as my data
> > > > store and not adding a separate database to my stack. Has anyone
> > > > used Solr as a dedicated data store? How did it compare to an RDBMS?
> > >
> > > As you've been told, Solr is NOT a database. It is most definitely not
> > > equivalent in any way to an RDBMS. If you want the kinds of things an
> > > RDBMS is good for, you should use an RDBMS, not Solr.
> > >
> > > Handling ever-changing search requirements in Solr is typically going
> > > to require the kinds of schema changes that need a full reindex. So
> > > you probably wouldn't be able to use the same Solr index for your data
> > > storage as you do for searching anyway.
> > >
> > > If you're going to need to set up two Solr installs to handle your
> > > needs, you should probably NOT use Solr for the storage role. Use
> > > something that has been tested and hardened against data loss. Solr
> > > does do its best to never lose data, but guaranteed data durability is
> > > not one of its design goals. The changes that would be required to
> > > make that guarantee would most likely have an extremely adverse effect
> > > on search performance.
> > >
> > > Solr's core functionality has always been search. Search is what it's
> > > good at, and that's what will be optimized in future versions ... not
> > > any kind of database functionality.
> > >
> > > Thanks,
> > > Shawn
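
For what it's worth, the ETL idea in Tim's quoted message (turn crawled documents into indexable units, keep the crawl metadata with them, and store the units so a reindex never needs a re-crawl) could be sketched roughly as below. This is only an illustration under my own assumptions: the field names (`crawled_at`, `index_status`), the function names and the JSON Lines file used as the store are all hypothetical, and the step that actually posts units to Solr is deliberately left out.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Iterator


@dataclass
class IndexableUnit:
    """One crawled document plus its crawl metadata (hypothetical fields)."""
    id: str
    source: str
    content: str
    crawled_at: str                 # ISO 8601 timestamp of the crawl
    index_status: str = "pending"   # e.g. pending / indexed / failed


def make_unit(doc_id: str, source: str, raw_text: str) -> IndexableUnit:
    """ETL step: normalise raw crawl output into an indexable unit."""
    return IndexableUnit(
        id=doc_id,
        source=source,
        content=" ".join(raw_text.split()),  # collapse runs of whitespace
        crawled_at=datetime.now(timezone.utc).isoformat(),
    )


def append_units(store: Path, units: Iterable[IndexableUnit]) -> None:
    """Append units to a JSON Lines file that acts as the reindex source."""
    with store.open("a", encoding="utf-8") as fh:
        for unit in units:
            fh.write(json.dumps(asdict(unit)) + "\n")


def read_units(store: Path) -> Iterator[IndexableUnit]:
    """Replay stored units, e.g. to rebuild an index after a schema change."""
    with store.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield IndexableUnit(**json.loads(line))
```

With something like this, a full reindex becomes a loop over `read_units(...)` feeding whatever indexing client you use, and the crawler, the ETL and the index can each be changed without touching the other two.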