> The 'no' response is traditional and a bit dated.

Agreed. We have been using Solr as a main data store for many years for
some use cases, but we only store logs or data that we can reproduce
or regenerate.

The original message was about storing a CrawlDB. In that case storing it
in Solr is fine, since the data is easy to reproduce in case of disaster.
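
As a rough illustration only, here is a minimal sketch of what one such
reproducible CrawlDB record could look like. The helper and the field names
(source_s, crawled_dt, index_status_s) are hypothetical, and the _s/_dt
suffixes assume dynamic fields like the ones in Solr's default configset.
It only builds the JSON payload; sending it to Solr is left out.

```python
import json
from datetime import datetime, timezone


def make_crawl_record(url, source, status, crawled_at=None):
    """Build one CrawlDB entry as a flat dict (all field names are hypothetical)."""
    crawled_at = crawled_at or datetime.now(timezone.utc)
    # Normalize to UTC so the trailing "Z" in the timestamp is accurate.
    crawled_at = crawled_at.astimezone(timezone.utc)
    return {
        "id": url,  # the URL doubles as the unique key
        "source_s": source,
        "crawled_dt": crawled_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "index_status_s": status,
    }


records = [
    make_crawl_record(
        "https://example.com/a",
        "intranet",
        "indexed",
        datetime(2022, 4, 5, 13, 26, tzinfo=timezone.utc),
    )
]

# A JSON array of documents is a shape Solr's JSON update handler accepts.
payload = json.dumps(records)
```

Because every field can be derived again by re-crawling the source, losing
this collection costs a re-crawl rather than any irreplaceable data.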



On Tue, Apr 5, 2022 at 15:26, James Greene <ja...@jamesaustingreene.com> wrote:

> The 'no' response is traditional and a bit dated.  If you have proper
> backup/snapshots happening it is totally plausible to use solr (lucene) as
> a primary data store. If you need field/config changes you can import a
> collection from an existing collection doing the field transforms on the
> fly.
>
> There are a growing number of products built on Lucene/Elastic that act as
> a primary datastore. There is no reason Solr can't be used the same way,
> apart from the core devs' slow response to bugs/documentation, but that's a
> topic for questioning the use of Solr at all.
>
> Like all software solutions your system should be designed with redundancy
> and resiliency.
>
> Good Luck!
>
> On Tue, Apr 5, 2022, 12:44 AM Tim Casey <tca...@gmail.com> wrote:
>
> > Srijan,
> >
> > Comments off the top of my head, so buyer beware.
> >
> > Almost always you want to be able to reindex your data from a 'source'.
> > This makes things like indexes not good as a data store, or a source of
> > truth.  The reasons for this vary.  Indexes age out data because there is
> > frequently a weight towards more recent items, indexes need to be
> > reindexed when there is new info to index or when issues come up during
> > indexing/processing, and the list goes on.
> >
> > I built an index-backed POJO data store in Lucene a *long* time ago.  It is
> > doable to hydrate a stored object into a language-level object, such as a
> > Java object instance.  It is fairly straightforward to map a 'common'
> > type of data model onto an index as a data model.  But the query
> > expectations and so on are not quite the same.  It is not that far off,
> > but again, this is not the primary focus of an inverted index.  The
> > primary focus is to take unstructured language data and return results in
> > a hopefully well-ordered list.
> >
> > So, the first thing you might do is treat the different sources of data as
> > different clusters with a different topology.  You might stripe the data
> > less and have it on more nodes than you otherwise would, because you will
> > do less indexing with it than you would with a normal index.  Once you
> > make a decision to separate out the data, you have to contend with two
> > different indexes having references to the same 'documents' with some id
> > to tie them together, and you would lose the ability to do any form of
> > in-index join using document ids.  If you keep all the data in the same
> > index, then you might be in a situation where the common answer is reindex
> > and you would not know what to do about the "metadata".
> >
> > I strongly suspect what you want is a way to maintain the metadata within
> > the index and use it simply as you would along with the documents.  As you
> > spider, keep the info about the document with the document contents.  I
> > cannot think of a reason to keep all of the data in a kinda weird separate
> > space.  If you want to be more sophisticated, then you can build an ETL
> > which takes documents and forms indexable units, and store the indexable
> > units for reindexing.  This is usually pretty quick and separates out the
> > crawling, ETL and indexing/query pieces, for all that means.  This is more
> > complicated, but would be a bit more standard in how people think about
> > it.
> >
> > tim
> >
> >
> >
> > On Mon, Apr 4, 2022 at 7:32 PM Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > > On 4/4/2022 5:52 AM, Srijan wrote:
> > > > I am working on designing a Solr based enterprise search solution. One
> > > > requirement I have is to track crawled data from various different data
> > > > sources with metadata like crawled date, indexing status and so on. I am
> > > > looking into using Solr itself as my data store and not adding a separate
> > > > database to my stack. Has anyone used Solr as a dedicated data store? How
> > > > did it compare to an RDBMS?
> > >
> > > As you've been told, Solr is NOT a database.  It is most definitely not
> > > equivalent in any way to an RDBMS.  If you want the kinds of things an
> > > RDBMS is good for, you should use an RDBMS, not Solr.
> > >
> > > Handling ever-changing search requirements in Solr is typically going to
> > > require the kinds of schema changes that need a full reindex.  So you
> > > probably wouldn't be able to use the same Solr index for your data
> > > storage as you do for searching anyway.
> > >
> > > If you're going to need to set up two Solr installs to handle your
> > > needs, you should probably NOT use Solr for the storage role.  Use
> > > something that has been tested and hardened against data loss. Solr does
> > > do its best to never lose data, but guaranteed data durability is not
> > > one of its design goals.  The changes that would be required to make
> > > that guarantee would most likely have an extremely adverse effect on
> > > search performance.
> > >
> > > Solr's core functionality has always been search.  Search is what it's
> > > good at, and that's what will be optimized in future versions ... not
> > > any kind of database functionality.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>
