Srijan,

Comments off the top of my head, so buyer beware.

You almost always want to be able to reindex your data from a 'source'.
That makes an index a poor data store or source of truth.  The reasons
vary: indexes age out data, because there is frequently a weight towards
more recent items; indexes need to be rebuilt when there is new info to
index or when there are issues during indexing/processing; and the list
goes on.

I built a POJO data store on top of a Lucene index a *long* time ago.  It
is doable to hydrate a stored object into a language-level object, such as
a Java object instance.  It is also fairly straightforward to map a
'common' type of data model onto an index.  But the query expectations are
not quite the same, and so on.  It is not that far off, but again, this is
not the primary focus of an inverted index.  The primary focus is to take
unstructured language data and return results in a hopefully well-ordered
list.
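
To make the hydration point concrete, here is a minimal sketch, in Python
rather than Java just to keep it short.  The stored document is a plain
dict standing in for whatever stored fields the index hands back;
CrawledPage, its field names and hydrate() are all made up for
illustration and are not part of any Lucene or Solr API.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass
class CrawledPage:
    # Hypothetical object model; the field names are illustrative only.
    id: str
    title: str
    crawled_at: datetime


def hydrate(stored: dict) -> CrawledPage:
    """Turn a dict of stored field values into a typed object."""
    wanted = {f.name for f in fields(CrawledPage)}
    # An index often returns extra fields; keep only the ones we model.
    values = {k: v for k, v in stored.items() if k in wanted}
    # Assume dates are stored as ISO-8601 strings and parse them.
    values["crawled_at"] = datetime.fromisoformat(values["crawled_at"])
    return CrawledPage(**values)


stored_doc = {
    "id": "doc-1",
    "title": "Quarterly report",
    "crawled_at": "2022-04-04T05:52:00",
    "_version_": 1,  # extra field the object model does not know about
}
page = hydrate(stored_doc)
print(page)
```

The mapping itself is the easy part.  What this does not give you is
the query behaviour you would expect from a database.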

So, the first thing you might do is treat the different sources of data as
different clusters, each with its own topology.  You might stripe the data
less and put it on more nodes than you otherwise would, because you will do
less indexing with it than you would with a normal index.  Once you decide
to separate out the data, you have to contend with two different indexes
holding references to the same 'documents', with some id to tie them
together, and you lose the ability to do any form of in-index join using
document ids.  If you keep all the data in the same index, you might end
up in a situation where the common answer is "reindex" and you would not
know what to do about the "metadata".

I strongly suspect what you want is to maintain the metadata within the
index and simply use it alongside the documents.  As you spider, keep the
info about the document with the document contents.  I cannot think of a
reason to keep all of that data in a kinda weird separate space.  If you
want to be more sophisticated, you can build an ETL which takes documents,
forms indexable units, and stores those units for reindexing.  This is
usually pretty quick, and it separates out the crawling, ETL and
indexing/query pieces, for all that means.  It is more complicated, but it
would be a bit more standard in how people think about it.
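
A rough sketch of that ETL idea, in Python and with no real Solr calls.
Each indexable unit carries the crawl metadata along with the content,
the units are saved as JSON lines, and the index is rebuilt from that
file.  The function names, the field names (crawled_at, indexing_status)
and the dict standing in for the index are all assumptions for the sake of
the example.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def to_indexable_unit(doc_id, source, text):
    """ETL step: combine document content with its crawl metadata."""
    return {
        "id": doc_id,
        "source": source,
        "content": text.strip(),
        "crawled_at": datetime.now(timezone.utc).isoformat(),
        "indexing_status": "pending",
    }


def save_units(units, path):
    """Persist units as JSON lines; this file is what you reindex from."""
    with open(path, "w", encoding="utf-8") as fh:
        for unit in units:
            fh.write(json.dumps(unit) + "\n")


def reindex(path, index):
    """Rebuild the index from the stored units, without re-crawling."""
    index.clear()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            unit = json.loads(line)
            unit["indexing_status"] = "indexed"
            # Stand-in for sending the unit to the search engine.
            index[unit["id"]] = unit
    return index


units = [
    to_indexable_unit("a1", "wiki", "  hello world  "),
    to_indexable_unit("b2", "fileshare", "budget notes"),
]
path = os.path.join(tempfile.mkdtemp(), "units.jsonl")
save_units(units, path)

index = reindex(path, {})
print(sorted(index))
```

Because the saved units are the source, you can throw the index away
and rebuild it from them without crawling again.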

tim



On Mon, Apr 4, 2022 at 7:32 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/4/2022 5:52 AM, Srijan wrote:
> > I am working on designing a Solr based enterprise search solution. One
> > requirement I have is to track crawled data from various different data
> > sources with metadata like crawled date, indexing status and so on. I am
> > looking into using Solr itself as my data store and not adding a separate
> > database to my stack. Has anyone used Solr as a dedicated data store? How
> > did it compare to an RDBMS?
>
> As you've been told, Solr is NOT a database.  It is most definitely not
> equivalent in any way to an RDBMS.  If you want the kinds of things an
> RDBMS is good for, you should use an RDBMS, not Solr.
>
> Handling ever-changing search requirements in Solr is typically going to
> require the kinds of schema changes that need a full reindex.  So you
> probably wouldn't be able to use the same Solr index for your data
> storage as you do for searching anyway.
>
> If you're going to need to set up two Solr installs to handle your
> needs, you should probably NOT use Solr for the storage role.  Use
> something that has been tested and hardened against data loss. Solr does
> do its best to never lose data, but guaranteed data durability is not
> one of its design goals.  The changes that would be required to make
> that guarantee would most likely have an extremely adverse effect on
> search performance.
>
> Solr's core functionality has always been search.  Search is what it's
> good at, and that's what will be optimized in future versions ... not
> any kind of database functionality.
>
> Thanks,
> Shawn
>
>
