Solr is not a "good" primary data store. Solr is built for finding your
documents, not storing them. A good primary data store holds data
indefinitely without adding weight and without changing regularly, and Solr
doesn't fit that description. One of the biggest reasons is that at some
point you'll want to upgrade to the latest version, and we only support a
single interim upgrade: 6 to 7, or 7 to 8, etc. Multi-step upgrades like 6
to 7 to 8 may fail. I seem to recall hearing that this is actually enforced
by the code, but I didn't find the check on a quick look through the code
(which doesn't mean it isn't there, just that I didn't find it). In any
case, multi-version upgrades are not generally supported, intentionally, so
that we can be free to make improvements without having to carry an
ever-growing weight of back compatibility. Typically, if new index features
are developed and you want to use them (as when doc values were
introduced), you will need to re-index to use the new feature. Search
engines precalculate and write typically denormalized or otherwise
processed information into the index, prioritizing speed of retrieval over
space and long-term storage. As others have mentioned, there is also the
ever-changing requirements problem. Typically someone in product
management, or if you are unlucky the CEO, hears of something cool someone
did with Solr and says: "Hey, let's do that too! I bet it would really draw
customers in!"... 9 times out of 10 the new thing involves changing the way
something is analyzed, or adding a new analysis of previously ingested
data. If you can't reindex, you have to be able to say "no, not on old
data" and possibly "we'll need a separate collection for the new data and
it will be difficult to search both" when asked by
PM/CEO/YourBiggestClient/etc.

The ability to add fields to documents is much more for things like adding
searchable geo-located GPS coordinates for documents that have a location,
or metadata like you mention, than for storing the document content itself.
It is *possible* to have self re-indexing documents that contain all the
data needed to repeat indexing, but it takes a lot of space and slows down
your index. Furthermore, it requires that all indexing
enrichment/cleaning/etc. be baked inside Solr using
updateProcessorFactories... which in turn makes all that indexing work
compete more heavily with search queries... or, alternately, it requires
that the data be queried out and inserted back in after external
processing, which is also going to compete with user queries (so maybe one
winds up fielding extra hardware or even 2 clusters - twice as many
machines - and swapping clusters back and forth periodically; now it's
complex, expensive, and has very high index latency instead of slow
queries... no free lunch there). Trying to store the original data just
complicates matters. Keeping it simple and using Solr to find things that
are then served from a primary source is really the best place to start.
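
Just to make that last point concrete, here's a rough sketch in Python of
the "Solr finds it, the primary source serves it" pattern. The collection
name, field names, and the layout of the primary store are all invented for
the example:

import requests

# Hypothetical collection name and primary-store layout, just for illustration.
SOLR = "http://localhost:8983/solr/intranet"
PRIMARY_STORE = "/data/crawl"

def search_ids(query, rows=10):
    # Ask Solr only for ids (plus score); the heavy content lives elsewhere.
    resp = requests.get(f"{SOLR}/select",
                        params={"q": query, "fl": "id,score", "rows": rows})
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

def fetch_from_primary(doc_id):
    # Serve the full document from the primary source (a filesystem here, but
    # it could just as well be a database or the original site).
    with open(f"{PRIMARY_STORE}/{doc_id}.content", "rb") as f:
        return f.read()

if __name__ == "__main__":
    for doc_id in search_ids("vacation policy"):
        print(doc_id, len(fetch_from_primary(doc_id)), "bytes")

The point is just that the index holds what's needed to rank and locate
documents, while the bytes users actually read come from somewhere that is
cheap to keep around forever.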

So yeah, you *could* use it as a primary store with work and acceptance of
limitations, but you have to be aware of what you are doing, and have a
decently working crystal ball. I never advise clients to do this because I
prefer happy clients that say nice things about me :) So my advice to you
is don't do it unless there is an extremely compelling reason.

Assuming you're not dealing with really massive amounts of data, just
indexing some internal intranet (and it's not something the size of Apple
or Google), then for your use case of crawling pages I'd have the crawler
drop anything it finds and considers worthy of indexing onto a filesystem
(maybe 2 files: the content, and a file with metadata like the link where
it was found), have a separate indexing process scan the filesystem
periodically, munge it for metadata or whatever other manipulations are
useful, and then write the result to Solr. If the crawl store is designed
so the same document always lands in the same location, you don't have to
worry about growth other than the growth of the site(s) you are indexing.
There are ways to improve on things from there, such as adding a Kafka
instance with a topic that identifies newly fetched docs to avoid (or
augment) the periodic scanning. You could also store a hash of the content
in a database so the indexer can skip cases where the crawler simply
downloaded the same bytes, because nothing has changed...
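
In case it helps, here's a bare-bones sketch of that indexer pass in
Python. Every path, file-naming convention, and field name below is
invented for the example, and it assumes the crawler writes <id>.content
plus <id>.meta.json pairs into the crawl store:

import hashlib
import json
import pathlib
import sqlite3
import requests

# Made-up locations and collection name; adjust to whatever your crawler writes.
CRAWL_DIR = pathlib.Path("/data/crawl")
SOLR_BASE = "http://localhost:8983/solr/intranet"
DB = sqlite3.connect("/data/indexer-state.db")
DB.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY, sha256 TEXT)")

def index_pass():
    # One periodic scan: hash each doc, skip unchanged ones, send the rest to Solr.
    for meta_file in CRAWL_DIR.glob("*.meta.json"):
        doc_id = meta_file.name[:-len(".meta.json")]
        content = (CRAWL_DIR / f"{doc_id}.content").read_bytes()
        digest = hashlib.sha256(content).hexdigest()

        row = DB.execute("SELECT sha256 FROM seen WHERE id = ?", (doc_id,)).fetchone()
        if row and row[0] == digest:
            continue  # crawler downloaded the same bytes, nothing changed, skip it

        meta = json.loads(meta_file.read_text())
        doc = {
            "id": doc_id,
            "url_s": meta.get("url"),
            "crawled_dt": meta.get("crawled_date"),
            "content_txt": content.decode("utf-8", errors="replace"),
        }
        requests.post(f"{SOLR_BASE}/update", json=[doc]).raise_for_status()
        DB.execute("INSERT OR REPLACE INTO seen (id, sha256) VALUES (?, ?)",
                   (doc_id, digest))
        DB.commit()

    # Make the batch visible once at the end rather than committing per document.
    requests.get(f"{SOLR_BASE}/update", params={"commit": "true"}).raise_for_status()

if __name__ == "__main__":
    index_pass()

Committing once per pass rather than per document also keeps the indexing
work from competing quite so hard with user queries.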

And you'll want to decide whether to remove references to pages that have
disappeared, or to detect moves/renames vs. deletions, which is a whole
thing of its own...
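
For the simplest of those options (just dropping pages that have
disappeared), one rough approach, assuming the crawler removes the files
for pages it can no longer fetch, is to diff the ids in Solr against the
ids still present in the crawl store. Names and layout are carried over
from the invented sketch above:

import pathlib
import requests

CRAWL_DIR = pathlib.Path("/data/crawl")            # same invented layout as above
SOLR_BASE = "http://localhost:8983/solr/intranet"  # hypothetical collection

def purge_vanished():
    # Ids that still have files in the crawl store.
    on_disk = {p.name[:-len(".meta.json")] for p in CRAWL_DIR.glob("*.meta.json")}

    # Ids currently in the index. Fine at intranet scale; use cursorMark paging
    # if the index gets large.
    resp = requests.get(f"{SOLR_BASE}/select",
                        params={"q": "*:*", "fl": "id", "rows": 1000000})
    resp.raise_for_status()
    in_solr = {doc["id"] for doc in resp.json()["response"]["docs"]}

    gone = sorted(in_solr - on_disk)
    if gone:
        requests.post(f"{SOLR_BASE}/update",
                      json={"delete": gone},
                      params={"commit": "true"}).raise_for_status()

if __name__ == "__main__":
    purge_vanished()

Moves and renames are harder, since to Solr they just look like a delete
plus an add unless you compare content hashes or canonical URLs.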

My side project JesterJ.org <https://www.JesterJ.org> provides a good deal
of the indexer features I describe (but it still needs a Kafka connector,
contributions welcome :) ). Some folks have used it profitably, but it's
admittedly still rough, and the current master is much better than the
now-ancient last released beta (which probably should have been an alpha,
but oh well :)

-Gus

On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <dominique.bej...@eolya.fr>
wrote:

> Hi,
>
> A best practice for performance and resource usage is to store and/or
> index and/or docValues only the data required for your search features.
> However, in order to implement new features or modify existing ones in an
> index, you will need to reindex all the data in that index.
>
> I propose 2 solutions:
>
>    - The first one is to store the full original JSON data into the _src_
>    fields of the index.
>
>
> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
>
>
>    - The second, and in my opinion the best, solution is to store the JSON
>    data into an intermediate, feature-neutral data store such as a simple
>    file system or, better, a MongoDB database. This will allow you to use
>    your data in several indexes (one index for search, one index for
>    suggesters, ...) without duplicating data into _src_ fields in each
>    index. A UUID in each index will allow you to get the full JSON object
>    from MongoDB.
>
>
> Obviously a key point is the backup strategy for your data store according
> to the solution you choose: either the Solr indexes, the file system, or
> the MongoDB database.
>
> Dominique
>
>
> On Mon, Apr 4, 2022 at 1:53 PM Srijan <shree...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am working on designing a Solr based enterprise search solution. One
> > requirement I have is to track crawled data from various different data
> > sources with metadata like crawled date, indexing status and so on. I am
> > looking into using Solr itself as my data store and not adding a separate
> > database to my stack. Has anyone used Solr as a dedicated data store? How
> > did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
> > DB - can someone here share some insight into how Fusion is using this
> > 'DB'? My store will need to track millions of objects and be able to
> > handle
> > parallel adds/updates. Do you think Solr is a good tool for this or am I
> > better off depending on a database service?
> >
> > Thanks a bunch.
> >
>


-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
