> so that we can be free to make improvements without having to carry an
> ever growing weight of back compatibility


This is actually why people abandon Solr for Elastic/OpenSearch. Solr's core
contributors place little value on supporting migration paths and stability
between versions, so upgrades always carry a heavy cost for users.

Very few people consider Solr stable between upgrades (anyone? Bueller...
anyone?). That means you need to plan for migrating data (time/storage) with
every upgrade. It doesn't necessarily mean reindexing from the original
source (you will be reindexing either way); it means you cannot pull new or
additional data from the source that you didn't include in the document when
you first indexed it. There are strategies for storing full "source
documents" in stored-but-not-indexed fields, which let you reindex from the
stored copy without requiring a totally separate persistence layer.
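
For what it's worth, a minimal sketch of such a stored-only field in the
managed schema; the field name _src_ is just illustrative, and whether you
keep it as one string field or split it up is up to you:

    <!-- raw source document kept only for reindexing; never searched -->
    <field name="_src_" type="string" indexed="false" stored="true" docValues="false"/>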



On Thu, Apr 7, 2022, 10:03 PM Dave <hastings.recurs...@gmail.com> wrote:

> This is one of the most interesting and articulate emails I’ve read about
> the fundamentals in a long time. Saving this one :)
>
> > On Apr 7, 2022, at 9:32 PM, Gus Heck <gus.h...@gmail.com> wrote:
> >
> > Solr is not a "good" primary data store. Solr is built for finding your
> > documents, not storing them. A good primary data store holds stuff
> > indefinitely without adding weight and without changing regularly; Solr
> > doesn't fit that description. One of the biggest reasons for this is that
> > at some point you'll want to upgrade to the latest version, and we only
> > support a single interim upgrade. So from 6 to 7 or 7 to 8 etc...
> > multiple-step upgrades 6 to 7 to 8 may fail. I seem to recall hearing
> > that this was actually enforced by the code, but I didn't find the check
> > on a quick look through the code (doesn't mean it isn't there, just that
> > I didn't find it). In any case, multi-version upgrades are not generally
> > supported, intentionally so that we can be free to make improvements
> > without having to carry an ever growing weight of back compatibility.
> > Typically if new index features are developed and you want to use them
> > (like when doc values were introduced) you will need to re-index to use
> > the new feature. Search engines precalculate and write typically
> > denormalized or otherwise processed information into the index,
> > prioritizing speed of retrieval over space and long term storage. As
> > others have mentioned, there is also the ever changing requirements
> > problem. Typically someone in product management or, if you are unlucky,
> > the CEO hears of something cool someone did with Solr and says: Hey,
> > let's do that too! I bet it would really draw customers in!... 9 times
> > out of 10 the new thing involves changing the way something is analyzed,
> > or adding a new analysis of previously ingested data. If you can't
> > reindex you have to be able to say "no, not on old data" and possibly say
> > "we'll need a separate collection for the new data and it will be
> > difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> >
> > The ability to add fields to documents is much more for things like
> > adding searchable geo-located GPS coordinates for documents that have a
> > location, or metadata like you mention, than for storing the document
> > content itself. It is *possible* to have self re-indexing documents that
> > contain all the data needed to repeat indexing, but it takes a lot of
> > space and slows down your index. Furthermore it requires that all
> > indexing enrichment/cleaning/etc. be baked inside Solr using
> > updateProcessorFactories... which in turn makes all that indexing work
> > compete more heavily with search queries... or alternately requires that
> > the data be queried out and inserted back in after external processing,
> > which is also going to compete with user queries (so maybe one winds up
> > fielding extra hardware or even 2 clusters - twice as many machines -
> > and swapping clusters back and forth periodically; now it's complex,
> > expensive and has a very high index latency instead of slow queries...
> > no free lunch there). Trying to store the original data just complicates
> > matters. Keeping it simple and using Solr to find things that are then
> > served from a primary source is really the best place to start.
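
To make the "queried out and inserted back in" pattern above concrete, a
rough Python sketch using cursorMark paging and the JSON update format. The
collection name, the reprocess() step and the field it adds are placeholders,
and only fields that are actually stored survive the round trip, which is
exactly the catch described above:

    import requests

    SOLR = "http://localhost:8983/solr/mycollection"   # placeholder collection

    def reprocess(doc):
        doc["reprocessed_b"] = True        # stand-in for external enrichment/cleaning
        return doc

    def reindex_with_external_processing(batch=500):
        cursor = "*"
        while True:
            resp = requests.get(f"{SOLR}/select", params={
                "q": "*:*", "sort": "id asc", "rows": batch,
                "cursorMark": cursor, "fl": "*", "wt": "json",
            }).json()
            docs = resp["response"]["docs"]
            if docs:
                for d in docs:
                    d.pop("_version_", None)           # drop internal concurrency field
                requests.post(f"{SOLR}/update", json=[reprocess(d) for d in docs])
            next_cursor = resp["nextCursorMark"]
            if next_cursor == cursor:                  # cursor stops advancing at the end
                break
            cursor = next_cursor
        requests.get(f"{SOLR}/update", params={"commit": "true"})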
> >
> > So yeah, you *could* use it as a primary store with work and acceptance
> > of limitations, but you have to be aware of what you are doing, and have
> > a decently working crystal ball. I never advise clients to do this
> > because I prefer happy clients that say nice things about me :) So my
> > advice to you is don't do it unless there is an extremely compelling
> > reason.
> >
> > Assuming you're not dealing with really massive amounts of data, just
> > indexing some internal intranet (and it's not something the size of
> > Apple or Google), then for your use case, crawling pages, I'd have the
> > crawler drop anything it finds and considers worthy of indexing to a
> > filesystem (maybe 2 files, the content and a file with metadata like the
> > link where it was found), have a separate indexing process scan the
> > filesystem periodically, munge it for metadata or whatever other
> > manipulations are useful, and then write the result to Solr. If the
> > crawl store is designed so the same document always lands in the same
> > location, you don't have to worry about growth other than the growth of
> > the site(s) you are indexing. There are ways to improve on things from
> > there, such as adding a Kafka instance with a topic that identifies
> > newly fetched docs to prevent (or augment) the periodic scanning, or
> > storing a hash of the content in a database to let the indexer ignore
> > cases where the crawler simply downloaded the same bytes, because
> > nothing's changed...
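
A rough sketch of that scan-hash-index loop, assuming the crawler writes
<id>.content plus <id>.meta.json files into a directory; the paths,
collection name and field names below are invented for illustration, not
anything Solr prescribes:

    import hashlib, json, sqlite3
    from pathlib import Path
    import requests

    SOLR = "http://localhost:8983/solr/intranet"   # placeholder collection
    CRAWL_DIR = Path("/var/crawl")                 # where the crawler drops its files
    db = sqlite3.connect(CRAWL_DIR / "seen.db")
    db.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY, sha256 TEXT)")

    def index_new_or_changed():
        for content_path in CRAWL_DIR.glob("*.content"):
            doc_id = content_path.stem
            raw = content_path.read_bytes()
            digest = hashlib.sha256(raw).hexdigest()
            row = db.execute("SELECT sha256 FROM seen WHERE id = ?", (doc_id,)).fetchone()
            if row and row[0] == digest:
                continue                           # same bytes as last crawl, nothing changed
            meta = json.loads((CRAWL_DIR / f"{doc_id}.meta.json").read_text())
            doc = {
                "id": doc_id,
                "url_s": meta.get("url"),          # link where the page was found
                "crawled_dt": meta.get("crawled_at"),
                "content_txt": raw.decode("utf-8", errors="replace"),
            }
            requests.post(f"{SOLR}/update", json=[doc])
            db.execute("INSERT OR REPLACE INTO seen (id, sha256) VALUES (?, ?)",
                       (doc_id, digest))
            db.commit()
        requests.get(f"{SOLR}/update", params={"commit": "true"})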
> >
> > And you'll want to decide if you want to remove references to pages that
> > disappeared, or detect moves/renames vs. deletions, which is a whole
> > thing of its own...
> >
> > My side project JesterJ.org <https://www.JesterJ.org> provides a good
> > deal of the indexer features I describe (but it still needs a Kafka
> > connector, contributions welcome :) ). Some folks have used it
> > profitably, but it's admittedly still rough, and the current master is
> > much better than the now ancient, last released beta (which probably
> > should have been an alpha, but oh well :)
> >
> > -Gus
> >
> >> On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <dominique.bej...@eolya.fr>
> >> wrote:
> >>
> >> Hi,
> >>
> >> A best practice for performance and resource usage is to store and/or
> >> index and/or docValues only the data required for your search features.
> >> However, in order to implement or modify new or existing features in an
> >> index, you will need to reindex all the data in this index.
> >>
> >> I propose 2 solutions:
> >>
> >>   - The first one is to store the full original JSON data into the _src_
> >>   fields of the index (a rough config sketch follows after this list).
> >>
> >> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
> >>
> >>
> >>   - The second, and in my opinion the best, solution is to store the
> >>   JSON data into an intermediate, feature-neutral data store such as a
> >>   simple file system or, better, a MongoDB database. This will allow you
> >>   to use your data in several indexes (one index for search, one index
> >>   for suggesters, ...) without duplicating data into _src_ fields in
> >>   each index. A UUID in each index will allow you to get the full JSON
> >>   object back from MongoDB.
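
For the first option: if I'm remembering the linked "setting JSON defaults"
page correctly, the solrconfig.xml side looks roughly like the sketch below;
treat it as a hedged recollection and check the ref guide. The _src_ field
itself also needs to exist in the schema as a stored, non-indexed field:

    <initParams path="/update/json/docs">
      <lst name="defaults">
        <str name="srcField">_src_</str>
        <str name="mapUniqueKeyOnly">true</str>
      </lst>
    </initParams>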
> >>
> >>
> >> Obviously a key point is the backup strategy of your data store
> >> according to the solution you choose: either the Solr indexes, the file
> >> system, or the MongoDB database.
> >>
> >> Dominique
> >>
> >>> On Mon, Apr 4, 2022 at 1:53 PM, Srijan <shree...@gmail.com> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I am working on designing a Solr-based enterprise search solution. One
> >>> requirement I have is to track crawled data from various different data
> >>> sources, with metadata like crawled date, indexing status and so on. I
> >>> am looking into using Solr itself as my data store and not adding a
> >>> separate database to my stack. Has anyone used Solr as a dedicated data
> >>> store? How did it compare to an RDBMS? I see Lucidworks Fusion has a
> >>> notion of Crawl DB - can someone here share some insight into how
> >>> Fusion is using this 'DB'? My store will need to track millions of
> >>> objects and be able to handle parallel adds/updates. Do you think Solr
> >>> is a good tool for this, or am I better off depending on a database
> >>> service?
> >>>
> >>> Thanks a bunch.
> >>>
> >>
> >
> >
> > --
> > http://www.needhamsoftware.com (work)
> > http://www.the111shift.com (play)
>
