I mean only to encourage a focus on stability between releases and to offer migration path options. I AM a fanboy of technology that offers an easier path of adoption/maintainability than its competitors.
On Thu, Apr 7, 2022, 11:11 PM Gus Heck <gus.h...@gmail.com> wrote:

> It's not shocking that there are differences among products. If that feature is your favorite, use Elastic. There are other features... and licensing, which matters to some. Amazon's effort is interesting, but will it persist? When Oracle bought MySQL AB, a site named dorsalsource dot org (don't go there, it's now inhabited by an attack site AFAICT, but you can see it on the Wayback Machine ~2008) sprang up in response (a friend of mine was involved). Granted, it was not backed by a big company, but it was useful for a while. Even big companies may change priorities over time and sunset things. Open source projects can be archived too, but Lucene and Solr are among the most active, so that is clearly not a near-term risk. Your tone, however, sounds a bit fanboyish, and sounds a bit like you forget that the folks who maintain Solr are all volunteers. If you see things that need fixing or improving, or want to argue for change without disparaging comments, we certainly welcome your input (and your code, if you are so inclined).
>
> -Gus
>
> On Thu, Apr 7, 2022 at 10:41 PM James Greene <ja...@jamesaustingreene.com> wrote:
>
> > > so that we can be free to make improvements without having to carry an ever growing weight of back compatibility
> >
> > This is actually why people abandon Solr for Elastic/OpenSearch. Solr's core contributors place little value on supporting migration paths and stability within it, so upgrades always come at a heavy cost to users.
> >
> > Very few people think Solr is stable between upgrades (anyone? Bueller... anyone?). This means you need to plan for the migration of data (time/storage) between upgrades. This doesn't mean you need to reindex from source (you will be reindexing); it means you cannot get more/new data from source that you didn't include in your original document when indexing. There are strategies for storing full "source documents" without having them indexed, which allow you to re-index from the stored document (non-indexed fields) without requiring a totally separate persistence layer (a SolrJ sketch of this appears further down the thread).
> >
> > On Thu, Apr 7, 2022, 10:03 PM Dave <hastings.recurs...@gmail.com> wrote:
> >
> > > This is one of the most interesting and articulate emails I've read about the fundamentals in a long time. Saving this one :)
> > >
> > > On Apr 7, 2022, at 9:32 PM, Gus Heck <gus.h...@gmail.com> wrote:
> > >
> > > > Solr is not a "good" primary data store. Solr is built for finding your documents, not storing them. A good primary data store holds stuff indefinitely without adding weight and without changing regularly; Solr doesn't fit that description. One of the biggest reasons for this is that at some point you'll want to upgrade to the latest version, and we only support a single interim upgrade: 6 to 7, or 7 to 8, etc. Multiple-step upgrades such as 6 to 7 to 8 may fail. I seem to recall hearing that this was actually enforced by the code, but I didn't find the check on a quick look through the code (doesn't mean it isn't there, just that I didn't find it). In any case, multi-version upgrades are not generally supported, intentionally, so that we can be free to make improvements without having to carry an ever growing weight of back compatibility.
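> > > >
> > > > For concreteness: Lucene ships an IndexUpgrader tool that rewrites an index into the current segment format, but it only helps one major version at a time and has to be run once per hop, with that hop's jars (including lucene-backward-codecs) on the classpath; it cannot add anything that truly requires a reindex. A minimal sketch, with a made-up index path:
> > > >
> > > > import java.nio.file.Paths;
> > > > import org.apache.lucene.index.IndexUpgrader;
> > > > import org.apache.lucene.store.FSDirectory;
> > > >
> > > > public class OneHopUpgrade {
> > > >   public static void main(String[] args) throws Exception {
> > > >     // Run once per major version hop: with 7.x jars for a 6.x index,
> > > >     // then again with 8.x jars, and so on.
> > > >     try (FSDirectory dir = FSDirectory.open(Paths.get("/var/solr/data/mycore/index"))) {
> > > >       new IndexUpgrader(dir).upgrade(); // rewrites all segments in the current format
> > > >     }
> > > >   }
> > > > }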
> > > >
> > > > Typically, if new index features are developed and you want to use them (like when docValues were introduced), you will need to re-index to use the new feature. Search engines precalculate and write typically denormalized or otherwise processed information into the index, prioritizing speed of retrieval over space and long-term storage. As others have mentioned, there is also the ever-changing requirements problem. Typically someone in product management, or if you are unlucky the CEO, hears of something cool someone did with Solr and says: "Hey, let's do that too! I bet it would really draw customers in!"... 9 times out of 10 the new thing involves changing the way something is analyzed, or adding a new analysis of previously ingested data. If you can't reindex, you have to be able to say "no, not on old data" and possibly say "we'll need a separate collection for the new data and it will be difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> > > >
> > > > The ability to add fields to documents is much more for things like adding searchable geo-located GPS coordinates for documents that have a location, or metadata like you mention, than for storing the document content itself. It is *possible* to have self re-indexing documents that contain all the data needed to repeat indexing, but it takes a lot of space and slows down your index. Furthermore, it requires that all indexing enrichment/cleaning/etc. be baked inside Solr using updateProcessorFactories... which in turn makes all that indexing work compete more heavily with search queries... or alternatively requires that the data be queried out and inserted back in after external processing, which is also going to compete with user queries (so maybe one winds up fielding extra hardware, or even 2 clusters - twice as many machines - and swapping clusters back and forth periodically; now it's complex, expensive, and has a very high index latency instead of slow queries... no free lunch there). Trying to store the original data just complicates matters. Keeping it simple and using Solr to find things that are then served from a primary source is really the best place to start.
> > > >
> > > > So yeah, you *could* use it as a primary store with work and acceptance of limitations, but you have to be aware of what you are doing, and have a decently working crystal ball. I never advise clients to do this because I prefer happy clients that say nice things about me :) So my advice to you is don't do it unless there is an extremely compelling reason.
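> > > >
> > > > Roughly, that "query it out and insert it back" loop, combined with the stored "source documents" strategy mentioned earlier in the thread, might look like the following SolrJ sketch. It is a sketch only: the core names and the stored-but-not-indexed _src_ field are made up, and cursorMark paging needs a sort on the uniqueKey.
> > > >
> > > > import org.apache.solr.client.solrj.SolrQuery;
> > > > import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > > > import org.apache.solr.client.solrj.response.QueryResponse;
> > > > import org.apache.solr.common.SolrDocument;
> > > > import org.apache.solr.common.SolrInputDocument;
> > > > import org.apache.solr.common.params.CursorMarkParams;
> > > >
> > > > public class ReindexFromStored {
> > > >   public static void main(String[] args) throws Exception {
> > > >     try (HttpSolrClient oldCore = new HttpSolrClient.Builder("http://localhost:8983/solr/docs_v1").build();
> > > >          HttpSolrClient newCore = new HttpSolrClient.Builder("http://localhost:8983/solr/docs_v2").build()) {
> > > >       SolrQuery q = new SolrQuery("*:*");
> > > >       q.setRows(500);
> > > >       q.setSort("id", SolrQuery.ORDER.asc); // cursorMark requires a uniqueKey sort
> > > >       String cursor = CursorMarkParams.CURSOR_MARK_START;
> > > >       while (true) {
> > > >         q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
> > > >         QueryResponse rsp = oldCore.query(q);
> > > >         for (SolrDocument d : rsp.getResults()) {
> > > >           SolrInputDocument in = new SolrInputDocument();
> > > >           in.setField("id", d.getFieldValue("id"));
> > > >           // _src_ is a hypothetical stored-but-not-indexed field holding the
> > > >           // original JSON; re-parse/enrich it here before writing it back.
> > > >           in.setField("_src_", d.getFieldValue("_src_"));
> > > >           newCore.add(in);
> > > >         }
> > > >         String next = rsp.getNextCursorMark();
> > > >         if (cursor.equals(next)) break; // cursor stopped moving: no more docs
> > > >         cursor = next;
> > > >       }
> > > >       newCore.commit();
> > > >     }
> > > >   }
> > > > }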
> > > >
> > > > Assuming you're not dealing with really massive amounts of data, just indexing some internal intranet (and it's not something the size of Apple or Google), then for your use case, crawling pages, I'd have the crawler drop anything it finds and considers worthy of indexing onto a filesystem (maybe 2 files: the content, and a file with metadata like the link where it was found), have a separate indexing process scan the filesystem periodically, munge it for metadata or whatever other manipulations are useful, and then write the result to Solr. If the crawl store is designed so the same document always lands in the same location, you don't have to worry about growth other than the growth of the site(s) you are indexing. There are ways to improve on things from there, such as adding a Kafka instance with a topic that identifies newly fetched docs to prevent (or augment) the periodic scanning. Also, storing a hash of the content in a database lets the indexer ignore cases where the crawler simply downloaded the same bytes, because nothing's changed (sketched below)...
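> > > >
> > > > A minimal sketch of that hash check, with an in-memory map standing in for the database table (names are made up; the indexer would only re-submit a page when changed(...) returns true):
> > > >
> > > > import java.security.MessageDigest;
> > > > import java.util.HashMap;
> > > > import java.util.Map;
> > > >
> > > > public class CrawlDedup {
> > > >   // Stand-in for the "hash of the content in a database" - in real use
> > > >   // this would be a table keyed by URL.
> > > >   private final Map<String, String> lastHashByUrl = new HashMap<>();
> > > >
> > > >   public boolean changed(String url, byte[] content) throws Exception {
> > > >     byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
> > > >     StringBuilder hex = new StringBuilder();
> > > >     for (byte b : digest) hex.append(String.format("%02x", b));
> > > >     String hash = hex.toString();
> > > >     if (hash.equals(lastHashByUrl.get(url))) {
> > > >       return false; // same bytes as last crawl: skip reindexing
> > > >     }
> > > >     lastHashByUrl.put(url, hash);
> > > >     return true;
> > > >   }
> > > > }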
> > > >
> > > > And you'll want to decide whether you want to remove references to pages that disappeared, or detect moves/renames vs. deletions, which is a whole thing of its own...
> > > >
> > > > My side project JesterJ.org <https://www.JesterJ.org> provides a good deal of the indexer features I describe (but it still needs a Kafka connector, contributions welcome :) ). Some folks have used it profitably, but it's admittedly still rough, and the current master is much better than the now-ancient last released beta (which probably should have been an alpha, but oh well :)
> > > >
> > > > -Gus
> > > >
> > > > On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <dominique.bej...@eolya.fr> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > A best practice for performance and resource usage is to store and/or index and/or docValues only the data required by your search features. However, in order to implement or modify new or existing features in an index, you will need to reindex all the data in this index.
> > > > >
> > > > > I propose 2 solutions:
> > > > >
> > > > > - The first one is to store the full original JSON data in the _src_ fields of the index.
> > > > >
> > > > > https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
> > > > >
> > > > > - The second, and in my opinion the best, solution is to store the JSON data in an intermediate, feature-neutral data store such as a simple file system or, better, a MongoDB database. This allows you to use your data in several indexes (one index for search, one index for suggesters, ...) without duplicating data into _src_ fields in each index. A uuid in each index will let you get the full JSON object from MongoDB.
> > > > >
> > > > > Obviously a key point is the backup strategy of your data store according to the solution you choose: either the Solr indexes, the file system, or the MongoDB database.
> > > > >
> > > > > Dominique
> > > > >
> > > > > On Mon, Apr 4, 2022 at 1:53 PM, Srijan <shree...@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I am working on designing a Solr-based enterprise search solution. One requirement I have is to track crawled data from various data sources, with metadata like crawl date, indexing status and so on. I am looking into using Solr itself as my data store and not adding a separate database to my stack. Has anyone used Solr as a dedicated data store? How did it compare to an RDBMS? I see Lucidworks Fusion has a notion of a Crawl DB - can someone here share some insight into how Fusion is using this 'DB'? My store will need to track millions of objects and be able to handle parallel adds/updates. Do you think Solr is a good tool for this, or am I better off depending on a database service?
> > > > > >
> > > > > > Thanks a bunch.
> > > >
> > > > --
> > > > http://www.needhamsoftware.com (work)
> > > > http://www.the111shift.com (play)
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
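
For reference, a minimal sketch of Dominique's second option: keep only a uuid in each Solr index and fetch the full, feature-neutral JSON from MongoDB. This assumes the MongoDB Java sync driver; the database, collection, field names, and uuid are made up.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class SourceLookup {
  public static void main(String[] args) {
    try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> docs =
          mongo.getDatabase("crawlstore").getCollection("documents");
      // The uuid comes back from a Solr search result; the full JSON lives
      // only in MongoDB and can feed any number of indexes.
      Document src = docs.find(Filters.eq("uuid", "3f2b94c0-0001")).first();
      System.out.println(src == null ? "not found" : src.toJson());
    }
  }
}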