This is one of the most interesting and articulate emails I’ve read about the 
fundamentals in a long time. Saving this one :)

> On Apr 7, 2022, at 9:32 PM, Gus Heck <gus.h...@gmail.com> wrote:
> 
> Solr is not a "good" primary data store. Solr is built for finding your
> documents, not storing them. A good primary data store holds stuff
> indefinitely without adding weight and without changing regularly; Solr
> doesn't fit that description. One of the biggest reasons for this is that
> at some point you'll want to upgrade to the latest version, and we only
> support a single interim upgrade. So 6 to 7 or 7 to 8 works, but a
> multiple-step upgrade such as 6 to 7 to 8 may fail. I seem to recall hearing
> that this was actually enforced by the code, but I didn't find the check on
> a quick look through the code (doesn't mean it isn't there, just that I
> didn't find it). In
> any case, multi-version upgrades are not generally supported, intentionally
> so that we can be free to make improvements without having to carry an ever
> growing weight of back compatibility. Typically if new index features are
> developed and you want to use them (like when doc values were introduced)
> you will need to re-index to use the new feature. Search engines
> precalculate and write typically denormalized or otherwise processed
> information into the index, prioritizing speed of retrieval over space and
> long-term storage. As others have mentioned, there is also the
> ever-changing requirements problem. Typically someone in product management
> or, if you are unlucky, the CEO hears of something cool someone did with
> Solr and says: "Hey, let's do that too! I bet it would really draw customers
> in!"... 9 times out of 10 the new thing involves changing the way something
> is analyzed, or adding a new analysis of previously ingested data. If you
> can't reindex you have to be able to say "no, not on old data" and possibly
> say "we'll need a separate collection for the new data and it will be
> difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> 
> The ability to add fields to documents is much more for things like adding
> searchable geo-located GPS coordinates for documents that have a location,
> or metadata like you mention, than for storing the document
> content itself. It is *possible* to have self re-indexing documents that
> contain all the data needed to repeat indexing, but it takes a lot of space
> and slows down your index. Furthermore it requires that all indexing
> enrichment/cleaning/etc. be baked inside Solr using
> updateProcessorFactories... which in turn makes all that indexing work
> compete more heavily with search queries... or alternatively requires that
> the data be queried out and inserted back in after external
> processing, which is also going to compete with user queries (so maybe one
> winds up fielding extra hardware or even 2 clusters - twice as many
> machines - and swapping clusters back and forth periodically; now it's
> complex, expensive, and has a very high index latency instead of slow
> queries... no free lunch there). Trying to store the original data just
> complicates matters. Keeping it simple and using Solr to find things that
> are then served from a primary source is really the best place to start.
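
For what it's worth, here's a rough sketch of that "thin index" pattern in
SolrJ. The core name and field names below are invented; the point is just
that Solr only holds what's needed to find a document plus a key back into
whatever the real primary store is:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ThinIndexSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical "crawl" core; only searchable fields plus a pointer are indexed.
            try (SolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/crawl").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "intranet-doc-123");       // key back into the primary store
                doc.addField("title", "Quarterly report");    // searchable
                doc.addField("body_text", "extracted text");  // searchable text, not the system of record
                doc.addField("source_url", "https://intranet.example.com/reports/q1.html");
                solr.add(doc);
                solr.commit();
            }
            // At query time: search Solr, take the ids/urls from the hits, and serve
            // the full content from the primary source (filesystem, RDBMS, etc.).
        }
    }
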
> 
> So yeah, you *could* use it as a primary store with work and acceptance of
> limitations, but you have to be aware of what you are doing, and have a
> decently working crystal ball. I never advise clients to do this because I
> prefer happy clients that say nice things about me :) So my advice to you
> is don't do it unless there is an extremely compelling reason.
> 
> Assuming you're not dealing with really massive amounts of data, just
> indexing some internal intranet (and it's not something the size of Apple
> or Google), then for your use case, crawling pages, I'd have the crawler
> drop anything it finds and considers worthy of indexing to a filesystem
> (maybe 2 files, the content and a file with metadata like the link where it
> was found), have a separate indexing process scan the filesystem
> periodically, munge it for metadata or whatever other manipulations are
> useful and then write the result to Solr. If the crawl store is designed so
> that the same document always lands in the same location, you don't have to
> worry about growth other than the growth of the site(s) you are indexing.
> There are ways to improve on things from there, such as adding a Kafka
> instance with a topic that identifies newly fetched docs to prevent (or
> augment) the periodic scanning. Also, storing a hash of the content in a
> database lets the indexer ignore cases where the crawler simply downloaded
> the same bytes, because nothing's changed...
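
A tiny sketch of that hash check in plain Java, with a simple Map standing in
for whatever database actually records the hashes between crawls:

    import java.security.MessageDigest;
    import java.util.Base64;
    import java.util.HashMap;
    import java.util.Map;

    public class ChangeDetector {
        private final Map<String, String> lastSeenHash = new HashMap<>();

        /** Returns true if the fetched bytes differ from what this URL returned last time. */
        public boolean hasChanged(String url, byte[] fetchedBytes) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(fetchedBytes);
            String hash = Base64.getEncoder().encodeToString(digest);
            String previous = lastSeenHash.put(url, hash);  // remember for the next crawl
            return !hash.equals(previous);                  // unchanged content -> skip indexing
        }
    }
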
> 
> And you'll want to decide if you want to remove references to pages that
> disappeared, or detect moves/renames vs. deletions, which is a whole thing of
> its own...
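
If plain removal is enough, one common trick (sketched below; the field name
is invented) is to stamp every indexed document with the id of the crawl that
last saw it, and once a full crawl finishes, delete whatever that crawl didn't
touch. Detecting moves/renames would need more than this:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class StaleDocCleanup {
        public static void main(String[] args) throws Exception {
            String currentCrawlId = "2022-04-07T21:00:00Z";  // hypothetical crawl id
            try (SolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/crawl").build()) {
                // Delete every document not stamped by the crawl that just completed.
                solr.deleteByQuery("*:* -crawl_id_s:\"" + currentCrawlId + "\"");
                solr.commit();
            }
        }
    }
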
> 
> My side project JesterJ.org <https://www.JesterJ.org> provides a good deal
> of the indexer features I describe (but it still needs a Kafka connector,
> contributions welcome :) ). Some folks have used it profitably, but it's
> admittedly still rough, and the current master is much better than the now
> ancient, last released beta (which probably should have been an alpha but
> oh well :)
> 
> -Gus
> 
>> On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <dominique.bej...@eolya.fr>
>> wrote:
>> 
>> Hi,
>> 
>> A best practice for performance and resource usage is to store and/or
>> index and/or docValues only the data required for your search features.
>> However, in order to implement new features or modify existing ones in an
>> index, you will need to reindex all the data in this index.
>> 
>> I propose 2 solutions:
>> 
>>   - The first one is to store the full original JSON data into the _src_
>>   field of the index.
>> 
>> 
>> https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
>> 
>> 
>>   - The second, and in my opinion the best, solution is to store the JSON
>>   data in an intermediate, feature-neutral data store such as a simple file
>>   system or, better, a MongoDB database. This will allow you to use your
>>   data in several indexes (one index for search, one index for suggesters,
>>   ...) without duplicating data into _src_ fields in each index. A uuid in
>>   each index will allow you to get the full JSON object back from MongoDB
>>   (a rough sketch of this approach follows below the list).
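
A rough sketch of that second option in Java (the collection, core, and field
names are all invented): the untouched JSON goes into MongoDB keyed by a uuid,
and each Solr index gets only the fields it needs plus that uuid:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.bson.Document;

    import java.util.UUID;

    public class StoreThenIndex {
        public static void main(String[] args) throws Exception {
            String uuid = UUID.randomUUID().toString();
            String json = "{\"title\": \"Quarterly report\", \"body\": \"full text here\"}";

            // 1. Feature-neutral store: keep the original JSON exactly as received.
            try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> docs =
                        mongo.getDatabase("crawl").getCollection("documents");
                docs.insertOne(Document.parse(json).append("uuid", uuid));
            }

            // 2. Each index gets only the fields its feature needs, plus the uuid.
            try (SolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/search").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", uuid);
                doc.addField("title", "Quarterly report");
                solr.add(doc);
                solr.commit();
            }
            // To build another index later (suggester, etc.), stream the JSON back
            // out of MongoDB by uuid and map it however that index needs.
        }
    }
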
>> 
>> 
>> Obviously a key point is the backup strategy of your data store according
>> to the solution you choose: either the Solr indexes, the file system, or the
>> MongoDB database.
>> 
>> Dominique
>> 
>>> On Mon, Apr 4, 2022 at 13:53, Srijan <shree...@gmail.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> I am working on designing a Solr based enterprise search solution. One
>>> requirement I have is to track crawled data from various different data
>>> sources with metadata like crawled date, indexing status and so on. I am
>>> looking into using Solr itself as my data store and not adding a separate
>>> database to my stack. Has anyone used Solr as a dedicated data store? How
>>> did it compare to an RDBMS? I see Lucidworks Fusion has a notion of Crawl
>>> DB - can someone here share some insight into how Fusion is using this
>>> 'DB'? My store will need to track millions of objects and be able to
>> handle
>>> parallel adds/updates. Do you think Solr is a good tool for this or am I
>>> better off depending on a database service?
>>> 
>>> Thanks a bunch.
>>> 
>> 
> 
> 
> -- 
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
