I mean only to encourage a focus on stability between releases and to offer migration path options. I AM a fanboy of technology that offers an easier path of adoption/maintainability than its competitors.
On Thu, Apr 7, 2022, 11:11 PM Gus Heck <gus.h...@gmail.com> wrote:

> It's not shocking that there are differences among products. If that feature is your favorite, use Elastic. There are other features... and licensing, which matters to some. Amazon's effort is interesting, but will it persist? When Oracle bought MySQL AB, a site named dorsalsource dot org (don't go there, it's now inhabited by an attack site AFAICT, but you can see it on the Wayback Machine ~2008) sprang up in response (a friend of mine was involved). Granted, it was not backed by a big company, but it was useful for a while. Even big companies may change priorities over time and sunset things. Open source projects can be archived too, but Lucene and Solr are among the most active, so that is clearly not a near-term risk. Your tone, however, sounds a bit fanboyish, and sounds a bit like you forget that the folks who maintain Solr are all volunteers. If you see things that need fixing or improving, or want to argue for change without disparaging comments, we certainly welcome your input (and your code, if you are so inclined).
>
> -Gus
>
> On Thu, Apr 7, 2022 at 10:41 PM James Greene <ja...@jamesaustingreene.com> wrote:
>
> > > so that we can be free to make improvements without having to carry an ever growing weight of back compatibility
> >
> > This is actually why people abandon Solr for Elastic/OpenSearch. Solr's core contributors place little value on supporting migration paths and stability within it, so upgrades always come at a heavy cost to users.
> >
> > Very few people think Solr is stable between upgrades (anyone? Bueller... anyone?). This means you need to plan for the migration of data (time/storage) between upgrades. This doesn't mean you need to reindex from source (you will be reindexing); it means you cannot get more/new data from source that you didn't include in your original document when indexing. There are strategies for storing full "source documents" without having them indexed, which allow you to re-index from the stored document (non-indexed fields) without requiring a totally separate persistence layer (a SolrJ sketch of this appears further down the thread).
> >
> > On Thu, Apr 7, 2022, 10:03 PM Dave <hastings.recurs...@gmail.com> wrote:
> >
> > > This is one of the most interesting and articulate emails I've read about the fundamentals in a long time. Saving this one :)
> > >
> > > On Apr 7, 2022, at 9:32 PM, Gus Heck <gus.h...@gmail.com> wrote:
> > >
> > > > Solr is not a "good" primary data store. Solr is built for finding your documents, not storing them. A good primary data store holds stuff indefinitely without adding weight and without changing regularly; Solr doesn't fit that description. One of the biggest reasons for this is that at some point you'll want to upgrade to the latest version, and we only support a single interim upgrade: 6 to 7, or 7 to 8, etc. Multiple-step upgrades such as 6 to 7 to 8 may fail. I seem to recall hearing that this was actually enforced by the code, but I didn't find the check on a quick look through the code (doesn't mean it isn't there, just that I didn't find it). In any case, multi-version upgrades are not generally supported, intentionally, so that we can be free to make improvements without having to carry an ever growing weight of back compatibility.
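> > > >
> > > > For concreteness: Lucene ships an IndexUpgrader tool that rewrites an index into the current segment format, but it only helps one major version at a time and has to be run once per hop, with that hop's jars (including lucene-backward-codecs) on the classpath; it cannot add anything that truly requires a reindex. A minimal sketch, with a made-up index path:
> > > >
> > > > import java.nio.file.Paths;
> > > > import org.apache.lucene.index.IndexUpgrader;
> > > > import org.apache.lucene.store.FSDirectory;
> > > >
> > > > public class OneHopUpgrade {
> > > >   public static void main(String[] args) throws Exception {
> > > >     // Run once per major version hop: with 7.x jars for a 6.x index,
> > > >     // then again with 8.x jars, and so on.
> > > >     try (FSDirectory dir = FSDirectory.open(Paths.get("/var/solr/data/mycore/index"))) {
> > > >       new IndexUpgrader(dir).upgrade(); // rewrites all segments in the current format
> > > >     }
> > > >   }
> > > > }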
> > > >
> > > > Typically, if new index features are developed and you want to use them (like when docValues were introduced), you will need to re-index to use the new feature. Search engines precalculate and write typically denormalized or otherwise processed information into the index, prioritizing speed of retrieval over space and long-term storage. As others have mentioned, there is also the ever-changing requirements problem. Typically someone in product management, or if you are unlucky the CEO, hears of something cool someone did with Solr and says: "Hey, let's do that too! I bet it would really draw customers in!"... 9 times out of 10 the new thing involves changing the way something is analyzed, or adding a new analysis of previously ingested data. If you can't reindex, you have to be able to say "no, not on old data" and possibly say "we'll need a separate collection for the new data and it will be difficult to search both" when asked by PM/CEO/YourBiggestClient/etc.
> > > >
> > > > The ability to add fields to documents is much more for things like adding searchable geo-located GPS coordinates for documents that have a location, or metadata like you mention, than for storing the document content itself. It is *possible* to have self re-indexing documents that contain all the data needed to repeat indexing, but it takes a lot of space and slows down your index. Furthermore, it requires that all indexing enrichment/cleaning/etc. be baked inside Solr using updateProcessorFactories... which in turn makes all that indexing work compete more heavily with search queries... or alternatively requires that the data be queried out and inserted back in after external processing, which is also going to compete with user queries (so maybe one winds up fielding extra hardware, or even 2 clusters - twice as many machines - and swapping clusters back and forth periodically; now it's complex, expensive, and has a very high index latency instead of slow queries... no free lunch there). Trying to store the original data just complicates matters. Keeping it simple and using Solr to find things that are then served from a primary source is really the best place to start.
> > > >
> > > > So yeah, you *could* use it as a primary store with work and acceptance of limitations, but you have to be aware of what you are doing, and have a decently working crystal ball. I never advise clients to do this because I prefer happy clients that say nice things about me :) So my advice to you is don't do it unless there is an extremely compelling reason.
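> > > >
> > > > Roughly, that "query it out and insert it back" loop, combined with the stored "source documents" strategy mentioned earlier in the thread, might look like the following SolrJ sketch. It is a sketch only: the core names and the stored-but-not-indexed _src_ field are made up, and cursorMark paging needs a sort on the uniqueKey.
> > > >
> > > > import org.apache.solr.client.solrj.SolrQuery;
> > > > import org.apache.solr.client.solrj.impl.HttpSolrClient;
> > > > import org.apache.solr.client.solrj.response.QueryResponse;
> > > > import org.apache.solr.common.SolrDocument;
> > > > import org.apache.solr.common.SolrInputDocument;
> > > > import org.apache.solr.common.params.CursorMarkParams;
> > > >
> > > > public class ReindexFromStored {
> > > >   public static void main(String[] args) throws Exception {
> > > >     try (HttpSolrClient oldCore = new HttpSolrClient.Builder("http://localhost:8983/solr/docs_v1").build();
> > > >          HttpSolrClient newCore = new HttpSolrClient.Builder("http://localhost:8983/solr/docs_v2").build()) {
> > > >       SolrQuery q = new SolrQuery("*:*");
> > > >       q.setRows(500);
> > > >       q.setSort("id", SolrQuery.ORDER.asc); // cursorMark requires a uniqueKey sort
> > > >       String cursor = CursorMarkParams.CURSOR_MARK_START;
> > > >       while (true) {
> > > >         q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
> > > >         QueryResponse rsp = oldCore.query(q);
> > > >         for (SolrDocument d : rsp.getResults()) {
> > > >           SolrInputDocument in = new SolrInputDocument();
> > > >           in.setField("id", d.getFieldValue("id"));
> > > >           // _src_ is a hypothetical stored-but-not-indexed field holding the
> > > >           // original JSON; re-parse/enrich it here before writing it back.
> > > >           in.setField("_src_", d.getFieldValue("_src_"));
> > > >           newCore.add(in);
> > > >         }
> > > >         String next = rsp.getNextCursorMark();
> > > >         if (cursor.equals(next)) break; // cursor stopped moving: no more docs
> > > >         cursor = next;
> > > >       }
> > > >       newCore.commit();
> > > >     }
> > > >   }
> > > > }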
> > > >
> > > > Assuming you're not dealing with really massive amounts of data, just indexing some internal intranet (and it's not something the size of Apple or Google), then for your use case, crawling pages, I'd have the crawler drop anything it finds and considers worthy of indexing onto a filesystem (maybe 2 files: the content, and a file with metadata like the link where it was found), have a separate indexing process scan the filesystem periodically, munge it for metadata or whatever other manipulations are useful, and then write the result to Solr. If the crawl store is designed so the same document always lands in the same location, you don't have to worry about growth other than the growth of the site(s) you are indexing. There are ways to improve on things from there, such as adding a Kafka instance with a topic that identifies newly fetched docs to prevent (or augment) the periodic scanning. Also, storing a hash of the content in a database lets the indexer ignore cases where the crawler simply downloaded the same bytes, because nothing's changed (sketched below)...
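> > > >
> > > > A minimal sketch of that hash check, with an in-memory map standing in for the database table (names are made up; the indexer would only re-submit a page when changed(...) returns true):
> > > >
> > > > import java.security.MessageDigest;
> > > > import java.util.HashMap;
> > > > import java.util.Map;
> > > >
> > > > public class CrawlDedup {
> > > >   // Stand-in for the "hash of the content in a database" - in real use
> > > >   // this would be a table keyed by URL.
> > > >   private final Map<String, String> lastHashByUrl = new HashMap<>();
> > > >
> > > >   public boolean changed(String url, byte[] content) throws Exception {
> > > >     byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
> > > >     StringBuilder hex = new StringBuilder();
> > > >     for (byte b : digest) hex.append(String.format("%02x", b));
> > > >     String hash = hex.toString();
> > > >     if (hash.equals(lastHashByUrl.get(url))) {
> > > >       return false; // same bytes as last crawl: skip reindexing
> > > >     }
> > > >     lastHashByUrl.put(url, hash);
> > > >     return true;
> > > >   }
> > > > }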
> > > >
> > > > And you'll want to decide whether you want to remove references to pages that disappeared, or detect moves/renames vs. deletions, which is a whole thing of its own...
> > > >
> > > > My side project JesterJ.org <https://www.JesterJ.org> provides a good deal of the indexer features I describe (but it still needs a Kafka connector, contributions welcome :) ). Some folks have used it profitably, but it's admittedly still rough, and the current master is much better than the now-ancient last released beta (which probably should have been an alpha, but oh well :)
> > > >
> > > > -Gus
> > > >
> > > > On Mon, Apr 4, 2022 at 8:19 AM Dominique Bejean <dominique.bej...@eolya.fr> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > A best practice for performance and resource usage is to store and/or index and/or docValues only the data required by your search features. However, in order to implement or modify new or existing features in an index, you will need to reindex all the data in this index.
> > > > >
> > > > > I propose 2 solutions:
> > > > >
> > > > > - The first one is to store the full original JSON data in the _src_ fields of the index.
> > > > >
> > > > > https://solr.apache.org/guide/8_11/transforming-and-indexing-custom-json.html#setting-json-default
> > > > >
> > > > > - The second, and in my opinion the best, solution is to store the JSON data in an intermediate, feature-neutral data store such as a simple file system or, better, a MongoDB database. This allows you to use your data in several indexes (one index for search, one index for suggesters, ...) without duplicating data into _src_ fields in each index. A uuid in each index will let you get the full JSON object from MongoDB.
> > > > >
> > > > > Obviously a key point is the backup strategy of your data store according to the solution you choose: either the Solr indexes, the file system, or the MongoDB database.
> > > > >
> > > > > Dominique
> > > > >
> > > > > On Mon, Apr 4, 2022 at 1:53 PM, Srijan <shree...@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I am working on designing a Solr-based enterprise search solution. One requirement I have is to track crawled data from various data sources, with metadata like crawl date, indexing status and so on. I am looking into using Solr itself as my data store and not adding a separate database to my stack. Has anyone used Solr as a dedicated data store? How did it compare to an RDBMS? I see Lucidworks Fusion has a notion of a Crawl DB - can someone here share some insight into how Fusion is using this 'DB'? My store will need to track millions of objects and be able to handle parallel adds/updates. Do you think Solr is a good tool for this, or am I better off depending on a database service?
> > > > > >
> > > > > > Thanks a bunch.
> > > >
> > > > --
> > > > http://www.needhamsoftware.com (work)
> > > > http://www.the111shift.com (play)
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
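
For reference, a minimal sketch of Dominique's second option: keep only a uuid in each Solr index and fetch the full, feature-neutral JSON from MongoDB. This assumes the MongoDB Java sync driver; the database, collection, field names, and uuid are made up.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class SourceLookup {
  public static void main(String[] args) {
    try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> docs =
          mongo.getDatabase("crawlstore").getCollection("documents");
      // The uuid comes back from a Solr search result; the full JSON lives
      // only in MongoDB and can feed any number of indexes.
      Document src = docs.find(Filters.eq("uuid", "3f2b94c0-0001")).first();
      System.out.println(src == null ? "not found" : src.toJson());
    }
  }
}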