Shawn,
On 6/13/22 21:01, Shawn Heisey wrote:
On 6/13/2022 1:19 PM, Christopher Schultz wrote:
Okay. So if I do what I initially proposed:
1. delete *:*
2. re-index everything
If you do this but do not optimize the index (an operation that will complete practically instantaneously, because the index will consist of nothing but deleted docs), then I can make no guarantees. Lucene might be smart enough to delete all the segment files even without an optimize, but I hate leaving that to chance.
Is step 1 even necessary? If I refresh every single document in the
index, would it ultimately purge old segments by (a) marking all
documents in the old segments as "deleted" and (b) creating only new
segments to contain the new documents? I will be "replacing" each
document with an updated one: their ids will remain stable from
pre-re-index to post-re-index.
Yes, deleting the documents and making sure that all the segment files actually get deleted would be required if you want to be absolutely certain that no prior-version data remains. If you reindex docs into the same index and leave it up to Solr's "replace docs with the same id" functionality, you might get bitten by a segment written by the old version still sitting in the index after the full reindex is done, because background segment merging decides on its own which segments get rewritten and when. There is just no way I can think of to guarantee that won't happen, other than completely wiping the index.
Does that mean I need to:
1. delete *:*
2. optimize
3. re-index everything
Is #2 something available via the SolrJ client, or do I have to issue a
REST call for that?
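Looking at the SolrClient javadocs, I believe it is available: there appear to be deleteByQuery() and optimize() methods on SolrClient. Something like this untested sketch is what I have in mind (URL and core name are placeholders for my setup):

    // Untested sketch: wipe the index completely, then optimize so the
    // all-deleted segments are actually rewritten/removed on disk.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class WipeIndex {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/users").build()) {
                solr.deleteByQuery("*:*"); // step 1: delete every document
                solr.commit();             // make the deletions visible
                solr.optimize();           // step 2: near-instant on an all-deleted index
                // step 3, the full re-index, would follow here
            }
        }
    }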
Okay. I have one-and-only-one Solr node at this point (which is sufficient for my current needs), so it's a little simpler than your deployment described above. The one monkey wrench is that the "online core" could theoretically get updates while the "build core" is being re-generated from scratch. That isn't a problem if the re-index operation hasn't yet reached the user who was updated during that interval, but if a user who has already been re-indexed gets updated, the rebuilt core ends up holding a stale copy of that user.
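If that ever bites me, I imagine the fix is just to note the wall-clock time when the rebuild starts and make one catch-up pass afterward over anyone modified since then. A rough sketch (reindexAllUsers, findUsersModifiedSince, and indexUser are stand-ins for my real application code):

    import java.time.Instant;
    import java.util.List;

    public class RebuildWithCatchUp {
        void rebuild() throws Exception {
            Instant start = Instant.now();   // note when the rebuild began
            reindexAllUsers();               // full pass over every user
            // Catch-up pass: anyone modified while the full pass ran gets
            // re-indexed again, so no stale copy survives the rebuild.
            for (String userId : findUsersModifiedSince(start)) {
                indexUser(userId);
            }
        }

        // Stand-ins for the real application code:
        void reindexAllUsers() throws Exception { /* ... */ }
        List<String> findUsersModifiedSince(Instant since) { return List.of(); }
        void indexUser(String id) throws Exception { /* ... */ }
    }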
I used a MySQL database as the source for my index data. The primary key was "did" ... which is basically shorthand for "delete id". This was an autoincrement field in the database, and therefore effectively unique as well. The uniqueKey was another field in the index, though, and that was the true identifier for each doc.
I kept track of where both the ongoing indexing and a full reindex were
with that "did" field. Any new data was guaranteed to have a did value
larger than anything that was indexed before. I had a separate table
listing the uniqueKey value for each document that needed to be
reindexed "manually", and another table to track deletes.
I'm not sure we need that kind of complexity. I'm happy with my current re-index implementation because it's very straightforward. The only downside is that if you delete all documents before the re-index (which should be a rare process indeed), then the index won't return records that haven't yet been re-indexed. The user-search in my application is mostly administrative, so it shouldn't impact many "regular" users.
As I said, we are (ab)using Solr in a bit of a different way than is
traditional :)
As it stands right this moment, the re-index and operational changes are being made in realtime on the exact same core, so, assuming there isn't a disaster, the index will always be consistent and up-to-date, and I don't have to do any post-re-index re-re-indexing of anything that may have been left behind or missed during the process.
Given the simple (possibly bordering on naive) process above, what
steps would I need to take to ensure that the resulting core state is
usable by Solr N+1, etc. in the future?
Just be sure that there are no Lucene segments in the index directory
before you begin a full index rebuild. If any are hanging around, you
could end up with an incompatible version number in the index when you
try to upgrade.
I'd prefer to only make calls via SolrJ or, if necessary, via REST. So
"ensuring the files are deleted from the disk" is not really possible...
the Solr server is "over there" and so I can't see the disk from my
application.
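That said, it looks like the Luke request handler (/admin/luke) reports index details remotely, and SolrJ has a LukeRequest wrapper for it, so I'm assuming I could at least verify "the index really is empty" without shell access. An untested sketch; the segmentCount key is what I see in that handler's response, so treat it as an assumption:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;

    public class IndexEmptyCheck {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/users").build()) {
                LukeRequest luke = new LukeRequest(); // hits /admin/luke
                luke.setShowSchema(false);            // index stats only
                LukeResponse rsp = luke.process(solr);
                // "segmentCount" is the key I see in the /admin/luke response.
                Object segments = rsp.getIndexInfo().get("segmentCount");
                System.out.println("numDocs=" + rsp.getNumDocs()
                        + " maxDoc=" + rsp.getMaxDoc()
                        + " segmentCount=" + segments);
            }
        }
    }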
-chris