Shawn,

On 6/13/22 21:01, Shawn Heisey wrote:
On 6/13/2022 1:19 PM, Christopher Schultz wrote:
Okay. So if I do what I initially proposed:

1. delete *:*
2. re-index everything

If you do this but do not optimize the index between steps 1 and 2 (the optimize will finish practically instantaneously, because at that point the index will consist of nothing but deleted docs), then I can make no guarantees.  Lucene might be smart enough to delete all the segment files even without an optimize, but I hate leaving that to chance.

Is step 1 even necessary? If I refresh every single document in the index, would it ultimately purge old segments by (a) marking all documents in the old segments as "deleted" and (b) creating only new segments to contain the new documents? I will be "replacing" each document with an updated one: their ids will remain stable from pre-re-index to post-re-index.

Yes, deleting the documents and making sure that all the segment files actually get deleted is required if you want to be absolutely certain that no prior version data remains.  If you reindex docs into the same index and leave it up to Solr's "replace docs with the same id" functionality, you might get bitten by an old-version segment remaining in the index after the full reindex is done, because of ongoing background segment merging.  There is just no way I can think of to guarantee that won't happen, other than completely wiping the index.

Does that mean I need to:

1. delete *:*
2. optimize
3. re-index everything

Is #2 something available via the SolrJ client, or do I have to issue a REST call for that?
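(For reference: SolrJ does expose optimize directly on SolrClient, so no raw REST call should be needed. A minimal sketch of steps 1-3 -- the URL and core name are placeholders, not taken from this thread:)

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;

public class WipeAndReindex {
    public static void main(String[] args) throws Exception {
        // Placeholder URL/core name -- adjust for the real deployment.
        try (SolrClient solr = new Http2SolrClient.Builder(
                "http://localhost:8983/solr/users").build()) {
            solr.deleteByQuery("*:*"); // 1. delete everything
            solr.commit();
            solr.optimize();           // 2. merge away the all-deleted segments
            // 3. re-index everything here (solr.add(...) in batches), then:
            solr.commit();
        }
    }
}
```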

Okay. I have one-and-only-one Solr node at this point (which is sufficient for my current needs) so it's a little simpler than your deployment described above. The one monkey wrench is that the "online core" could theoretically get updates while the "build core" is being re-generated from scratch. That won't be a problem if the re-index operation hasn't yet gotten to the user who was updated during that interval, but if a user gets updated who was already re-indexed, then there could be a problem.
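(If the two-core "build then swap" approach is ever adopted, SolrJ can drive that too -- CoreAdminRequest has a static swapCore helper. Core names below are hypothetical, and admin requests go to the Solr root URL rather than to a core:)

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class SwapCores {
    public static void main(String[] args) throws Exception {
        // Note the root URL, not a core-specific one.
        try (SolrClient admin = new Http2SolrClient.Builder(
                "http://localhost:8983/solr").build()) {
            // Atomically exchange the freshly built core with the live one.
            CoreAdminRequest.swapCore("users_build", "users_online", admin);
        }
    }
}
```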

I used a MySQL database as the source for my index data.  The primary key was "did" ... which is basically shorthand for "delete id".  This was an autoincrement field in the database, and effectively also a unique field.  The uniqueKey was another field in the index, though, and that was the true identifier for each doc.

I kept track of where both the ongoing indexing and a full reindex were with that "did" field.  Any new data was guaranteed to have a did value larger than anything that was indexed before.  I had a separate table listing the uniqueKey value for each document that needed to be reindexed "manually", and another table to track deletes.
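(The "did" watermark idea boils down to a one-line predicate. This toy sketch -- all names invented -- shows the invariant that makes catch-up after a full reindex cheap: since "did" only ever grows, everything added after the reindex started is exactly the set of rows above the watermark:)

```java
import java.util.List;

public class DidWatermark {
    // Hypothetical row: an auto-increment "did" plus the index's uniqueKey.
    record Row(long did, String uniqueKey) {}

    // Rows needing (re)indexing after a watermark are those with did > watermark.
    static List<Row> newSince(List<Row> rows, long lastIndexedDid) {
        return rows.stream().filter(r -> r.did() > lastIndexedDid).toList();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, "alice"), new Row(2, "bob"), new Row(3, "carol"));
        // Only "carol" (did 3) arrived after watermark 2.
        System.out.println(newSince(rows, 2));
    }
}
```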

I'm not sure we need that kind of complexity. I'm happy with my current re-index implementation because it's very straightforward. The only downside is that if you delete-all-documents before the re-index (which should be a rare process indeed), then ... the index won't show those records that haven't yet been re-indexed. The user-search in my application is mostly administrative, so it shouldn't impact many "regular" users.

As I said, we are (ab)using Solr in a bit of a different way than is traditional :)

As it stands right this moment, the re-index and operational changes are being made in realtime on the same exact core, so, assuming there isn't a disaster, the index will always be consistent and up-to-date, and I don't have to do any post-re-index re-re-index of anything that may have been left behind or missed during the process.

Given the simple (possibly bordering on naive) process above, what steps would I need to take to ensure that the resulting core state is usable by Solr N+1, etc. in the future?

Just be sure that there are no Lucene segments in the index directory before you begin a full index rebuild.  If any are hanging around, you could end up with an incompatible version number in the index when you try to upgrade.

I'd prefer to only make calls via SolrJ or, if necessary, via REST. So "ensuring the files are deleted from the disk" is not really possible... the Solr server is "over there" and so I can't see the disk from my application.
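(One sanity check that is possible purely over SolrJ: after the delete and optimize, confirm that no live documents remain. This can't prove the segment files are actually gone from the remote disk, but it does catch the obvious failure mode before the re-index starts. URL and core name are placeholders:)

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;

public class VerifyEmpty {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new Http2SolrClient.Builder(
                "http://localhost:8983/solr/users").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0); // we only need the count, not the documents
            long live = solr.query(q).getResults().getNumFound();
            if (live != 0) {
                throw new IllegalStateException(
                        live + " documents still visible after wipe");
            }
        }
    }
}
```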

-chris