Shawn,
On 6/13/22 21:01, Shawn Heisey wrote:
On 6/13/2022 1:19 PM, Christopher Schultz wrote:
Okay. So if I do what I initially proposed:
1. delete *:*
2. re-index everything
If you do this but do not optimize the index (an operation that will complete practically instantaneously, because the index will consist of nothing but deleted docs), then I can make no guarantees. Lucene might be smart enough to delete all the segment files even without an optimize, but I hate leaving that to chance.
Is step 1 even necessary? If I refresh every single document in the
index, would it ultimately purge old segments by (a) marking all
documents in the old segments as "deleted" and (b) creating only new
segments to contain the new documents? I will be "replacing" each
document with an updated one: their ids will remain stable from
pre-re-index to post-re-index.
Yes, deleting the documents and making sure that all the segment files actually get deleted would be required if you want to be absolutely certain that no prior-version data remains. If you reindex docs into the same index and leave it up to Solr's "replace docs with the same id" functionality, you might get bitten by a segment written by the old version still sitting in the index after the full reindex is done, because background segment merging decides on its own which segments get rewritten and when. There is just no way I can think of to guarantee that won't happen, other than completely wiping the index.
Does that mean I need to:
1. delete *:*
2. optimize
3. re-index everything
Is #2 something available via the SolrJ client, or do I have to issue a
REST call for that?
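Looking at the SolrClient javadocs, I believe it is available: there appear to be deleteByQuery() and optimize() methods on SolrClient. Something like this untested sketch is what I have in mind (URL and core name are placeholders for my setup):

    // Untested sketch: wipe the index completely, then optimize so the
    // all-deleted segments are actually rewritten/removed on disk.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class WipeIndex {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/users").build()) {
                solr.deleteByQuery("*:*"); // step 1: delete every document
                solr.commit();             // make the deletions visible
                solr.optimize();           // step 2: near-instant on an all-deleted index
                // step 3, the full re-index, would follow here
            }
        }
    }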
Okay. I have one-and-only-one Solr node at this point (which is sufficient for my current needs), so it's a little simpler than your deployment described above. The one monkey wrench is that the "online core" could theoretically get updates while the "build core" is being re-generated from scratch. That isn't a problem if the re-index operation hasn't yet reached the user who was updated during that interval, but if a user who has already been re-indexed gets updated, the rebuilt core ends up holding a stale copy of that user.
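If that ever bites me, I imagine the fix is just to note the wall-clock time when the rebuild starts and make one catch-up pass afterward over anyone modified since then. A rough sketch (reindexAllUsers, findUsersModifiedSince, and indexUser are stand-ins for my real application code):

    import java.time.Instant;
    import java.util.List;

    public class RebuildWithCatchUp {
        void rebuild() throws Exception {
            Instant start = Instant.now();   // note when the rebuild began
            reindexAllUsers();               // full pass over every user
            // Catch-up pass: anyone modified while the full pass ran gets
            // re-indexed again, so no stale copy survives the rebuild.
            for (String userId : findUsersModifiedSince(start)) {
                indexUser(userId);
            }
        }

        // Stand-ins for the real application code:
        void reindexAllUsers() throws Exception { /* ... */ }
        List<String> findUsersModifiedSince(Instant since) { return List.of(); }
        void indexUser(String id) throws Exception { /* ... */ }
    }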
I used a MySQL database as the source for my index data. The primary key was "did" ... which is basically shorthand for "delete id". This was an autoincrement field in the database, and therefore effectively unique as well. The uniqueKey was another field in the index, though, and that was the true identifier for each doc.
I kept track of where both the ongoing indexing and a full reindex were
with that "did" field. Any new data was guaranteed to have a did value
larger than anything that was indexed before. I had a separate table
listing the uniqueKey value for each document that needed to be
reindexed "manually", and another table to track deletes.
I'm not sure we need that kind of complexity. I'm happy with my current re-index implementation because it's very straightforward. The only downside is that if you delete all documents before the re-index (which should be a rare process indeed), then the index won't return records that haven't yet been re-indexed. The user-search in my application is mostly administrative, so it shouldn't impact many "regular" users.
As I said, we are (ab)using Solr in a bit of a different way than is
traditional :)
As it stands right this moment, the re-index and operational changes are being made in realtime on the exact same core, so, assuming there isn't a disaster, the index will always be consistent and up-to-date, and I don't have to do any post-re-index re-re-indexing of anything that may have been left behind or missed during the process.
Given the simple (possibly bordering on naive) process above, what
steps would I need to take to ensure that the resulting core state is
usable by Solr N+1, etc. in the future?
Just be sure that there are no Lucene segments in the index directory
before you begin a full index rebuild. If any are hanging around, you
could end up with an incompatible version number in the index when you
try to upgrade.
I'd prefer to only make calls via SolrJ or, if necessary, via REST. So
"ensuring the files are deleted from the disk" is not really possible...
the Solr server is "over there" and so I can't see the disk from my
application.
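That said, it looks like the Luke request handler (/admin/luke) reports index details remotely, and SolrJ has a LukeRequest wrapper for it, so I'm assuming I could at least verify "the index really is empty" without shell access. An untested sketch; the segmentCount key is what I see in that handler's response, so treat it as an assumption:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;

    public class IndexEmptyCheck {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/users").build()) {
                LukeRequest luke = new LukeRequest(); // hits /admin/luke
                luke.setShowSchema(false);            // index stats only
                LukeResponse rsp = luke.process(solr);
                // "segmentCount" is the key I see in the /admin/luke response.
                Object segments = rsp.getIndexInfo().get("segmentCount");
                System.out.println("numDocs=" + rsp.getNumDocs()
                        + " maxDoc=" + rsp.getMaxDoc()
                        + " segmentCount=" + segments);
            }
        }
    }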
-chris