Shawn,
On 6/13/22 14:40, Shawn Heisey wrote:
> On 6/13/2022 10:14 AM, Christopher Schultz wrote:
>> 1. Re: regular re-indexes. I've just built this into my web
>> application so it's literally a one-click administrative
>> background-process kick-off. I've been trying to get automatic
>> schema-provisioning as well (see my recent posts to users@) just in
>> case the index doesn't even exist at first. The idea is to make new
>> application installations / DR a simpler and more automated process.
> The best option is to entirely eradicate the existing index before
> rebuilding it. One way to do this is to completely delete the index
> directory and then reload the core or restart Solr. Another way is to
> delete all documents and then optimize the index. Lucene will see that
> none of the segments contain non-deleted documents and will completely
> delete them all. It should be effectively equivalent to deleting the
> index directory and reloading. This is what my rebuild script for my
> current Solr install does. A full reindex only takes about ten minutes,
> though.
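For my own notes, that delete-and-optimize step boils down to something
like this with SolrJ (a sketch, untested, assuming a standalone core
named "users" on localhost):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;

public class WipeIndex {
    public static void main(String[] args) throws Exception {
        // Http2SolrClient is the SolrJ 9.x client; adjust for older SolrJ.
        try (SolrClient solr = new Http2SolrClient.Builder(
                "http://localhost:8983/solr/users").build()) {
            solr.deleteByQuery("*:*"); // mark every document as deleted
            solr.commit();
            solr.optimize(); // forceMerge; all-deleted segments are dropped
        }
    }
}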
In my testing for (automatically) creating cores from scratch, I found
that the schema for the core seems to survive that process. Running
"solr delete -c corename" will delete the core and the on-disk
directory. But re-creating the core somehow resurrects the old schema. I
can ask about that under separate cover, but is that going to complicate
this process?
Another option would be to create a core with a new name and then "swap
cores" which is a process I know exists merely because there is a button
for it in the admin web UI.
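I gather the swap itself is a single CoreAdmin call; something like this
with SolrJ (a sketch, with core names invented for illustration):

import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

// Equivalent to /solr/admin/cores?action=SWAP&core=users&other=users_build
CoreAdminRequest swap = new CoreAdminRequest();
swap.setAction(CoreAdminAction.SWAP);
swap.setCoreName("users");            // currently-live core
swap.setOtherCoreName("users_build"); // freshly rebuilt core
swap.process(solr);                   // client for http://localhost:8983/solr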
2. "Index upgrader tool" -- I have no idea what this is. Do I need to
care? Or are you saying that if I upgrade from 7.x -> 9.x I won't even
be able to write to the same on-disk index artifacts at all, unless I
create a new core?
IndexUpgrader is something provided by Lucene. All it does is a
forceMerge down to one segment -- equivalent to "optimize" in Solr. This
upgrades the index to the current Lucene version as fully as is
possible, but the version from the old segments is preserved even
through the merge.
That version preservation is why if you try upgrading from 7.x to 9.x,
even if you take an intermediate step of running IndexUpgrader from 8.x,
you won't even be able to READ the index, much less write to it. Lucene
will refuse to open it.
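For the archives: invoking IndexUpgrader looks roughly like this (a
sketch; the lucene-core jar on the classpath must be the version you are
upgrading TO, and it can only read an index from one major version back):

import java.nio.file.Paths;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;

public class Upgrade {
    public static void main(String[] args) throws Exception {
        // args[0] = path to the index directory (the one with segments_N)
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]))) {
            new IndexUpgrader(dir).upgrade(); // forceMerge(1) under the hood
        }
    }
}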
Okay. So if I do what I initially proposed:
1. delete *:*
2. re-index everything
But otherwise leave the core alone... will I have successfully
"re-built" the index such that I don't have that old version lurking
around waiting to bite me in future upgrades?
Is step 1 even necessary? If I refresh every single document in the
index, would it ultimately purge old segments by (a) marking all
documents in the old segments as "deleted" and (b) creating only new
segments to contain the new documents? I will be "replacing" each
document with an updated one: their ids will remain stable from
pre-re-index to post-re-index.
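Concretely, my refresh pass is just a loop like the following (a sketch;
User and fetchAllUsers() stand in for my application's own code):

import org.apache.solr.common.SolrInputDocument;

for (User u : fetchAllUsers()) { // fetchAllUsers() is hypothetical app code
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", u.getId()); // same uniqueKey as before, so this
                                   // replaces the old document in place
    doc.addField("name", u.getName());
    solr.add(doc);
}
solr.commit();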
>> 4. Re: Complete re-build of infrastructure + cut-over: we abuse Solr a
>> little and use it as an online system and not just a static "product
>> catalog" or whatever. We actually use it to store application user
>> information so we can perform quick user-searches. We have several
>> applications all connecting to the same index and contributing updates
>> and performing queries, so a clean switchover is difficult to do (we
>> aren't using an intermediate proxy). I suppose introducing a proxy
>> wouldn't be the worst possible idea.
> The way I managed this was a little involved.
> I had two complete online copies of the index, three if you count the
> dev server. Each copy was independently updated; I did not use
> replication. I used haproxy and pacemaker to float a virtual IP address
> between some of the servers and automatically switch to another copy of
> the index if the main copy went down.
> Each copy of the index had two cores for each shard -- a live core and a
> build core. A full rebuild would build into the build cores (wiping
> them as mentioned above before beginning any indexing), and then once
> the rebuild was completely done, swap the live cores with the build cores.
> In cloud mode, you cannot follow that paradigm precisely. Instead, you
> would just create a new collection for a rebuild, and once it was ready,
> update a collection alias to point to the new collection.
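For future reference (if I ever move to cloud mode), I assume the alias
flip looks something like this with SolrJ (a sketch; the collection and
config names are invented, and cloudClient would be a CloudSolrClient):

import org.apache.solr.client.solrj.request.CollectionAdminRequest;

// 1. Build a brand-new collection alongside the live one.
CollectionAdminRequest.createCollection("users_20220613", "users_conf", 1, 1)
        .process(cloudClient);
// 2. ... run the full reindex into users_20220613 ...
// 3. Repoint the alias that the applications actually query.
CollectionAdminRequest.createAlias("users", "users_20220613")
        .process(cloudClient);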
Okay. I have one-and-only-one Solr node at this point (which is
sufficient for my current needs) so it's a little simpler than your
deployment described above. The one monkey wrench is that the "online
core" could theoretically get updates while the "build core" is being
re-generated from scratch. That won't be a problem if the re-index
operation hasn't yet gotten to the user who was updated during that
interval, but if a user gets updated who was already re-indexed, then
there could be a problem.
As it stands right this moment, the re-index and operational changes are
being made in real time on the exact same core, so assuming there isn't
a disaster, the index will always be consistent and up-to-date, and I
don't have to do any post-re-index re-re-index of anything that may have
been left behind or missed during the process.
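If I ever do split into build/live cores, I suppose I'd need a catch-up
pass over anything modified mid-rebuild before swapping; something like
this (a sketch, where every helper named here is hypothetical
application code):

import java.time.Instant;

Instant rebuildStart = Instant.now();
reindexAllUsers(buildCore); // full pass into the build core
// Repair anything a user changed while the full pass was running.
for (User u : fetchUsersModifiedSince(rebuildStart)) {
    reindexUser(buildCore, u);
}
swapCores("users", "users_build"); // then swap, as above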
Given the simple (possibly bordering on naive) process above, what steps
would I need to take to ensure that the resulting core state is usable
by Solr N+1, etc. in the future?
Thanks,
-chris