Some thoughts: a lot depends on the use case for this sort of thing. For relatively small installs that can afford to run >2x disk and have significant query latency headroom, this might be useful. However, a company running a large cluster, where maintaining excess capacity costs tens of thousands of dollars, will often be "cutting it close" on available storage (I've seen yearly storage costs over $100k in some places) and trying to maintain just enough excess query performance to handle "normal" spikes in traffic. Adding the load/disk demands of a reindex within the same cluster (making both querying and indexing slower) is usually a bad idea. Even if you reindex from the old index into a new, separate cluster, the query load needed to pull the data out of the old index may put you above acceptable risk thresholds.

For large clusters, the only thing that makes sense is reindexing from source into a new cluster that will replace the old one, because that way you can (usually) pull the data much faster without impacting the users. (Notable exceptions crop up in cases where the original source is a live database also used by the users; then some care with the query rate is needed again.)
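To make "some care with the query rate" concrete, here's a rough SolrJ sketch of a throttled pull using cursorMark deep paging. The hosts, collection names, batch size, and sleep interval are made-up placeholders that would need tuning per install, and I'm using HttpSolrClient for brevity (newer SolrJ versions prefer Http2SolrClient):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class ThrottledExport {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient source = new HttpSolrClient.Builder("http://old-node:8983/solr/mycoll").build();
         HttpSolrClient target = new HttpSolrClient.Builder("http://new-node:8983/solr/mycoll").build()) {

      SolrQuery q = new SolrQuery("*:*");
      q.setRows(500);                                // batch size: tune to your latency headroom
      q.setSort(SolrQuery.SortClause.asc("id"));     // cursorMark requires a sort on the uniqueKey

      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = source.query(q);

        List<SolrInputDocument> batch = new ArrayList<>();
        for (SolrDocument d : rsp.getResults()) {
          SolrInputDocument in = new SolrInputDocument();
          for (String f : d.getFieldNames()) {
            if (!"_version_".equals(f)) {            // skip Solr's internal version field
              in.setField(f, d.getFieldValue(f));
            }
          }
          batch.add(in);
        }
        if (!batch.isEmpty()) target.add(batch);

        String next = rsp.getNextCursorMark();
        if (cursor.equals(next)) break;              // cursor stopped advancing: export is done
        cursor = next;
        Thread.sleep(200);                           // crude throttle to protect live query traffic
      }
      target.commit();
    }
  }
}

Even a toy loop like this shows where the knobs are (batch size, sleep, sort field) that any real tool would need to expose.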
I suppose another use case could be a cluster running on bare metal rather than on a service like AWS or a much larger virtualization environment. In the bare-metal case, spinning up new machines for temporary use is not an option, but again, this only helps if the bare-metal installation has enough extra capacity.

Another case that might make such a thing interesting would be *if* it were designed to co-locate the shards/replicas being reindexed, avoiding the need for over-the-wire transport (with caveats about hashing/routing changes, etc.). That could speed things up significantly, and a process might look like:

1. Upgrade (Solr will read indexes written by version N-1)
2. Clone to a 2x-disk cluster
3. Reindex into a peer collection (to reset the index version counter)
4. Update the alias, delete the original collection (steps 3-4 are sketched in SolrJ below)
5. Clone to a 1x-disk cluster
6. Swap, and sunset the original upgraded cluster

If folks have engineered an easy/efficient backup/clone process for steps 2 and 5, step 3 could be faster than reindexing from the original sources, reducing parallel run time (which could save money in large installs). Clear documentation of limitations, expected load profiles, throttling, etc. would be important in any case.

It's important to consider the "Big Data" case because, if you are lucky, "Small Data" grows into "Big Data." However, the transition can be subtle and badly trap people if it is not anticipated and well thought out.
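For concreteness, steps 3 and 4 could look something like the following with SolrJ's Collections API wrappers (REINDEXCOLLECTION has existed since Solr 8.1; the "logs"/"logs_v2" collections and the "logs_live" alias are hypothetical names):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class PeerReindexSwap {
  public static void main(String[] args) throws Exception {
    // Any node's base URL works for Collections API admin calls.
    try (HttpSolrClient client = new HttpSolrClient.Builder("http://node1:8983/solr").build()) {

      // Step 3: reindex "logs" into a peer collection "logs_v2" on the same
      // cluster, producing a freshly written index.
      CollectionAdminRequest.reindexCollection("logs")
          .setTarget("logs_v2")
          .process(client);

      // Step 4: repoint the serving alias, then drop the original collection
      // once doc counts and spot queries have been verified.
      CollectionAdminRequest.createAlias("logs_live", "logs_v2").process(client);
      CollectionAdminRequest.deleteCollection("logs").process(client);
    }
  }
}

Note that REINDEXCOLLECTION runs inside the same cluster, so the load and disk caveats above still apply; this only becomes attractive if the co-location trick makes the copy cheap.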
On Sun, Mar 30, 2025 at 9:21 AM ufuk yılmaz <uyil...@vivaldi.net.invalid> wrote:

> I'm guessing this is not simply retrieving all documents through the API
> using pagination and sending them to the index 🤔 About being in-place,
> how can it work when a new Solr version requires a different schema or
> config file, because from time to time old definitions don't work in a
> new version.
>
> -ufuk
>
> —
>
> > On Mar 30, 2025, at 10:33, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
> > <lkotzanie...@bloomberg.net> wrote:
> >
> > Hi Rahul,
> >
> > This sounds very interesting!
> >
> > I enjoyed the discussion at CoC and would be very interested to hear
> > more about the technical details.
> >
> > I am also curious to know more about what you mean by "in-place" and
> > what the expectation is around downtime.
> >
> > Either way, I am sure this would be a great addition to the tool belt
> > for getting people to finally move off ancient versions of Solr.
> >
> > Look forward to discussing this more on the JIRA!
> >
> > Luke
> >
> > From: users@solr.apache.org At: 03/28/25 01:05:57 UTC-4:00 To: users@solr.apache.org
> > Subject: Automatic upgrade of Solr indexes over multiple versions
> >
> > Today, upgrading from Solr version X to X+2 requires complete
> > reingestion of data from source. This comes from Lucene's constraint,
> > which only guarantees index compatibility between the version the
> > index was created in and the immediate next version.
> >
> > This reindexing usually comes with added downtime and/or cost.
> > Especially in the case of deployments which live in customer
> > environments and are not completely in the vendor's control, the
> > proposition of having to completely reindex the data can become a
> > hard sell.
> >
> > I have developed a way which achieves this reindexing in-place on the
> > same index. Also, the process automatically keeps "upgrading" the
> > indexes over multiple subsequent Solr upgrades without needing manual
> > intervention.
> >
> > It does come with a limitation that all *source* fields need to be
> > either stored=true or docValues=true. Any copyField destination
> > fields can be stored=false of course, but as long as the source field
> > (or in general, the fields you care about preserving) is either
> > stored or docValues true, the tool can reindex in-place and
> > legitimately "upgrade" the index. For indexes where this limitation
> > is not a problem (it wasn't for us!), this tool can remove a lot of
> > operational headaches, especially in environments with
> > hundreds/thousands of very large indexes.
> >
> > I had a conversation about this with some of you during "Apache
> > Community over Code 2024" in Denver, and I could sense some interest.
> > If this feature sounds appealing, I would like to contribute it to
> > Solr on behalf of my employer, Commvault. Please let me know if I
> > should create a JIRA and get the discussion rolling!
> >
> > Thanks,
> > Rahul Goswami

-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)