Some good points have been brought up in the discussion. The implementation we have reindexes a shard by reading all documents back onto itself, while ensuring that no older-version segment is merged with a freshly written segment. This happens with zero downtime and without requiring a large storage buffer. By the end of the process, you have an index which Solr identifies as "created in the newer version".
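[Editor's note: a minimal sketch of one way a "no mixed-version merges" constraint like the one described above could be expressed, assuming a recent Lucene (8+/9+) FilterMergePolicy API. The class name is hypothetical and this is an illustration, not the actual implementation being proposed.]

import java.io.IOException;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

/**
 * Hypothetical merge policy that only keeps merges whose segments were all
 * written by the same Lucene major version, so segments carried over from an
 * older version are never merged together with freshly reindexed ones.
 * Only natural merges are filtered here; a real policy would also cover
 * findForcedMerges/findForcedDeletesMerges.
 */
public class SameMajorVersionMergePolicy extends FilterMergePolicy {

  public SameMajorVersionMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger mergeTrigger,
                                       SegmentInfos segmentInfos,
                                       MergeContext mergeContext) throws IOException {
    MergeSpecification spec = super.findMerges(mergeTrigger, segmentInfos, mergeContext);
    if (spec == null) {
      return null;
    }
    MergeSpecification filtered = new MergeSpecification();
    for (OneMerge merge : spec.merges) {
      if (sameMajorVersion(merge)) {
        filtered.add(merge);
      }
    }
    return filtered.merges.isEmpty() ? null : filtered;
  }

  private boolean sameMajorVersion(OneMerge merge) {
    int major = -1;
    for (SegmentCommitInfo sci : merge.segments) {
      int segMajor = sci.info.getVersion().major;
      if (major == -1) {
        major = segMajor;
      } else if (major != segMajor) {
        return false; // would mix an older-version segment with a newer one
      }
    }
    return true;
  }
}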
We have tested it on 5+ TB indexes and are happy with the results. Some hit to application performance is expected, but for us it is within acceptable limits. With more input from the community, I am sure we can polish it further. The goal is to have at least something that works for a significant user base, or at least to make an option available so users can decide based on their individual use cases. I am working on the design doc to get the discussion started and will share the JIRA by tomorrow night.

-Rahul

On Mon, Mar 31, 2025 at 1:18 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <lkotzanie...@bloomberg.net> wrote:

> >> the only thing that makes sense is reindexing from source to a new cluster that will replace the old cluster

> Ideally yes, but there is a social aspect when Solr is managed as a service and the many sources are opaque to the team managing it. Let's assume for the sake of argument that the below is true or achievable:

> >> solution has enough extra capacity

> I am interested in this:

> >> Another case that might make such a thing interesting would be *if* it was designed to co-locate shards/replicas being reindexed and prevented the need for over the wire transport (caveats about hashing/routing changes, etc). That could speed things up significantly, and a process might look like

> Let's assume the shard routing is invariant across versions. If you were able to create these upgraded local replicas from their respective lower-version source replicas, how easily could you stitch them together again into a SolrCloud cluster? If you were cloning from a collection that was receiving some live traffic, it might be hard, because I imagine you'd need to know which replica of a particular shard was most up to date and ensure that replica became the leader in the new cloud. So would this effectively require some kind of special leader-election logic, or at least some knowledge of the source transaction log as well?

> If we assume a pause to live traffic then this becomes simpler, but then you have the social aspect of coordinating with many teams again.

> In our case, we were considering developing a dual-write system living outside of Solr, with a versionField defined to ensure consistent ordering between the two clouds, n and n+m. The actual backfill could then be kicked off from some snapshot taken *after* we enabled dual writes. Finally, the old cloud would be deleted once we routed traffic to the new one (and let it "bake"). As Gus points out, at "big data" scale the backfill becomes hard, and so the idea of making this less resource intensive is enticing...
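[Editor's note: a minimal sketch of the external dual-write idea described above, assuming Solr's document-based version constraints (DocBasedVersionConstraintsProcessorFactory) are configured on both collections with a versionField, and assuming SolrJ 8.x's CloudSolrClient.Builder. The DualWriter class, the my_version_l field, and the ZooKeeper host lists are hypothetical names used only for illustration.]

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

/**
 * Hypothetical dual-writer living outside Solr: every update is sent to both
 * the old (version n) and new (version n+m) clouds with the same external
 * version value, so a versionField-based update processor on each side can
 * discard stale writes and keep the two clouds converging to the same state
 * regardless of delivery order. Commits are assumed to happen via autoCommit.
 */
public class DualWriter {

  private final CloudSolrClient oldCloud;
  private final CloudSolrClient newCloud;
  private final String collection;

  public DualWriter(List<String> oldZkHosts, List<String> newZkHosts, String collection) {
    this.oldCloud = new CloudSolrClient.Builder(oldZkHosts, Optional.empty()).build();
    this.newCloud = new CloudSolrClient.Builder(newZkHosts, Optional.empty()).build();
    this.collection = collection;
  }

  public void write(String id, long externalVersion, Map<String, Object> fields)
      throws SolrServerException, IOException {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    doc.addField("my_version_l", externalVersion); // monotonically increasing per document
    fields.forEach(doc::addField);

    // Send the identical document to both clouds; the version constraint on
    // each side ignores anything older than what that cloud already has.
    oldCloud.add(collection, doc);
    newCloud.add(collection, doc);
  }
}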
> From: users@solr.apache.org At: 03/30/25 14:59:16 UTC-4:00 To: users@solr.apache.org
> Subject: Re: Automatic upgrade of Solr indexes over multiple versions

> Some thoughts:

> A lot depends on the use case for this sort of thing. In the case of relatively small installs that can afford to run >2x disk and have significant query-latency headroom, this might be useful. However, if a company is running a large cluster where maintaining excess capacity costs tens of thousands of dollars, they will often be "cutting it close" on available storage (I've seen yearly storage costs over 100k in some places) and trying to maintain just enough excess query performance to handle "normal" spikes in traffic. Adding the load/disk demands of a re-index within the same cluster (making both querying and indexing slower) is usually a bad idea. Even if you reindex from the existing index into a new, separate cluster, the query load needed to pull the data out of that index may put you above acceptable risk thresholds. For large clusters the only thing that makes sense is reindexing from source into a new cluster that will replace the old cluster, because that way you can (usually) pull the data much faster without impacting the users. (Notable exceptions crop up where the original source is a live database also used by the users; then some care with the query rate is needed again.)

> I suppose another use case could be a cluster run on bare metal rather than on a service like AWS or in a much larger virtualization environment. In the bare-metal case, spinning up new machines for temporary use is not an option, but again only if the bare-metal solution has enough extra capacity.

> Another case that might make such a thing interesting would be *if* it was designed to co-locate shards/replicas being reindexed and prevented the need for over-the-wire transport (caveats about hashing/routing changes, etc.). That could speed things up significantly, and a process might look like:

> 1. Upgrade (Solr will read index version -1)
> 2. Clone to a 2x-disk cluster
> 3. Reindex into a peer collection (to reset the index version counter)
> 4. Update alias, delete original collection
> 5. Clone to a 1x-disk cluster
> 6. Swap and sunset the original upgraded cluster.

> If folks have engineered an easy/efficient cluster backup/clone for steps 2 and 5, step 3 could be faster than reindexing from the originals, reducing parallel run time (which could save money in large installs).

> Clear documentation of limitations, expected load profiles, throttling, etc. would be important in any case. It's important to consider the "Big Data" case because, if you are lucky, "Small Data" grows into "Big Data." However, the transition can be subtle and can badly trap people if it is not anticipated and well thought out.
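[Editor's note: a minimal sketch of how steps 3 and 4 of the list above might be driven through the Collections API (REINDEXCOLLECTION, CREATEALIAS, DELETE), using plain HTTP from Java. REINDEXCOLLECTION is only available in Solr 8.1+ and, like the in-place approach discussed in this thread, requires source fields to be stored or have docValues. The host, collection, and alias names are placeholders.]

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Sketch of steps 3-4: reindex into a peer collection, then repoint an alias
 * and drop the original. Host/collection/alias names below are placeholders.
 */
public class PeerReindexSketch {

  private static final String SOLR = "http://localhost:8983/solr";
  private static final HttpClient HTTP = HttpClient.newHttpClient();

  public static void main(String[] args) throws IOException, InterruptedException {
    // Step 3: reindex "products" into a fresh peer collection "products_v2",
    // so the new collection's segments are all written by the current version.
    call("action=REINDEXCOLLECTION&name=products&target=products_v2");

    // ... poll "action=REINDEXCOLLECTION&name=products&cmd=status" until finished ...

    // Step 4: point the alias at the new collection, then delete the old one.
    call("action=CREATEALIAS&name=products_alias&collections=products_v2");
    call("action=DELETE&name=products");
  }

  private static void call(String params) throws IOException, InterruptedException {
    HttpRequest req = HttpRequest.newBuilder(
        URI.create(SOLR + "/admin/collections?" + params)).GET().build();
    HttpResponse<String> rsp = HTTP.send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println(rsp.statusCode() + " " + rsp.body());
  }
}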
> On Sun, Mar 30, 2025 at 9:21 AM ufuk yılmaz <uyil...@vivaldi.net.invalid> wrote:

> > I'm guessing this is not simply retrieving all documents through the API using pagination and sending them back for indexing 🤔 About being in-place: how can it work when a new Solr version requires a different schema or config file? From time to time, old definitions don't work in a new version.

> > -ufuk

> > —

> > > On Mar 30, 2025, at 10:33, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <lkotzanie...@bloomberg.net> wrote:

> > > Hi Rahul,

> > > This sounds very interesting!

> > > I enjoyed the discussion at CoC and would be very interested to hear more about the technical details.

> > > I am also curious to know more about what you mean by "in-place" and what the expectation is around downtime.

> > > Either way, I am sure this would be a great addition to the tool belt for getting people to finally move off ancient versions of Solr.

> > > Look forward to discussing this more on the JIRA!

> > > Luke

> > > From: users@solr.apache.org At: 03/28/25 01:05:57 UTC-4:00 To: users@solr.apache.org
> > > Subject: Automatic upgrade of Solr indexes over multiple versions

> > > Today, upgrading from Solr version X to X+2 requires complete reingestion of data from source. This comes from Lucene's constraint, which only guarantees index compatibility between the version the index was created in and the immediate next major version.

> > > This reindexing usually comes with added downtime and/or cost. Especially for deployments that live in customer environments and are not completely under the vendor's control, the proposition of having to completely reindex the data can become a hard sell.

> > > I have developed a way which achieves this reindexing in-place on the same index. Also, the process automatically keeps "upgrading" the indexes over multiple subsequent Solr upgrades without needing manual intervention.

> > > It does come with a limitation that all *source* fields need to be either stored=true or docValues=true. Any copyField destination fields can be stored=false of course, but as long as the source field (or, in general, the fields you care about preserving) is either stored or docValues true, the tool can reindex in-place and legitimately "upgrade" the index. For indexes where this limitation is not a problem (it wasn't for us!), this tool can remove a lot of operational headaches, especially in environments with hundreds/thousands of very large indexes.

> > > I had a conversation about this with some of you during "Apache Community over Code 2024" in Denver, and I could sense some interest. If this feature sounds appealing, I would like to contribute it to Solr on behalf of my employer, Commvault. Please let me know if I should create a JIRA and get the discussion rolling!

> > > Thanks,
> > > Rahul Goswami

> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
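[Editor's note: a minimal sketch of a pre-flight check for the stored/docValues limitation Rahul describes in his original message, assuming the SolrJ Schema API. The class name, base URL, and collection name are placeholders, and the check only looks at flags declared directly on each field; a complete check would also consult the field type's defaults and dynamic fields.]

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;
import org.apache.solr.client.solrj.response.schema.SchemaResponse;

/**
 * Pre-flight check for the in-place reindex limitation: every source field we
 * care about must be retrievable, i.e. stored=true or docValues=true.
 * Simplified: properties inherited from the field type are not resolved here.
 */
public class StoredOrDocValuesCheck {

  public static void main(String[] args) throws Exception {
    String solrUrl = "http://localhost:8983/solr"; // placeholder
    String collection = "products";                // placeholder

    try (SolrClient client = new Http2SolrClient.Builder(solrUrl).build()) {
      SchemaResponse.FieldsResponse rsp =
          new SchemaRequest.Fields().process(client, collection);
      List<Map<String, Object>> fields = rsp.getFields();

      for (Map<String, Object> field : fields) {
        boolean stored = Boolean.TRUE.equals(field.get("stored"));
        boolean docValues = Boolean.TRUE.equals(field.get("docValues"));
        if (!stored && !docValues) {
          System.out.println("Not reindexable in place (neither stored nor docValues): "
              + field.get("name"));
        }
      }
    }
  }
}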