That's interesting, but it raises a question: what happens if a node (or the whole cluster) is rebooted in the middle of the process?
On Mon, Mar 31, 2025 at 10:02 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Some good points brought up in the discussion. The implementation we have
> reindexes a shard onto itself by reading all of its documents back, but
> takes care that no older-version segment merges with a fresh segment.
> This happens with zero downtime and without requiring a large storage
> buffer. By the end of the process, you have an index which Solr identifies
> as being "created in the newer version".
>
> We have tested it on 5+ TB indexes and are happy with the results. Some hit
> to application performance is expected, but for us it is within acceptable
> limits. With more input from the community, I am sure we can polish it
> further. The goal is to have at least something which will work for a
> significant user base, or at least to have an option available to choose
> based on individual use cases.
>
> I am working on the design doc to get the discussion started and will share
> the JIRA by tomorrow night.
>
> -Rahul
>
> On Mon, Mar 31, 2025 at 1:18 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
> <lkotzanie...@bloomberg.net> wrote:
>
> > >> the only thing that makes sense is reindexing from source to a new
> > >> cluster that will replace the old cluster
> >
> > Ideally yes, but there is a social aspect when Solr is managed as a
> > service and the many sources are opaque to the team managing it.
> > Let's assume for the sake of argument that the below is true or
> > achievable:
> >
> > >> solution has enough extra capacity
> >
> > I am interested in this:
> >
> > >> Another case that might make such a thing interesting would be *if*
> > >> it was designed to co-locate shards/replicas being reindexed and
> > >> prevented the need for over-the-wire transport (caveats about
> > >> hashing/routing changes, etc). That could speed things up
> > >> significantly, and a process might look like
> >
> > Let's assume the shard routing is invariant across versions: if you
> > were able to create these upgraded local replicas from their respective
> > lower-version source replicas, how easily could you stitch them together
> > again into a SolrCloud cluster? If you were cloning from a collection
> > that was receiving some live traffic it might be hard, because I imagine
> > you'd need to know which replica of a particular shard was most
> > up-to-date and ensure that replica became the leader in the new cloud.
> > So would this effectively require some kind of special leader election
> > logic, or at least some knowledge of the source transaction log as well?
> >
> > If we assume a pause to live traffic then this becomes simpler, but then
> > you have the social aspect of coordinating with many teams again.
> >
> > In our case, we were considering developing a dual-write system with
> > a versionField defined to ensure consistent ordering between the two
> > clouds, n and n+m, and having this live outside of Solr. Then the
> > actual backfill could be kicked off from some snapshot taken *after*
> > we enabled dual writes, and finally the old cloud would be deleted once
> > we routed traffic to the new one (and let it "bake"). As Gus points
> > out, at "big data" scale the backfill becomes hard, and so the idea of
> > making this less resource intensive is enticing...
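The segment handling Rahul describes above is the part I'm most curious about.
I have no idea how the actual implementation enforces it, but conceptually the
"no older version segment merges with a fresh segment" guarantee sounds like a
wrapper merge policy. A minimal sketch against roughly the Lucene 9.x API
(class and method names are mine, purely illustrative):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.index.FilterMergePolicy;
    import org.apache.lucene.index.MergePolicy;
    import org.apache.lucene.index.MergeTrigger;
    import org.apache.lucene.index.SegmentCommitInfo;
    import org.apache.lucene.index.SegmentInfos;

    // Wraps the configured merge policy and drops any proposed merge whose
    // segments were written by different major Lucene versions, so segments
    // created before the upgrade never mix with freshly reindexed ones.
    // A real implementation would also need to cover forced merges.
    public class SameMajorVersionMergePolicy extends FilterMergePolicy {

      public SameMajorVersionMergePolicy(MergePolicy in) {
        super(in);
      }

      @Override
      public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                           MergeContext ctx) throws IOException {
        MergeSpecification proposed = super.findMerges(trigger, infos, ctx);
        if (proposed == null) {
          return null;
        }
        MergeSpecification filtered = new MergeSpecification();
        for (OneMerge merge : proposed.merges) {
          if (sameMajor(merge.segments)) {
            filtered.add(merge);
          }
        }
        return filtered.merges.isEmpty() ? null : filtered;
      }

      private static boolean sameMajor(List<SegmentCommitInfo> segments) {
        int major = -1;
        for (SegmentCommitInfo sci : segments) {
          int segMajor = sci.info.getVersion().major;
          if (major == -1) {
            major = segMajor;
          } else if (major != segMajor) {
            return false; // mixes pre- and post-upgrade segments; skip this merge
          }
        }
        return true;
      }
    }

If something like that is in play, it also makes my restart question concrete:
whatever marks a segment as already reindexed (the segment version here, or
some other bookkeeping) has to survive a node coming back up mid-run.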
> >
> > From: users@solr.apache.org At: 03/30/25 14:59:16 UTC-4:00
> > To: users@solr.apache.org
> > Subject: Re: Automatic upgrade of Solr indexes over multiple versions
> >
> > Some thoughts:
> >
> > A lot depends on the use case for this sort of thing. In the case of
> > relatively small installs that can afford to run >2x disk and have
> > significant query latency headroom, this might be useful. However, if a
> > company is running a large cluster where maintaining excess capacity
> > costs tens of thousands of dollars, they will often be "cutting it
> > close" on available storage (I've seen yearly storage costs over 100k in
> > some places) and trying to maintain just enough excess query performance
> > to handle "normal" spikes in traffic. Adding the load/disk demands of a
> > reindex within the same cluster (making both query and indexing slower)
> > is usually a bad idea. Even if you reindex from the existing index into
> > a new, separate cluster, the query load to pull the data from the index
> > may place you above acceptable risk thresholds. For large clusters the
> > only thing that makes sense is reindexing from source to a new cluster
> > that will replace the old cluster, because that way you can (usually)
> > pull the data much faster without impacting the users. (Notable
> > exceptions crop up in cases where the original source is a live database
> > also used by the users; then some care with the query rate is needed
> > again.)
> >
> > I suppose another use case could be if the cluster is being run on bare
> > metal rather than a service like AWS or a much larger virtualization
> > environment. In the bare-metal case spinning up new machines for
> > temporary use is not an option, but again only if the bare-metal
> > solution has enough extra capacity.
> >
> > Another case that might make such a thing interesting would be *if* it
> > was designed to co-locate shards/replicas being reindexed and prevented
> > the need for over-the-wire transport (caveats about hashing/routing
> > changes, etc). That could speed things up significantly, and a process
> > might look like:
> >
> > 1. Upgrade (Solr will still read an index created by version n-1)
> > 2. Clone to a 2x-disk cluster
> > 3. Reindex into a peer collection (to reset the index version counter)
> > 4. Update alias, delete original collection
> > 5. Clone to a 1x-disk cluster
> > 6. Swap and sunset the original upgraded cluster
> >
> > If folks have engineered an easy/efficient backup/clone cluster for
> > steps 2 and 5, step 3 could be faster than reindexing from originals,
> > reducing parallel run time (which could save money in large installs).
> >
> > Clear documentation of limitations, expected load profiles, throttling,
> > etc. would be important in any case. It's important to consider the "Big
> > Data" case because if you are lucky, "Small Data" grows into "Big Data."
> > However, the transition can be subtle and can badly trap people if it is
> > not anticipated and well thought out.
> >
> > On Sun, Mar 30, 2025 at 9:21 AM ufuk yılmaz <uyil...@vivaldi.net.invalid>
> > wrote:
> >
> > > I'm guessing this is not simply retrieving all documents through the
> > > API using pagination and sending them back to the index 🤔 About being
> > > in-place, how can it work when a new Solr version requires a different
> > > schema or config file, since from time to time old definitions don't
> > > work in a new version?
> > >
> > > -ufuk
> > >
> > > —
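On the numbered list from my earlier message (quoted above), step 4 is the
piece that can be done atomically through the Collections API once the peer
collection has caught up. A rough SolrJ sketch of just that step (the
alias/collection names and the Solr URL are made up for illustration):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class AliasSwap {
      public static void main(String[] args) throws Exception {
        // Illustrative names: "products" is the user-facing alias,
        // "products_v2" the reindexed peer collection, "products_v1" the
        // original collection.
        try (SolrClient client =
            new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
          // Re-point the alias so queries start hitting the reindexed collection.
          CollectionAdminRequest.createAlias("products", "products_v2").process(client);
          // Drop the original only after the alias change has been verified.
          CollectionAdminRequest.deleteCollection("products_v1").process(client);
        }
      }
    }

Since CREATEALIAS on an existing alias name just re-points it, the cutover
itself is a metadata change rather than a data copy; the delete is the only
part that can't be undone.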
> > > > On Mar 30, 2025, at 10:33, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
> > > > <lkotzanie...@bloomberg.net> wrote:
> > > >
> > > > Hi Rahul,
> > > >
> > > > This sounds very interesting!
> > > >
> > > > I enjoyed the discussion at CoC and would be very
> > > > interested to hear more about the technical details.
> > > >
> > > > I am also curious to know more about what you mean by "in-place"
> > > > and what the expectation is around downtime.
> > > >
> > > > Either way, I am sure this would be a great addition to
> > > > the tool belt for getting people to finally move off
> > > > ancient versions of Solr.
> > > >
> > > > Look forward to discussing this more on the JIRA!
> > > >
> > > > Luke
> > > >
> > > > From: users@solr.apache.org At: 03/28/25 01:05:57 UTC-4:00
> > > > To: users@solr.apache.org
> > > > Subject: Automatic upgrade of Solr indexes over multiple versions
> > > >
> > > > Today, upgrading from Solr version X to X+2 requires complete
> > > > reingestion of data from source. This comes from Lucene's constraint,
> > > > which only guarantees index compatibility between the version the
> > > > index was created in and the immediate next version.
> > > >
> > > > This reindexing usually comes with added downtime and/or cost.
> > > > Especially in the case of deployments which are in customer
> > > > environments and not completely under the vendor's control, the
> > > > proposition of having to completely reindex the data can become a
> > > > hard sell.
> > > >
> > > > I have developed a way which achieves this reindexing in-place on
> > > > the same index. Also, the process automatically keeps "upgrading"
> > > > the indexes over multiple subsequent Solr upgrades without needing
> > > > manual intervention.
> > > >
> > > > It does come with a limitation that all *source* fields need to be
> > > > either stored=true or docValues=true. Any copyField destination
> > > > fields can be stored=false of course, but as long as the source
> > > > field (or in general, the fields you care about preserving) is
> > > > either stored=true or docValues=true, the tool can reindex in-place
> > > > and legitimately "upgrade" the index. For indexes where this
> > > > limitation is not a problem (it wasn't for us!), this tool can
> > > > remove a lot of operational headaches, especially in environments
> > > > with hundreds/thousands of very large indexes.
> > > >
> > > > I had a conversation about this with some of you during "Apache
> > > > Community Over Code 2024" in Denver, and I could sense some
> > > > interest. If this feature sounds appealing, I would like to
> > > > contribute it to Solr on behalf of my employer, Commvault. Please
> > > > let me know if I should create a JIRA and get the discussion
> > > > rolling!
> > > >
> > > > Thanks,
> > > > Rahul Goswami
> >
> > --
> > http://www.needhamsoftware.com (work)
> > https://a.co/d/b2sZLD9 (my fantasy fiction book)

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
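P.S. For anyone wanting to check up front whether a schema meets the
stored/docValues requirement Rahul describes, something along these lines
against the Schema API would flag the fields at risk. The collection name is
made up, and fields that only inherit docValues from their field type would
need an extra lookup that this simple version skips:

    import java.util.Map;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.request.schema.SchemaRequest;
    import org.apache.solr.client.solrj.response.schema.SchemaResponse;

    public class ReindexabilityCheck {
      public static void main(String[] args) throws Exception {
        try (SolrClient client =
            new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
          // Fetch the explicitly declared fields for the collection.
          SchemaResponse.FieldsResponse fields =
              new SchemaRequest.Fields().process(client, "techproducts");
          for (Map<String, Object> field : fields.getFields()) {
            boolean stored = Boolean.TRUE.equals(field.get("stored"));
            boolean docValues = Boolean.TRUE.equals(field.get("docValues"));
            // A field that is neither stored nor docValues cannot be recovered
            // from the index itself, so an in-place reindex would lose it.
            if (!stored && !docValues) {
              System.out.println("Not recoverable in-place: " + field.get("name"));
            }
          }
        }
      }
    }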