>> the only thing that makes sense is reindexing from source to a new
>> cluster that will replace the old cluster

Ideally yes, but there is a social aspect when Solr is managed as a service
and the many sources are opaque to the team managing it.

Let's assume for the sake of argument that the below is true or achievable:

>> solution has enough extra capacity

I am interested in this:

>> Another case that might make such a thing interesting would be *if* it was
>> designed to co-locate shards/replicas being reindexed and prevented the
>> need for over the wire transport (caveats about hashing/routing changes,
>> etc). That could speed things up significantly, and a process might look
>> like

Assuming the shard routing is invariant across versions: if you were able to
create these upgraded local replicas from their respective lower-version
source replicas, how easily could you stitch them back together into a
SolrCloud? If you were cloning from a collection that was receiving some live
traffic, it might be hard, because I imagine you'd need to know which replica
of a particular shard was most up to date and ensure that replica became the
leader in the new cloud. So would this effectively require some kind of
special leader election logic, or at least some knowledge of the source
transaction log as well? If we assume a pause to live traffic then this
becomes simpler, but then you have the social aspect of coordinating with
many teams again.

In our case, we were considering developing a dual-write system, living
outside of Solr, with a versionField defined to ensure consistent ordering
between the two clouds, n and n+m. The actual backfill could then be kicked
off from some snapshot taken *after* we enabled dual writes. And then finally
we would delete the old cloud once we routed traffic to the new one (and let
it "bake"). A rough sketch of what such a dual writer could look like is
below.

As Gus points out, at "big data" scale the backfill becomes hard, and so the
idea of making this less resource intensive is enticing...
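To make the idea concrete, here is a minimal sketch of the dual writer. It
assumes both collections configure Solr's
DocBasedVersionConstraintsProcessorFactory with versionField="version_l" in
their update chains, so out-of-order replays and the snapshot backfill are
dropped rather than clobbering newer data. The URLs, collection name, and
field name are placeholders, not anything we've actually built:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    /**
     * Writes every update to both the old (version n) and new (version n+m)
     * clouds, stamped with a monotonically increasing version taken from the
     * source system. With DocBasedVersionConstraintsProcessorFactory
     * enforcing "version_l" on both sides, the backfill from the snapshot
     * can run concurrently: any stale update is silently ignored.
     */
    public class DualWriter implements AutoCloseable {
      private final SolrClient oldCloud;
      private final SolrClient newCloud;
      private final String collection;

      public DualWriter(String oldUrl, String newUrl, String collection) {
        this.oldCloud = new Http2SolrClient.Builder(oldUrl).build();
        this.newCloud = new Http2SolrClient.Builder(newUrl).build();
        this.collection = collection;
      }

      public void write(SolrInputDocument doc, long sourceVersion) throws Exception {
        // Ordering is decided by the source system, not by either Solr cloud.
        doc.setField("version_l", sourceVersion);
        oldCloud.add(collection, doc);
        newCloud.add(collection, doc); // real code needs retries/queueing on partial failure
      }

      @Override
      public void close() throws Exception {
        oldCloud.close();
        newCloud.close();
      }
    }

The interesting failure mode is a write that lands on one cloud but not the
other; the external version field is what lets a later replay, or the
backfill itself, repair that without reintroducing stale data.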
From: users@solr.apache.org At: 03/30/25 14:59:16 UTC-4:00
To: users@solr.apache.org
Subject: Re: Automatic upgrade of Solr indexes over multiple versions

Some thoughts:

A lot depends on the use case for this sort of thing. In the case of
relatively small installs that can afford to run >2x disk and have
significant query latency headroom, this might be useful. However, if a
company is running a large cluster where maintaining excess capacity costs
tens of thousands of dollars, they will often be "cutting it close" on
available storage (I've seen yearly storage costs over 100k in some places)
and trying to maintain just enough excess query performance to handle
"normal" spikes in traffic. Adding the load/disk demands of a reindex within
the same cluster (making both query and indexing slower) is usually a bad
idea. Even if you reindex from the existing index into a new, separate
cluster, the query load to pull the data out of the index may place you
above acceptable risk thresholds.

For large clusters the only thing that makes sense is reindexing from source
to a new cluster that will replace the old cluster, because that way you can
(usually) pull the data much faster without impacting the users. (Notable
exceptions crop up in cases where the original source is a live database
also used by the users; then some care with the query rate is needed again.)

I suppose another use case could be if the cluster is being run on bare
metal rather than on a service like AWS or a much larger virtualization
environment. In the bare metal case, spinning up new machines for temporary
use is not an option, but again only if the bare metal solution has enough
extra capacity.

Another case that might make such a thing interesting would be *if* it was
designed to co-locate shards/replicas being reindexed and prevented the need
for over-the-wire transport (caveats about hashing/routing changes, etc.).
That could speed things up significantly, and a process might look like the
following (steps 3 and 4 are sketched in code below):

1. Upgrade (Solr will still read version n-1 indexes)
2. Clone to a 2x disk cluster
3. Reindex into a peer collection (to reset the index version counter)
4. Update the alias, delete the original collection
5. Clone to a 1x disk cluster
6. Swap and sunset the original upgraded cluster

If folks have engineered an easy/efficient backup/clone of the cluster for
steps 2 and 5, step 3 could be faster than reindexing from the originals,
reducing parallel run time (which could save money in large installs).
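As a purely hypothetical illustration of steps 3 and 4 using nothing but the
stock Collections API on the already-upgraded cluster (Solr 8.1+, where
REINDEXCOLLECTION exists): collection names here are placeholders, error
handling and async status polling are omitted, and note that
REINDEXCOLLECTION carries the same stored/docValues caveat discussed
elsewhere in this thread.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class AliasSwap {
      private static final HttpClient HTTP = HttpClient.newHttpClient();
      private static final String SOLR = "http://localhost:8983/solr";

      // Fire one Collections API command and return the raw JSON response.
      private static String collections(String query) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
            URI.create(SOLR + "/admin/collections?" + query)).build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
      }

      public static void main(String[] args) throws Exception {
        // Step 3: rewrite "products" into a peer collection; this writes
        // brand-new segments and so resets the Lucene index-version clock.
        collections("action=REINDEXCOLLECTION&name=products&target=products_reindexed");

        // Step 4: atomically repoint the serving alias, then drop the original.
        collections("action=CREATEALIAS&name=products_live&collections=products_reindexed");
        collections("action=DELETE&name=products");
      }
    }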
Clear documentation of limitations, expected load profiles, throttling, etc.
would be important in any case.

It's important to consider the "Big Data" case because, if you are lucky,
"Small Data" grows into "Big Data." However, the transition can be subtle
and can badly trap people if it is not anticipated and well thought out.

On Sun, Mar 30, 2025 at 9:21 AM ufuk yılmaz <uyil...@vivaldi.net.invalid>
wrote:

> I'm guessing this is not simply retrieving all documents through the API
> using pagination and sending them to be indexed 🤔 About being in-place,
> how can it work when a new Solr version requires a different schema or
> config file, because from time to time old definitions don't work in a
> new version.
>
> -ufuk
>
> —
>
> > On Mar 30, 2025, at 10:33, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <
> > lkotzanie...@bloomberg.net> wrote:
> >
> > Hi Rahul,
> >
> > This sounds very interesting!
> >
> > I enjoyed the discussion at CoC and would be very interested to hear
> > more about the technical details.
> >
> > I am also curious to know more about what you mean by "in-place" and
> > what the expectation is around downtime.
> >
> > Either way, I am sure this would be a great addition to the tool belt
> > for getting people to finally move off ancient versions of Solr.
> >
> > Look forward to discussing this more on the JIRA!
> >
> > Luke
> >
> > From: users@solr.apache.org At: 03/28/25 01:05:57 UTC-4:00
> > To: users@solr.apache.org
> > Subject: Automatic upgrade of Solr indexes over multiple versions
> >
> > Today, upgrading from Solr version X to X+2 requires complete
> > reingestion of data from the source. This comes from Lucene's
> > constraint, which only guarantees index compatibility between the
> > version the index was created in and the immediate next version.
> >
> > This reindexing usually comes with added downtime and/or cost.
> > Especially in the case of deployments which are in customer
> > environments and not completely in the control of the vendor, this
> > proposition of having to completely reindex the data can become a
> > hard sell.
> >
> > I have developed a way that achieves this reindexing in-place on the
> > same index. Also, the process automatically keeps "upgrading" the
> > indexes over multiple subsequent Solr upgrades without needing manual
> > intervention.
> >
> > It does come with the limitation that all *source* fields need to be
> > either stored=true or docValues=true. Any copyField destination
> > fields can be stored=false of course, but as long as the source field
> > (or in general, the fields you care about preserving) is either
> > stored or docValues true, the tool can reindex in-place and
> > legitimately "upgrade" the index. For indexes where this limitation
> > is not a problem (it wasn't for us!), this tool can remove a lot of
> > operational headaches, especially in environments with
> > hundreds/thousands of very large indexes.
> >
> > I had a conversation about this with some of you during "Apache
> > Community over Code 2024" in Denver, and I could sense some interest.
> > If this feature sounds appealing, I would like to contribute it to
> > Solr on behalf of my employer, Commvault. Please let me know if I
> > should create a JIRA and get the discussion rolling!
> >
> > Thanks,
> > Rahul Goswami

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)