Some good points brought up in the discussion. The implementation we have
reindexes a shard onto itself by reading back all of its documents, while
ensuring that no older-version segment ever merges with a freshly written
segment. This happens with zero downtime and without requiring a large
storage buffer. By the end of the process, you have an index which Solr
identifies as being "created in the newer version".
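
To give a flavor of the segment-merge constraint, here is a rough sketch of
one way it could be enforced with a Lucene FilterMergePolicy. This is
illustrative only (the class name and details are placeholders), not the
exact code in our tool:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.util.Version;

/**
 * Hypothetical wrapper policy: only allows natural merges whose segments
 * were all written by the same Lucene major version, so freshly reindexed
 * segments never absorb documents from old-format segments.
 */
public class SameMajorVersionMergePolicy extends FilterMergePolicy {

  public SameMajorVersionMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                       MergeContext ctx) throws IOException {
    // Ask the wrapped policy (e.g. TieredMergePolicy) for its candidates first
    MergeSpecification proposed = in.findMerges(trigger, infos, ctx);
    if (proposed == null) {
      return null;
    }
    MergeSpecification filtered = new MergeSpecification();
    for (OneMerge merge : proposed.merges) {
      if (isSingleMajorVersion(merge.segments)) {
        filtered.add(merge);
      }
    }
    return filtered.merges.isEmpty() ? null : filtered;
  }

  private boolean isSingleMajorVersion(List<SegmentCommitInfo> segments) {
    int major = -1;
    for (SegmentCommitInfo sci : segments) {
      Version v = sci.info.getVersion(); // version of Lucene that wrote this segment
      if (major == -1) {
        major = v.major;
      } else if (v.major != major) {
        return false; // mixed-version merge: drop it
      }
    }
    return true;
  }
}

(The forced-merge paths would need the same filtering; omitted here for
brevity.)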

We have tested it on 5+ TB indexes and are happy with the results. Some
performance hit to the application is expected, but for us it stays within
acceptable limits. With more input from the community, I am sure we can
polish it further.
The goal is to have at least something that works for a significant portion
of the user base, or failing that, an option that users can weigh against
their individual use cases.

I am working on the design doc to get the discussion started and will share
the JIRA by tomorrow night.

-Rahul

On Mon, Mar 31, 2025 at 1:18 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <
lkotzanie...@bloomberg.net> wrote:

> >> the only thing that makes sense is reindexing from source to a new
> >> cluster that will replace the old cluster
>
> Ideally yes, but there is a social aspect when Solr is managed as a
> service and the many sources are opaque to the team managing it.
> Let's assume for the sake of argument that the below is true or
> achievable:
>
> >> solution has enough extra capacity
>
> I am interested in this:
>
> >> Another case that might make such a thing interesting would be *if* it
> >> was designed to co-locate shards/replicas being reindexed and prevented
> >> the need for over the wire transport (caveats about hashing/routing
> >> changes, etc). That could speed things up significantly, and a process
> >> might look like
>
> Let's assume the shard routing is invariant across versions. If you
> were able to create these upgraded local replicas from their respective
> lower-version source replicas, how easily could you stitch them together
> again into a SolrCloud? If you were cloning from a collection that was
> receiving some live traffic it might be hard, because I imagine you'd
> need to know which replica of a particular shard was most up-to-date
> and ensure that replica became the leader in the new cloud. So would
> this effectively require some kind of special leader election logic,
> or at least some knowledge of the source transaction log as well?
>
> If we assume a pause to live traffic then this becomes simpler but
> then you have the social aspect of coordinating with many teams again.
>
> In our case, we were considering developing a dual-write system, living
> outside of Solr, with a versionField defined to ensure consistent update
> ordering between the two clouds, n and n+m. The actual backfill could
> then be kicked off from some snapshot taken *after* we enabled dual
> writes, and finally the old cloud would be deleted once we routed
> traffic to the new one (and let it "bake"). As Gus points out, at "big
> data" scale the backfill becomes hard, so the idea of making this less
> resource-intensive is enticing...
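>
> (For concreteness, the versionField piece would presumably be wired up via
> Solr's document-centric versioning constraints in the update processor
> chain of both clouds, along these lines; the field name here is just a
> placeholder:
>
>   <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
>     <str name="versionField">source_version_l</str>
>   </processor>
>
> with source_version_l defined in the schema and populated by the external
> dual-write layer.)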
>
>
> From: users@solr.apache.org At: 03/30/25 14:59:16 UTC-4:00 To:
> users@solr.apache.org
> Subject: Re: Automatic upgrade of Solr indexes over multiple versions
>
> Some thoughts:
>
> A lot depends on the use case for this sort of thing. In the case of
> relatively small installs that can afford to run >2x disk and have
> significant query latency headroom this might be useful. However, if a
> company is running a large cluster where maintaining excess capacity costs
> tens of thousands of dollars, they will often be "cutting it close" on
> available storage (I've seen yearly storage costs of over 100k in some places)
> and trying to maintain just enough excess query performance to handle
> "normal" spikes in traffic. Adding the load/disk demands of a re-index
> within the same cluster (making both query and indexing slower) is usually
> a bad idea. Even if you reindex from the existing index into a new,
> separate cluster, the query load needed to pull the data out of the old
> index may place you above acceptable risk thresholds. For large clusters
> the only thing that makes
> sense is reindexing from source to a new cluster that will replace the old
> cluster, because in that way you can (usually) pull the data much faster
> without impacting the users. (Notable exceptions crop up in cases where the
> original source is a live database also used by the users; then some care
> with the query rate is needed again.)
>
> I suppose another use case could be if the cluster is being run on bare
> metal rather than a service like AWS or a much larger virtualization
> environment. In the bare-metal case spinning up new machines for temporary
> use is not an option, but again this only works if the bare-metal solution
> has enough extra capacity.
>
> Another case that might make such a thing interesting would be *if* it was
> designed to co-locate shards/replicas being reindexed and prevented the
> need for over the wire transport (caveats about hashing/routing changes,
> etc). That could speed things up significantly, and a process might look
> like
>
>    1. Upgrade (Solr will still read indexes from version N-1)
>    2. Clone to 2x disk cluster
>    3. Reindex into peer collection (to reset the index version; see the
>       sketch after this list)
>    4. Update alias, delete original collection
>    5. Clone to 1x disk cluster
>    6. Swap and sunset original upgraded cluster.
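>
> (For steps 3 and 4 the stock Collections API already provides the moving
> parts; a rough sketch with placeholder collection/alias names, assuming a
> Solr version that has REINDEXCOLLECTION:
>
>   # reindex the upgraded-but-old-format collection into a peer collection
>   curl "http://localhost:8983/solr/admin/collections?action=REINDEXCOLLECTION&name=coll_old&target=coll_new"
>   # point the serving alias at the new collection, then drop the original
>   curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=coll&collections=coll_new"
>   curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=coll_old"
>
> though a truly co-located, no-network variant would presumably need
> something beyond REINDEXCOLLECTION, which still streams documents through
> the regular query and update paths.)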
>
> If folks have engineered an easy/efficient backup/clone process for steps
> 2 and 5, step 3 could be faster than reindexing from the original sources,
> reducing parallel run time (which could save money in large installs).
>
> Clear documentation of limitations, expected load profiles, throttling,
> etc. would be important in any case. It's important to consider the "Big
> Data" case because, if you are lucky, "Small Data" grows into "Big Data."
> However, the transition can be subtle and can badly trap people if it is
> not anticipated and well thought out.
>
> On Sun, Mar 30, 2025 at 9:21 AM ufuk yılmaz <uyil...@vivaldi.net.invalid>
> wrote:
>
> > I’m guessing this is not simply retrieving all documents through the API
> > using pagination and sending them back for indexing 🤔 About being
> > in-place, how can it work when a new Solr version requires a different
> > schema or config file, because from time to time old definitions don’t
> > work in a new version.
> >
> > -ufuk
> >
> > —
> >
> > > On Mar 30, 2025, at 10:33, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <
> > > lkotzanie...@bloomberg.net> wrote:
> > >
> > > Hi Rahul,
> > >
> > > This sounds very interesting!
> > >
> > > I enjoyed the discussion at CoC and would be very
> > > interested to hear more about the technical details.
> > >
> > > I am also curious to know more about what you mean by "in-place"
> > > and what the expectation is around downtime.
> > >
> > > Either way I am sure this would be a great addition to
> > > the tool belt for getting people to finally move off
> > > ancient versions of Solr.
> > >
> > > Look forward to discussing this more on the JIRA!
> > >
> > > Luke
> > >
> > > From: users@solr.apache.org At: 03/28/25 01:05:57 UTC-4:00 To:
> > > users@solr.apache.org
> > > Subject: Automatic upgrade of Solr indexes over multiple versions
> > >
> > > Today upgrading from Solr version X to X+2 requires complete
> > > reingestion of data from source. This comes from Lucene's constraint
> > > which only guarantees index compatibility between the version the
> > > index was created in and the immediate next version.
> > >
> > >
> > > This reindexing usually comes with added downtime and/or cost.
> > > Especially in the case of deployments which sit in customer
> > > environments and are not completely in the vendor's control, the
> > > proposition of having to completely reindex the data can become a
> > > hard sell.
> > >
> > >
> > > I have developed a way which achieves this reindexing in-place on the
> > > same index. Also, the process automatically keeps "upgrading" the
> > > indexes over multiple subsequent Solr upgrades without needing manual
> > > intervention.
> > >
> > >
> > > It does come with a limitation that all *source* fields need to be
> > > either stored=true or docValues=true. Any copyField destination fields
> > > can be stored=false of course, but as long as the source field (or in
> > > general, the fields you care about preserving) is either stored or
> > > docValues true, the tool can reindex in-place and legitimately
> > > "upgrade" the index. For indexes where this limitation is not a
> > > problem (it wasn't for us!), this tool can remove a lot of operational
> > > headaches, especially in environments with hundreds/thousands of very
> > > large indexes.
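> > >
> > > To make the limitation concrete, a minimal schema sketch (field names
> > > are just examples): the source field must be retrievable, while its
> > > copyField destination may be stored=false since it gets rebuilt during
> > > the in-place reindex:
> > >
> > >   <field name="title" type="text_general" indexed="true" stored="true"/>
> > >   <field name="title_exact" type="string" indexed="true" stored="false"
> > >          docValues="false"/>
> > >   <copyField source="title" dest="title_exact"/>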
> > >
> > >
> > > I had a conversation about this with some of you during "Apache
> > > Community over Code 2024" in Denver, and I could sense some interest.
> > > If this feature sounds appealing, I would like to contribute it to
> > > Solr on behalf of my employer, Commvault. Please let me know if I
> > > should create a JIRA and get the discussion rolling!
> > >
> > >
> > > Thanks,
> > > Rahul Goswami
> > >
> > >
> >
> >
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>
>
>
