Hi Jason,

Thanks a lot for your detailed reply and for sharing the implementation idea on "update batching". Very much appreciated!

The note about concurrent updates in CloudSolrClient is indeed interesting. I'll try to find some time to look into it and will let you know if I make any progress.

Thanks again,

Markos

On 5/5/25 21:06, Jason Gerlowski wrote:
Hi Markos,

I'll answer the easiest question first. The "requestAsync" method is
relatively new to our SolrJ API.  I don't know of any concrete plans,
but I would expect it to be added to more client implementations over
time (and ultimately end up on the SolrClient interface).

Update batching is a different story though.  CloudSolrClient and
ConcurrentUpdateSolrClient offer two fundamentally different
approaches to speeding up update requests.  As you know, the
"Concurrent" client adds documents to a queue internally and streams
them to a single endpoint using batching where possible.  The Cloud
client on the other hand figures out which documents belong to each
shard, and routes documents directly to that shard's leader.  It may
be possible to reconcile those approaches in the future, and have the
"Cloud" client use both optimizations, but I haven't seen much
discussion of that so IMO it's unlikely that you'll see this in an
upcoming release.  It'd be a great improvement to have though, so if
you have any interest in contributing I'm more than happy to review
and do what I can to help move this forward.  Let me know!
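
For anyone skimming along, here is a rough, untested sketch of how the two
clients are typically used from the calling side today; the URLs, collection
name and tuning values are just placeholders:

    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ClientComparison {
      public static void main(String[] args) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        // ConcurrentUpdateSolrClient: adds go into an internal queue and
        // background threads stream them to a single base URL in batches.
        try (ConcurrentUpdateSolrClient concurrent =
            new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr")
                .withQueueSize(1000)
                .withThreadCount(4)
                .build()) {
          concurrent.add("mycollection", doc); // returns quickly; batching happens behind the scenes
          concurrent.blockUntilFinished();     // drain the queue before closing
        }

        // CloudHttp2SolrClient: consults cluster state and routes each document
        // to the leader of the shard it hashes to; no internal queueing/batching.
        try (CloudHttp2SolrClient cloud =
            new CloudHttp2SolrClient.Builder(List.of("http://localhost:8983/solr")).build()) {
          cloud.add("mycollection", doc);
          cloud.commit("mycollection");
        }
      }
    }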

In terms of the workarounds you suggested above, I can't suggest any
improvements to your asynchronous-query flow.

But on the "update batching" side, you might see most of the benefits
of the ConcurrentUpdate client if you can find a way to prevent
users from calling the single-document update API that SolrClients
offer, i.e. `SolrClient.update(SolrInputDocument)`.  The simplest way
to do that might be to create a trivial CloudHttp2SolrClient subclass
that overrides `update(SolrInputDocument)` to throw an
UnsupportedOperationException or some other relevant error.  That'd
nudge other "stormcrawler" devs towards using the batch-update method
that's more similar to what the Concurrent client does under the hood.
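
Something along these lines, as an untested sketch: note that on the current
SolrClient API the single-document entry points are the add(...) overloads, so
those are what I'd override, and this assumes CloudHttp2SolrClient's protected
Builder-based constructor is reachable from a subclass.

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    /**
     * CloudHttp2SolrClient that rejects single-document updates so callers
     * are nudged towards the collection-based (batched) add(...) overloads.
     */
    public class BatchOnlyCloudSolrClient extends CloudHttp2SolrClient {

      public BatchOnlyCloudSolrClient(CloudHttp2SolrClient.Builder builder) {
        super(builder); // assumes the protected constructor is reachable from subclasses
      }

      @Override
      public UpdateResponse add(SolrInputDocument doc)
          throws SolrServerException, IOException {
        throw new UnsupportedOperationException(
            "Single-document updates are disabled; buffer documents and use add(Collection) instead");
      }

      @Override
      public UpdateResponse add(String collection, SolrInputDocument doc)
          throws SolrServerException, IOException {
        throw new UnsupportedOperationException(
            "Single-document updates are disabled; buffer documents and use add(collection, Collection) instead");
      }

      // There are a couple more single-document overloads (e.g. with a
      // commitWithin parameter) that you may want to block the same way.
    }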

Good luck,

Jason

On Sun, Apr 27, 2025 at 9:38 AM Markos Volikas <mvoli...@apache.org> wrote:
Hi all,

I've been working on a feature for Apache StormCrawler
(Incubating) (https://github.com/apache/incubator-stormcrawler/pull/1488),
where we would like to be able to

  1. Use CloudHttp2SolrClient
     <https://solr.apache.org/docs/9_8_0/solrj/org/apache/solr/client/solrj/impl/CloudHttp2SolrClient.html>
     to communicate with a Solr Cloud cluster
  2. Send asynchronous query requests as one can do with Http2SolrClient#requestAsync
     <https://solr.apache.org/docs/9_8_0/solrj/org/apache/solr/client/solrj/impl/Http2SolrClient.html#requestAsync(org.apache.solr.client.solrj.SolrRequest,java.lang.String)>
  3. Send batched updates like one can do with ConcurrentUpdateSolrClient
     <https://solr.apache.org/docs/9_8_0/solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.html>

From what I found, neither (2) nor (3) can be done out of the box, so I
tried the following workarounds instead:

For asynchronous query requests:

   * Get the wrapped LBHttp2SolrClient of the CloudHttp2SolrClient.
   * Get the active Solr endpoints from the cluster state.
   * Shuffle the endpoints for basic load balancing.
        o From LBHttp2SolrClient#requestAsync:
         Execute an asynchronous request against one or more hosts for a
         given collection. The passed-in Req object includes a List of
         Endpoints. This method always begins with the first Endpoint in
         the list and if unsuccessful tries each in turn until the
         request is successful. Consequently, this method does not
         actually Load Balance. It is up to the caller to shuffle the
         List of Endpoints if Load Balancing is desired.
   * Make the LBHttp2SolrClient#requestAsync call

     Here is how I have implemented the above in code:
     <https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java#L66-L96>
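
     For completeness, here is a condensed, untested sketch of the same flow.
     The endpoint discovery is simplified to a plain list of base URLs, and the
     getLbClient / LBSolrClient.Endpoint / Req / requestAsync names follow my
     reading of the 9.8 javadocs, so they may need adjusting; the linked
     SolrConnection code is the authoritative version.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
    import org.apache.solr.client.solrj.impl.LBHttp2SolrClient;
    import org.apache.solr.client.solrj.impl.LBSolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.SolrParams;

    public class AsyncCloudQuery {

      /** Runs a query asynchronously against a shuffled list of live Solr base URLs. */
      public static void queryAsync(
          CloudHttp2SolrClient cloudClient, List<String> liveBaseUrls, SolrParams query) {

        // 1. Reuse the LB client wrapped by the Cloud client.
        LBHttp2SolrClient lbClient = (LBHttp2SolrClient) cloudClient.getLbClient();

        // 2./3. Wrap the live base URLs in Endpoints and shuffle them, because
        //       requestAsync tries them in order and does not load balance itself.
        List<LBSolrClient.Endpoint> endpoints = new ArrayList<>();
        for (String baseUrl : liveBaseUrls) {
          endpoints.add(new LBSolrClient.Endpoint(baseUrl));
        }
        Collections.shuffle(endpoints);

        // 4. Fire the asynchronous request; attach a callback or block on the
        //    returned future depending on what the caller needs.
        SolrRequest<?> request = new QueryRequest(query);
        var future = lbClient.requestAsync(new LBSolrClient.Req(request, endpoints));
      }
    }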

For batching updates, however, the only alternative I can think of is
implementing the batching manually, but this seems convoluted and
probably goes against the design of CloudSolrClient.
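
For concreteness, the kind of manual batching I have in mind is roughly the
following (batch size and collection name are arbitrary, and a time-based
flush is left out):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrInputDocument;

    /** Buffers documents and sends them to Solr in batches instead of one by one. */
    public class ManualBatcher {
      private final SolrClient client;
      private final String collection;
      private final int batchSize;
      private final List<SolrInputDocument> buffer = new ArrayList<>();

      public ManualBatcher(SolrClient client, String collection, int batchSize) {
        this.client = client;
        this.collection = collection;
        this.batchSize = batchSize;
      }

      public synchronized void add(SolrInputDocument doc)
          throws SolrServerException, IOException {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
          flush();
        }
      }

      public synchronized void flush() throws SolrServerException, IOException {
        if (!buffer.isEmpty()) {
          client.add(collection, buffer); // one update request for the whole batch
          buffer.clear();
        }
      }
    }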

Is there any plan to include asynchronous requests and/or batched
updates in CloudHttp2SolrClient in future Solr releases?

Do you have any suggestions on the alternatives I described above?

Thanks a lot in advance,

Markos Volikas
