Solr streaming expressions may be useful in your case. They can execute joins
across different SolrCloud instances, and Solr also has a SQL facade.

Parallel SQL:
https://solr.apache.org/guide/8_6/parallel-sql-interface.html

Lower-level streaming interface:
https://solr.apache.org/guide/8_6/streaming-expressions.html#streaming-expressions

Sorting would only be possible on fields that are common to both SolrCloud
schemas.
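
As a rough illustration of the join idea, here is a minimal Python sketch that
only builds an innerJoin streaming expression and the URL for the /stream
handler; it does not contact Solr. The collection names (weba_docs,
webb_metadata), the join key (id), the ZooKeeper address (zk-b:2181) and the
base URL are all made up for the example. It assumes both collections share
the same id field, and that the fields used with the /export handler have
docValues enabled in both schemas.

```python
from urllib.parse import urlencode


def search_expr(collection, query, fields, sort_field, zk_host=None):
    """Build a search(...) streaming expression for one collection.

    Assumes `query` contains no double quotes. zk_host is only added
    when the collection lives in a different SolrCloud cluster.
    """
    field_list = ",".join(fields)
    parts = [
        collection,
        f'q="{query}"',
        f'fl="{field_list}"',
        # Both sides of a join must be sorted on the join key.
        f'sort="{sort_field} asc"',
        # /export streams the full result set instead of one page.
        'qt="/export"',
    ]
    if zk_host:
        parts.append(f'zkHost="{zk_host}"')
    return "search(" + ", ".join(parts) + ")"


def inner_join_expr(left, right, key):
    """Wrap two sorted streams in innerJoin(...) on a shared key."""
    return f'innerJoin({left}, {right}, on="{key}")'


def stream_url(base_url, collection, expr):
    """URL for the /stream handler with the expression URL-encoded."""
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})


# Hypothetical names: original index on one cluster, metadata on another.
original = search_expr("weba_docs", "author:ABC",
                       ["id", "author", "title"], "id")
enriched = search_expr("webb_metadata", "metadata_one:123",
                       ["id", "metadata_one"], "id", zk_host="zk-b:2181")
expr = inner_join_expr(original, enriched, "id")
url = stream_url("http://localhost:8983/solr", "weba_docs", expr)

print(expr)
print(url)
```

Each joined tuple would then carry fields from both sides (author, title and
metadata_one in this sketch), which is why the ordering is limited to the
field the two streams share.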



On Thu, Apr 15, 2021 at 1:24 AM Norbert Bodnar <bod...@inqool.cz> wrote:

> Hi.
>
> Hope you are doing well.
>
> I would like to start with introducing the problem we are facing.
>
> Our primary goal (let's call us CompanyB, working on WebB) is to create a
> web application that will expand the search capabilities of an existing web
> application (WebA, developed by CompanyA).
>
> Let's call this existing web application WebA. WebA is deployed on multiple
> clients and every client has its own set of documents. On bigger instances,
> the index is updated almost every minute and the size of the index is
> sometimes even a few hundred GB. WebA was developed by CompanyA and
> provides basic searching capabilities, such as keyword search, faceting on
> authors, classic stuff.
>
> Now comes CompanyB with a project to build WebB, which should be a web
> service built on existing WebA instances. The main goal is to take the
> content of WebA, run some natural language processing on it, obtain
> metadata about documents from external sources and index it, so that the new
> WebB will provide advanced searching capabilities on this added metadata.
> WebB must also be able to search on fields from the original index, such as
> author, title, but also be able to search on these new fields, effectively
> combining the results.
>
> Both WebA and WebB are for the same customer, so a bit of collaboration
> between CompanyA and CompanyB is possible. However, we should limit the
> amount of work needed on WebA by CompanyA to a minimum.
>
> The issue is the following:
> The first idea was that WebB would replicate the index of WebA and keep its
> own copy of the index. When new metadata is gathered for a document, it
> will update the document in its own index. However, since both projects are
> for the same customer, doubling the index was a bit of a problem. Also,
> this idea was based on the presumption that the original index is not
> updated very often, which we found out is not true.
>
> The second approach was to use the existing index, so duplicating data on
> two indexes would not be needed. But since the original index is used by
> WebA and worked on by a different company, it would mean a lot of work for
> them and it probably wouldn't be possible. On reindex, they would not have
> access to new metadata that was posted into the index by CompanyB/WebB,
> querying from WebB could affect performance on WebA, WebA would need to
> filter the results to not return fields that were put there by WebB and so
> on...
>
> The third idea is probably the closest to reality, but it also might not
> be. The idea was to create a new shard/node in the existing cluster of
> WebA's Solr, where only the new metadata will be indexed. However, since we
> need to be able to search and return results from both of these nodes
> (queries like "author:ABC && metadata_one:123"), I believe it would also
> not be possible. The documents in the new node containing only additional
> metadata should be represented in the same document in a response, since we
> want the result to contain both fields from original node (author, title,
> ...) and fields from new node (metadata1, namedEntities, ...). The results
> of querying, faceting and sorting should also be somehow combined.
>
> We also considered CDCR, effectively keeping a synchronized copy of the
> original index, but the use case for CDCR is a bit different: the target
> cluster should not be updated without updates on the source cluster, which
> is not what we want. We want to have an up-to-date original index enriched
> with additional fields.
>
>
> I hope my explanation is clear enough and I will appreciate your help.
>
> Thank you for your time, and have a nice day :),
> Norbert Bodnar
>
