Solr streaming expressions may be useful in your case. They can execute "joins" across different SolrCloud instances, and there is also a SQL facade on top of them.
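As a rough sketch of what such a cross-cluster join could look like for the query from your mail ("author:ABC && metadata_one:123"), the snippet below only builds the streaming expression and the request URL; it does not contact Solr. The collection names (webA, webB_metadata), the ZooKeeper addresses, the base URL, and the use of "id" as the shared key are all assumptions made for illustration. Note that innerJoin expects both streams to be sorted on the join key, and the /export handler needs docValues on the requested fields.

```python
from urllib.parse import urlencode


def search_expr(collection, query, fields, zk_host=None, sort_field="id"):
    """Build a search(...) streaming expression for one collection.

    zk_host is only needed when the collection lives in a different
    SolrCloud cluster than the one receiving the request.
    Queries containing double quotes would need escaping (not handled here).
    """
    field_list = ",".join(fields)
    parts = [collection]
    if zk_host:
        parts.append(f'zkHost="{zk_host}"')
    parts.append(f'q="{query}"')
    parts.append(f'fl="{field_list}"')
    parts.append(f'sort="{sort_field} asc"')  # both sides must be sorted on the join key
    parts.append('qt="/export"')              # stream the full result set
    return "search(" + ", ".join(parts) + ")"


def inner_join_expr(left, right, on="id"):
    """Wrap two sorted streams in an innerJoin on a shared key."""
    return f'innerJoin({left}, {right}, on="{on}")'


def stream_url(base_url, collection, expr):
    """URL of the /stream handler with the expression as the expr parameter."""
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})


# Hypothetical names: webA is the original index, webB_metadata holds the
# enrichment fields and lives in a second cluster reachable via zk-b:2181.
original = search_expr("webA", "author:ABC", ["id", "author", "title"])
metadata = search_expr("webB_metadata", "metadata_one:123",
                       ["id", "metadata_one"], zk_host="zk-b:2181")
expr = inner_join_expr(original, metadata, on="id")
url = stream_url("http://localhost:8983/solr", "webA", expr)

print(expr)
print(url)
```

Each joined tuple would then carry fields from both sides (author, title, metadata_one) under the same id, which is roughly the "same document in a response" behaviour you describe.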

Parallel SQL:
https://solr.apache.org/guide/8_6/parallel-sql-interface.html

Lower-level streaming interface:
https://solr.apache.org/guide/8_6/streaming-expressions.html#streaming-expressions

Sorting would only be possible on "common" fields between SolrCloud schemas.

On Thu, Apr 15, 2021 at 1:24 AM Norbert Bodnar <bod...@inqool.cz> wrote:

> Hi.
>
> Hope you are doing well.
>
> I would like to start with introducing the problem we are facing.
>
> Our(let's call us CompanyB working on WebB) primary goal is to create a web
> application, which will expand an existing web(WebA developed by CompanyA)
> application's searching possibilities.
>
> Let's call this existing web application WebA. WebA is deployed on multiple
> clients and every client has its own set of documents. On bigger instances,
> the index is updated almost every minute and the size of the index is
> sometimes even a few hundred GB. WebA was developed by CompanyA and
> provides basic searching capabilities, such as keyword search, faceting on
> authors, classic stuff.
>
> Now comes CompanyB with a project to build WebB, which should be a web
> service built on existing WebA instances. The main goal is to take the
> content of WebA, run some natural language processing on it, obtain
> metadata about documents from external sources and index it, so the new
> WebB will provide advanced searching capabilities on these added metadata.
> WebB must also be able to search on fields from the original index, such as
> author, title, but also be able to search on these new fields, effectively
> combining the results.
>
> Both WebA and WebB are for the same customer, so a bit of collaboration
> between CompanyA and CompanyB is possible, however we should limit the
> amount of work needed on WebA by CompanyA to minimum.
>
> The issue is the following:
> The first idea was that WebB will replicate the index of WebA and has its
> own copy of the index. When new metadata is gathered for a document, it
> will update the document in its own index. However, since both projects are
> for the same customer, doubling the index was a bit of a problem. Also,
> this idea was based on the presumption that the original index is not
> updated very often, which we found out is not true.
>
> The second approach was to use the existing index, so duplicating data on
> two indexes would not be needed. But since the original index is used by
> WebA and worked on by a different company, it would mean a lot of work for
> them and it probably wouldn't be possible. On reindex, they would not have
> access to new metadata that was posted into the index by CompanyB/WebB,
> querying from WebB could affect performance on WebA, WebA would need to
> filter the results to not return fields that were put there by WebB and so
> on...
>
> The third idea is probably the closest to reality, but it also might not
> be. The idea was to create a new shard/node in the existing cluster of
> WebA's Solr, where only the new metadata will be indexed. However, since we
> need to be able to search and return results from both of these nodes
> (queries like "author:ABC && metadata_one:123"), I believe it would also
> not be possible. The documents in the new node containing only additional
> metadata should be represented in the same document in a response, since we
> want the result to contain both fields from original node (author, title,
> ...) and fields from new node (metadata1, namedEntities, ...). The results
> of querying, faceting and sorting should also be somehow combined.
>
> We also considered CDCR, effectively keeping a synchronized copy of the
> original index but the use case for CDCR is a bit different, the target
> cluster should not be updated without updates on the source cluster, which
> is not what we want. We want to have an up-to-date original index enriched
> with additional fields.
>
>
> I hope my explanation is clear enough and I will appreciate your help.
>
> Thank you for your time, and have a nice day :),
> Norbert Bodnar
>