Re: Solr Collections Join

Norbert Kutasi Mon, 14 Mar 2022 15:24:14 -0700

Hi Venkat,

We are on Solr 8.5.2, and it is actually possible to join two or more
collections and get fields from each of them, similar to what a LEFT OUTER
JOIN does in an RDBMS. It creates nested documents rather than merging
attributes, which might not be what you are looking for.

If you have already tried the *subquery* document transformer:
https://solr.apache.org/guide/8_5/transforming-result-documents.html#subquery
you can skip the rest of my message :)

In your question you first referred to !join, which is actually a query
parser that lets you filter Collection A based on documents in
Collection B.
It's quite well explained here:
https://solr.apache.org/guide/8_5/other-parsers.html#join-query-parser

We rely on this as well as part of our Access Control List implementation.

Here is a sample JSON POST against another collection, e.g. Clients, where
we enforce that Users listed in the collection called ACL can only retrieve
the Clients of their own Teams.

/Clients/query
{
  "query" : "*:*",
  "filter" : ["{!join from=teamId fromIndex=ACL to=teamId}userId:ABC123"],
  "fields" : "*",
  "sort"   : "id desc",
  "limit"  : "10"
}
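
For anyone scripting this, here is a minimal Python sketch of the same
request. The base URL and the helper names are assumptions for
illustration; only the collection names, the field names and the join
filter come from the sample above. Note that the user id is concatenated
as-is, so it would need escaping if it can contain query syntax.

```python
import json
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # assumption: adjust to your cluster


def build_acl_request(user_id, limit=10):
    """Build the JSON body that restricts Clients to the user's own teams."""
    # The join runs userId:<user_id> against ACL, collects the teamId values,
    # and keeps only the Clients documents whose teamId matches one of them.
    join_filter = "{!join from=teamId fromIndex=ACL to=teamId}userId:" + user_id
    return {
        "query": "*:*",
        "filter": [join_filter],
        "fields": "*",
        "sort": "id desc",
        "limit": limit,
    }


def query_clients(user_id):
    """POST the request to /Clients/query and return the parsed response."""
    body = json.dumps(build_acl_request(user_id)).encode("utf-8")
    req = urllib.request.Request(
        SOLR_BASE + "/Clients/query",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```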

But this is not what you actually asked about.

When we came across the requirement of cross-collection data retrieval, we
tried to implement it on the Solr side rather than in our custom API layer.

Let's take three collections (Clients, Managers, Products) and see how we
can use the *subquery* transformer.

Constraints:

   - One core of the collection on the "to" side (i.e. B in A->B) has to
   exist on the node you query: that collection must have a single shard and
   a replica on every Solr node where the querying collection (A) lives. At
   least this is what we experienced on 8.5.
   - In the subquery syntax, the reference to Collection B actually
   corresponds to the core name and not the collection name. In SolrCloud
   the cores are named differently from replica to replica, e.g.
   Managers_shard1_replica_n1, Managers_shard1_replica_n2, etc. You have to
   assign them a generic name.

Here we want to retrieve Clients together with their Managers and the
details of the Products that the Clients purchased.

Clients
   Managers
   Products

/Clients/query
{
  "query": "*:*",
  "filter": ["region:NA"],
  "fields": "*,managerDetails:[subquery fromIndex=Managers],productDetails:[subquery fromIndex=Products]",
  "limit": 100,
  "offset": 0,
  "sort": "id asc",
  "params": {
    "managerDetails.fl": "*",
    "managerDetails.q": "*",
    "managerDetails.fq": "{!term f=managerList v=$row.managerId}",
    "managerDetails.rows": 10,
    "productDetails.fl": "*",
    "productDetails.q": "*",
    "productDetails.fq": "{!term f=productPurchasedList v=$row.productId}",
    "productDetails.rows": 10
  }
}

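Each subquery follows the same pattern: a named transformer in "fields"
plus a group of parameters prefixed with that name. A rough Python sketch of
that pattern is below; the helper names are hypothetical, and the field and
core names are the ones from the request above.

```python
def subquery_params(name, match_field, row_field, rows=10, fl="*"):
    """Return the request params for one [subquery] transformer called `name`."""
    return {
        name + ".fl": fl,
        name + ".q": "*",
        # $row.<field> refers to that field's value in the parent document.
        name + ".fq": "{!term f=%s v=$row.%s}" % (match_field, row_field),
        name + ".rows": rows,
    }


def build_clients_request():
    """Clients in region NA, each with managerDetails and productDetails."""
    params = {}
    params.update(subquery_params("managerDetails", "managerList", "managerId"))
    params.update(
        subquery_params("productDetails", "productPurchasedList", "productId")
    )
    return {
        "query": "*:*",
        "filter": ["region:NA"],
        "fields": "*,managerDetails:[subquery fromIndex=Managers],"
                  "productDetails:[subquery fromIndex=Products]",
        "limit": 100,
        "offset": 0,
        "sort": "id asc",
        "params": params,
    }
```
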
For this to work you need to rename the Managers and Products cores to
generic names.

https://solr.apache.org/guide/8_5/coreadmin-api.html#coreadmin-rename
admin/cores?action=RENAME&core=Managers_shard1_replica_n1&other=Managers
admin/cores?action=RENAME&core=Products_shard1_replica_n1&other=Products
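
Since the rename has to be repeated on every node, it may be easier to
generate the URLs. A small Python sketch, assuming a node URL of
http://localhost:8983/solr (an assumption, as is the helper name); the
actual replica core names differ per node and have to be looked up first.

```python
from urllib.parse import urlencode

SOLR_NODE = "http://localhost:8983/solr"  # assumption: one URL per Solr node


def rename_core_url(node_url, core, new_name):
    """Build the CoreAdmin RENAME URL for one core on one node."""
    query = urlencode({"action": "RENAME", "core": core, "other": new_name})
    return node_url + "/admin/cores?" + query


# Core names as in the example above; they vary from replica to replica.
for core, name in [
    ("Managers_shard1_replica_n1", "Managers"),
    ("Products_shard1_replica_n1", "Products"),
]:
    print(rename_core_url(SOLR_NODE, core, name))
```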

Managers and Products have to exist on the node that you hit with
/Clients/query.

It's also possible to create deeply nested documents, where the third level
may capture the Products that the Managers are (in general) responsible for.

Clients
   Managers
      Products

/Clients/query
{
  "query": "*:*",
  "filter": ["region:NA"],
  "fields": "*,managerDetails:[subquery fromIndex=Managers]",
  "limit": 100,
  "offset": 0,
  "sort": "id asc",
  "params": {
    "managerDetails.fl": "*,productDetails:[subquery fromIndex=Products]",
    "managerDetails.q": "*",
    "managerDetails.fq": "{!term f=managerList v=$row.managerId}",
    "managerDetails.rows": 10,
    "managerDetails.productDetails.fl": "*",
    "managerDetails.productDetails.q": "*",
    "managerDetails.productDetails.fq": "{!term f=productIdExpertise v=$row.productId}",
    "managerDetails.productDetails.rows": 10
  }
}

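The thing to notice in the nested case is how the parameter prefixes stack:
the inner subquery is declared in the outer subquery's "fl", and its
parameters carry both names. A short Python sketch of that composition (the
nest helper is hypothetical; the values are the ones from the request above):

```python
def nest(prefix, params):
    """Prefix every parameter name so it applies to the subquery `prefix`."""
    return {prefix + "." + key: value for key, value in params.items()}


product_details = {
    "fl": "*",
    "q": "*",
    "fq": "{!term f=productIdExpertise v=$row.productId}",
    "rows": 10,
}

manager_details = {
    # The inner subquery is declared in the outer subquery's field list.
    "fl": "*,productDetails:[subquery fromIndex=Products]",
    "q": "*",
    "fq": "{!term f=managerList v=$row.managerId}",
    "rows": 10,
}
# Third level first: productDetails.* joins the managerDetails parameters.
manager_details.update(nest("productDetails", product_details))

# Second level: everything gets the managerDetails prefix, which yields
# keys such as "managerDetails.productDetails.fq".
params = nest("managerDetails", manager_details)
```
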
Using "fl" you can narrow down the attributes you want to bring in from the
other collections.

Subquery also works within a single collection, to generate arbitrary
document hierarchies.

Using the example above, when a single collection hosts Clients, Products
and Managers type documents, you can omit the fromIndex part.

The reason we decided to employ *subquery* was to provide the utmost
flexibility, including hosting these collections as separate endpoints, and
to avoid enforcing modelling constraints such as a parent/child hierarchy
early on.
We started out with a single collection of 60 million documents. After
noticing some scalability issues when returning a high number of objects
(like 2,000 Clients), we learnt that breaking it into discrete collections
would provide better performance.

Regards,
Norbert

On Mon, 14 Mar 2022 at 14:06, Mikhail Khludnev <m...@apache.org> wrote:

> Hi, Venkat.
> No way. Sorry.
>
> On Fri, Mar 11, 2022 at 4:59 PM Venkateswarlu Bommineni <bvr...@gmail.com>
> wrote:
>
> > Yes, it is solrcloud setup.
> >
> > On Fri, Mar 11, 2022 at 12:57 AM Srijan <shree...@gmail.com> wrote:
> >
> > > Is this a SolrCloud setup?
> > >
> > > On Thu, Mar 10, 2022, 22:25 Venkateswarlu Bommineni <bvr...@gmail.com>
> > > wrote:
> > >
> > > > Hello All,
> > > >
> > > > I have a requirement to join 2 collections and get fields from both
> > > > the collections.
> > > >
> > > > I have got the join query as below. When I run the join query below,
> > > > I am getting the fields of Collection1 only.
> > > >
> > > > Is there any way I can get the fields from collection2 as well?
> > > >
> > > > Running below query on Collection1.
> > > > {!join method="crossCollection" fromIndex="collection2" from="id" to="id" v="*:*"}
> > > >
> > > >
> > > > Any help here is much appreciated !!
> > > >
> > > > Thanks,
> > > > Venkat.
> > > >
> > >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
