RE: Multiply connected data search

Andrew Witt Mon, 30 Dec 2024 09:12:48 -0800

I have a few observations:

First, long experience tells me to avoid field names like "other_ids", because 
what "other" means is context-dependent.  Sure, it's all clear to you now, but 
would it be clear to someone else?  Will it be clear to you a year from now?  
What will you do when there is a third thing -- will you add a field called 
"the_other_other_ids"?  Just name the fields for what they are: give books an 
"author_ids" field and authors a "book_ids" field.  That way, the data schema 
speaks for itself, without needing to reference any logic that operates over it.
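
To make that concrete, here is a minimal sketch in Python of what two such 
documents might look like.  Only "author_ids" and "book_ids" come from the 
suggestion above; the ids, the "type", "title" and "name" fields, and the 
related_ids helper are made up for illustration.

```python
# Hypothetical documents: only author_ids / book_ids are taken from the
# suggestion; every other field name and value is an illustrative assumption.
book = {
    "id": "book-1",
    "type": "book",
    "title": "An Example Title",
    "author_ids": ["author-7", "author-9"],
}
author = {
    "id": "author-7",
    "type": "author",
    "name": "Jane Doe",
    "book_ids": ["book-1"],
}


def related_ids(doc):
    """Return the ids a document points at, read from its explicitly named fields."""
    return doc.get("author_ids", []) + doc.get("book_ids", [])
```

Each field says what it holds, so a reader of the schema can tell which way a 
reference points without reading the code that populates it.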

Second, look into Solr's "nested documents".  We use them to great success 
(although for a somewhat different use case).  Per Solr's nested documents page, 
"indexing the relationships between documents usually yields much faster 
queries than an equivalent 'query time join'", so they might be exactly what 
you are looking for.
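
A rough sketch of the idea in Python, under several assumptions: the 
"authors" child key, the "type_s", "name_s" and "author_id_s" field names, the 
child id scheme, and both helper functions are hypothetical.  Check the nested 
documents page for the schema requirements of your Solr version before 
relying on any of it.

```python
def nest_authors(book, authors):
    """Return a copy of `book` with its authors embedded as child documents.

    Child ids are derived from the parent id so that they stay unique across
    the index (an assumed convention, not something Solr requires in this form).
    """
    nested = dict(book)
    nested["type_s"] = "book"  # lets a query tell parents from children
    nested["authors"] = [
        {
            "id": f"{book['id']}/{a['id']}",
            "type_s": "author",
            "author_id_s": a["id"],
            "name_s": a["name"],
        }
        for a in authors
    ]
    return nested


def books_by_author_name(name):
    """Build a block join query: parent (book) documents with a matching child.

    The `which` filter must match all parent documents and no children.
    `name` is inserted as-is; escaping is left out of this sketch.
    """
    return {"q": '{!parent which="type_s:book"}name_s:"' + name + '"'}
```

One trade-off to keep in mind: with this layout an author's data is repeated 
under every book they contributed to, so a change to an author means 
reindexing each of those books.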

Third, I suspect you are prematurely optimizing.  Don’t be afraid of needing to 
query Solr with a follow-up query.  It's sooooo fast.  As with any engineering 
problem, you should measure before deciding whether or not it's actually a 
problem.  It could very well turn out that a simpler solution which requires a 
primary query and a follow-up query is negligibly slower than a solution that 
takes you weeks or months to design, build, test, and finalize.
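
For what it's worth, the two-query approach can be sketched in a few lines of 
Python.  This assumes the "author_ids" / "book_ids" fields suggested above and 
uses Solr's terms query parser for the follow-up; the helper names are made 
up, and sending the request is left out so the sketch stays independent of any 
client library.

```python
import time


def collect_related_ids(docs):
    """Gather every id referenced by the first query's results, without duplicates."""
    seen = []
    for doc in docs:
        for field in ("author_ids", "book_ids"):
            for ref in doc.get(field, []):
                if ref not in seen:
                    seen.append(ref)
    return seen


def follow_up_params(docs, rows=100):
    """Build request parameters for the follow-up query, or None if nothing is referenced.

    Assumes ids contain no commas, since the terms parser splits on them by default.
    """
    ids = collect_related_ids(docs)
    if not ids:
        return None
    return {"q": "*:*", "fq": "{!terms f=id}" + ",".join(ids), "rows": rows}


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds), for measuring each step."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Wrapping both the primary request and the follow-up request in something like 
timed() gives you actual numbers to compare against a single-query design.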

Hope this helps.

Andrew Witt
Senior Software Engineer II
Learning A-Z, a Cambium Learning® Group Company
andrew.w...@learninga-z.com

-----Original Message-----
From: Nikola Smolenski <smolen...@unilib.rs> 
Sent: Thursday, December 26, 2024 6:17 AM
To: users@solr.apache.org
Subject: Re: Multiply connected data search

I agree, Solr should not be used as the primary data store. However, it would 
still be handy to be able to retrieve as much information in a single query as 
possible.

I am experimenting with a solution where every Solr document has an "otherids"
multivalued field, with books having the ids of all the authors who 
contributed, authors having the ids of all the books they authored, and every 
document including its own id in the list; then, everything can be extracted 
using a single join query.
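
A minimal sketch of how such a query could be built, assuming Solr's join 
query parser with from=otherids and to=id.  The sample documents and both 
helper functions are hypothetical; simulate_join only mimics the join's 
behaviour locally so the result can be inspected without a running Solr.

```python
# Hypothetical index contents: every document lists its own id in otherids.
sample_docs = [
    {"id": "book-1", "title": "Learning Solr", "otherids": ["book-1", "author-7"]},
    {"id": "author-7", "name": "Jane Doe", "otherids": ["author-7", "book-1", "book-2"]},
    {"id": "book-2", "title": "Cooking Basics", "otherids": ["book-2", "author-7"]},
]


def join_params(user_query):
    """Wrap a user query so it returns each matching document plus the documents it links to."""
    return {"q": "{!join from=otherids to=id}" + user_query}


def simulate_join(docs, matches):
    """Mimic the join locally.

    Collect the otherids of every document accepted by `matches`, then return
    the ids of all documents whose own id is among the collected values.
    """
    wanted = {oid for d in docs if matches(d) for oid in d["otherids"]}
    return [d["id"] for d in docs if d["id"] in wanted]
```

Note that the join follows links one hop only: a search matching "book-1" 
brings back "book-1" and "author-7", but not the author's other book.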

Does anyone see any drawbacks to this solution?

On Tue, Dec 24, 2024 at 6:22 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> Do not use Solr as your primary data store. Solr is not a database. 
> Put your data in a relational database where it is easy to track all 
> those relationships and update them correctly.
>
> Extract the needed fields and load them into Solr.
>
> This can be a daily full dump and load job. That is what I did at 
> Chegg with millions of books. That is simple and fast, should be under 
> an hour for the whole job.
>
> An alternative to the all-in-one _text_ field is to use edismax and 
> give different weights to the different fields. Something like this, 
> with higher weighting for phrase matches.
>
> <qf>title^4 authors</qf>
> <pf>title^8 authors^2</pf>
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Dec 23, 2024, at 10:07 PM, Nikola Smolenski <smolen...@unilib.rs> wrote:
> >
> > Thank you for the suggestion, but that wouldn't work because there
> > could be multiple authors with the same name, who differ only by ID.
> > If I were to change the name of an author, I wouldn't know which one
> > I should change and which one should stay. Additionally, there could
> > be additional author information, such as external identifiers, that
> > needs to be connected to the author.
> >
> > On Mon, Dec 23, 2024 at 11:07 PM Dmitri Maziuk 
> > <dmitri.maz...@gmail.com>
> > wrote:
> >
> >> On 12/23/24 15:49, Nikola Smolenski wrote:
> >> ...
> >>> About the only way of doing this I can think of is to perform the
> >>> search, get all the found books and authors, then perform another
> >>> query that fetches all the books and authors referenced by any of
> >>> the books or authors in the first query. Is there a smarter way of
> >>> doing this? What are the best practices?
> >>>
> >>
> >> A book is a "document" that has a title and authors as separate fields.
> >> Documents usually also have a "big search" field, called _text_ in 
> >> the default config.
> >>
> >> Copy both author list and title into _text_, search in _text_, 
> >> facet on authors and/or titles.
> >>
> >> Dima
> >>
> >>
>
>
