Re: Multiply connected data search

Walter Underwood Sat, 04 Jan 2025 12:53:54 -0800

The elegant solution in Solr is a flat schema. You are not doing “database 
design”. You are doing search schema design. They are very different.


Do. Not. Do. Joins.

List the fields which need to be searched. Make those indexed. For a small 
amount of data (under a million docs), it can help debugging if they are also 
stored.

List the fields that need to be displayed. Make those stored, but not indexed.
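
As a rough sketch of the two lists above, the field definitions could be built as a Solr Schema API "add-field" payload. The field names (title, authors, cover_url) and the helper function are made up for illustration and are not from the original poster's schema.

```python
import json


def field_def(name, searched, displayed, multi_valued=False,
              field_type="text_general"):
    """Build one Schema API add-field entry (hypothetical helper).

    searched  -> indexed: the field can be queried
    displayed -> stored:  the field can be returned in results
    """
    return {
        "name": name,
        "type": field_type,
        "indexed": searched,
        "stored": displayed,
        "multiValued": multi_valued,
    }


# Hypothetical fields for a flat book index.
FIELDS = [
    # Searched, and also stored because the index is small (helps debugging).
    field_def("title", searched=True, displayed=True),
    field_def("authors", searched=True, displayed=True, multi_valued=True),
    # Display only: stored, but not indexed.
    field_def("cover_url", searched=False, displayed=True,
              field_type="string"),
]


def schema_payload(fields):
    """Serialize the definitions as the JSON body for a POST to the schema endpoint."""
    return json.dumps({"add-field": fields}, indent=2)


if __name__ == "__main__":
    print(schema_payload(FIELDS))
```

The printed JSON would be sent to the collection's schema endpoint. The sketch only builds the payload and does not talk to Solr.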

When linked data in the source database is updated, reload the documents 
into Solr. Up to maybe 5-10 million documents, it is quite reasonable to reload 
everything once per day. Your book data will not change that frequently.
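
A minimal sketch of the flattening step in such a reload job, assuming the source database has books, authors, and a book-to-author link table. The row shapes, field names, and sample data here are invented for illustration. Each book becomes one flat document carrying its authors' names and ids, so no join is needed at query time.

```python
from collections import defaultdict


def flatten_books(books, authors, book_authors):
    """Denormalize relational rows into one flat document per book."""
    author_name = {a["id"]: a["name"] for a in authors}

    # Group author ids by book, preserving link-table order.
    by_book = defaultdict(list)
    for link in book_authors:
        by_book[link["book_id"]].append(link["author_id"])

    docs = []
    for book in books:
        ids = by_book[book["id"]]
        docs.append({
            "id": f"book-{book['id']}",
            "title": book["title"],
            # Names for searching and display.
            "authors": [author_name[i] for i in ids],
            # Ids tell apart authors who share a name.
            "author_ids": [str(i) for i in ids],
        })
    return docs


# Made-up sample rows: two different authors share the same name.
AUTHORS = [
    {"id": 1, "name": "J. Smith"},
    {"id": 2, "name": "J. Smith"},
    {"id": 3, "name": "A. Jones"},
]
BOOKS = [
    {"id": 10, "title": "First Book"},
    {"id": 11, "title": "Second Book"},
]
BOOK_AUTHORS = [
    {"book_id": 10, "author_id": 1},
    {"book_id": 10, "author_id": 3},
    {"book_id": 11, "author_id": 2},
]

if __name__ == "__main__":
    for doc in flatten_books(BOOKS, AUTHORS, BOOK_AUTHORS):
        print(doc)
```

The resulting list could then be sent to Solr's JSON update handler as part of the daily job. That step is left out of the sketch.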

I ran search for Netflix, which is not that different from searching books. I 
also ran search for Chegg, searching textbooks.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 4, 2025, at 1:46 PM, Nikola Smolenski <smolen...@unilib.rs> wrote:
> 
> On Mon, Dec 30, 2024 at 6:13 PM Andrew Witt <andrew.w...@learninga-z.com>
> wrote:
> 
>> First, long experience tells me to avoid field names like "other_ids",
>> because what "other" means is context-dependent.  Sure, it's all clear to
>> you now, but would it be clear to someone else?  Will it be clear to you a
>> year from now?  What will you do when there is a third thing -- will you
>> add a field called "the_other_other_ids"?  Just name the fields for what
>> they are : give books an "author_ids" field and authors a "book_ids"
>> field.  That way, the data schema speaks for itself, without needing to
>> reference any logic that operates over it.
>> 
> 
> I agree, this is obviously much better from the database design point of
> view, but it can't be used with the join query parser, since it can only
> take one field as the from join field. Perhaps the join query parser could
> be updated so that it supports multiple fields as the from join fields, in
> which case this would be the preferable approach.
> 
> 
>> Second, look into Solr's "nested documents".  We use them to great success
>> (although for a somewhat different use case).  Per Solr's nested documents
>> page, "indexing the relationships between documents usually yields much
>> faster queries than an equivalent 'query time join'", so they might be
>> exactly what you are looking for.
>> 
> 
> Yes, this would make sense, except in this case, as I said, one book can
> have multiple authors, and one author could author multiple books, so
> nested documents can't really be used.
> 
> 
>> Third, I suspect you are prematurely optimizing.  Don’t be afraid of
>> needing to query Solr with a follow-up query.  It's sooooo fast.  As with
>> any engineering problem, you should measure before deciding whether or not
>> it's actually a problem.  It could very well turn out that a simpler
>> solution which requires a primary query and a follow-up query is negligibly
>> slower than a solution that takes you weeks or months to design, build,
>> test, and finalize.
>> 
> 
> I am in fact searching for a more elegant solution. I believe I have found
> it, but I haven't checked whether it's faster also. I believe it should be
> faster, in principle, since Solr is just doing internally what I have to do
> externally anyway. A more elegant solution should also be easier to build
> and improve upon.
> 
> -----Original Message-----
>> From: Nikola Smolenski <smolen...@unilib.rs>
>> Sent: Thursday, December 26, 2024 6:17 AM
>> To: users@solr.apache.org
>> Subject: Re: Multiply connected data search
>> 
>> I agree, solr should not be used as the primary data store. However, it
>> would still be handy to be able to retrieve as much information in a single
>> query as possible.
>> 
>> I am experimenting with a solution where every solr document has "otherids"
>> multivalued field, with books having the ids of all the authors who
>> contributed, authors having the ids of all the books they authored, and
>> every document including its own id in the list; then, everything can be
>> extracted using a single join query.
>> 
>> Does anyone see any drawbacks to this solution?
>> 
>> On Tue, Dec 24, 2024 at 6:22 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>> 
>>> Do not use Solr as your primary data store. Solr is not a database.
>>> Put your data in a relational database where it is easy to track all
>>> those relationships and update them correctly.
>>> 
>>> Extract the needed fields and load them into Solr.
>>> 
>>> This can be a daily full dump and load job. That is what I did at
>>> Chegg with millions of books. That is simple and fast, should be under
>>> an hour for the whole job.
>>> 
>>> An alternative to the all-in-one _text_ field is to use edismax and
>>> give different weights to the different fields. Something like this,
>>> with higher weighting for phrase matches.
>>> 
>>> <qf>title^4 authors</qf>
>>> <pf>title^8 authors^2</pf>
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Dec 23, 2024, at 10:07 PM, Nikola Smolenski <smolen...@unilib.rs> wrote:
>>>> 
>>>> Thank you for the suggestion, but that wouldn't work because there
>>>> could be multiple authors with the same name, who differ only by ID.
>>>> If I were to change the name of an author, I wouldn't know which one
>>>> I should change and which one should stay. Additionally, there could
>>>> be additional author information, such as external identifiers, that
>>>> needs to be connected to the author.
>>>> 
>>>> On Mon, Dec 23, 2024 at 11:07 PM Dmitri Maziuk
>>>> <dmitri.maz...@gmail.com>
>>>> wrote:
>>>> 
>>>>> On 12/23/24 15:49, Nikola Smolenski wrote:
>>>>> ...
>>>>>> About the only way of doing this I can think of is to perform the
>>>>>> search, get all the found books and authors, then perform another
>>>>>> query that fetches all the books and authors referenced by any of
>>>>>> books or authors in the first query. Is there a smarter way of
>>>>>> doing this? What are the best practices?
>>>>>> 
>>>>> 
>>>>> A book is a "document" that has a title and authors as separate
>>>>> fields.
>>>>> Documents usually also have a "big search" field, called _text_ in
>>>>> the default config.
>>>>> 
>>>>> Copy both author list and title into _text_, search in _text_,
>>>>> facet on authors and/or titles.
>>>>> 
>>>>> Dima
>>>>> 
>>>>> 
>>> 
>>> 
>>
