Re: Relevancy debugging - idf score

Walter Underwood Mon, 06 Dec 2021 09:29:28 -0800

When we tried exact IDF, it was about 10X slower in our sharded system, so we 
couldn’t use it.


It is possible to calculate IDF when merging results from shards, with no speed 
penalty. Infoseek was doing that 25 years ago and the patent has expired. You 
return df from each shard, then calculate idf by adding the shard dfs. 
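
A rough sketch of that merge step, in Python, not Infoseek's or Solr's actual code. It assumes each shard reports, per term, both its df (n) and its count of documents with the field (N); the helper names bm25_idf and merged_idf are made up for illustration. The two (n, N) pairs are the narratives:remedi stats from the two explain outputs quoted below. The other two shards' stats are not in this thread, so the merged value here covers only those two shards.

```python
import math


def bm25_idf(n, N):
    """BM25 idf as printed in the explain output:
    log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def merged_idf(shard_stats):
    """Compute one idf from a list of per-shard (n, N) pairs.

    Hypothetical helper: sums df and doc counts across shards,
    then applies the same idf formula once on the totals.
    """
    total_n = sum(n for n, _ in shard_stats)
    total_N = sum(N for _, N in shard_stats)
    return bm25_idf(total_n, total_N)


# (n, N) per shard for narratives:remedi, copied from the explain output.
shards = [(52, 125), (60, 115)]

# Shard-local idf values: these differ, which is the original symptom.
local_idfs = [bm25_idf(n, N) for n, N in shards]

# One idf from the summed stats: the same value for every document.
global_idf = merged_idf(shards)

print(local_idfs, global_idf)
```

The shard-local values reproduce the two different idf numbers in the explain output, while the merged value lies between them and would be applied to hits from either shard.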

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 6, 2021, at 3:08 AM, Alessandro Benedetti <a.benede...@sease.io> wrote:
> 
> Good to know you solved it!
> Yes, distributed IDF is definitely a problem when you have skewed
> document distributions across shards.
> 
> Cheers
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
> 
> www.sease.io
> 
> 
> On Sun, 5 Dec 2021 at 17:19, Sjoerd Smeets <ssme...@gmail.com> wrote:
> 
>> Found it!
>> 
>> I had to enable the ExactStatsCache.
>> 
>> Found a description over here. Thanks for pointing me in the right
>> direction.
>> 
>> https://solr.pl/en/2019/05/20/distributed-idf/
>> 
>> 
>> On Sun, Dec 5, 2021 at 11:09 AM Sjoerd Smeets <ssme...@gmail.com> wrote:
>> 
>>> Hi Alessandro,
>>> 
>>> Thanks for your reply! Yes, the documents are in the same result list and
>>> I'm not doing any indexing at the moment and executed a commit just to be
>>> sure. Still the same result. It is an environment with 4 shards. Perhaps
>>> that plays a factor?
>>> 
>>> Thanks,
>>> Sjoerd
>>> 
>>> On Sun, Dec 5, 2021 at 11:02 AM Alessandro Benedetti <
>>> a.benede...@sease.io> wrote:
>>> 
>>>> It seems like the underlying index changed.
>>>> Are those two documents in the same result set?
>>>> Is it just one query?
>>>> It's definitely curious: even if a commit happened, search results are
>>>> consistent within one searcher.
>>>> 
>>>> 
>>>> On Sun, 5 Dec 2021, 16:28 Sjoerd Smeets, <ssme...@gmail.com> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I'm debugging the relevancy scores of my query and I see the following
>>>>> for two document hits. My question is, why is the idf score not the
>>>>> same for both documents? This is Solr 6.6.
>>>>> 
>>>>> Any guidance would be much appreciated.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> *Doc1*
>>>>> "71d72354eea23b9eae934ab616e8ce38de69d760": "
>>>>> 104.994415 = sum of:
>>>>>  104.994415 = sum of:
>>>>>    82.89969 = weight(stemmed_data.timenote.narratives:remedi in 22470) [SchemaSimilarity], result of:
>>>>>      82.89969 = score(freq=9.0), computed as boost * idf * tf from:
>>>>>        100.0 = boost
>>>>>        0.87546873 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>>          *52 = n, number of documents containing term*
>>>>>          *125 = N, total number of documents with field*
>>>>>        0.9469177 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>>          9.0 = freq, occurrences of term within document
>>>>>          1.2 = k1, term saturation parameter
>>>>>          0.75 = b, length normalization parameter
>>>>>          12312.0 = dl, length of field (approximate)
>>>>>          54179.03 = avgdl, average length of field
>>>>>    22.09473 = weight(stemmed_data.timenote.matters:remedi in 22470) [SchemaSimilarity], result of:
>>>>>      22.09473 = score(freq=4.0), computed as boost * idf * tf from:
>>>>>        10.0 = boost
>>>>>        2.4308395 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>>          *9 = n, number of documents containing term*
>>>>>          *107 = N, total number of documents with field*
>>>>>        0.9089341 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>>          4.0 = freq, occurrences of term within document
>>>>>          1.2 = k1, term saturation parameter
>>>>>          0.75 = b, length normalization parameter
>>>>>          5656.0 = dl, length of field (approximate)
>>>>>          50520.543 = avgdl, average length of field
>>>>>  0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
>>>>>    0.0 = int(s_integer_search.previews)=0
>>>>>    1.0 = boost
>>>>>  0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
>>>>>    0.0 = int(s_integer_search.downloads)=0
>>>>>    1.0 = boost
>>>>> "
>>>>> 
>>>>> *Doc2*
>>>>> "80302a1ecc44d1e556970ab96c25b1fd3328a854": "
>>>>> 84.61461 = sum of:
>>>>>  84.61461 = sum of:
>>>>>    64.68881 = weight(stemmed_data.timenote.narratives:remedi in 0) [SchemaSimilarity], result of:
>>>>>      64.68881 = score(freq=493.0), computed as boost * idf * tf from:
>>>>>        100.0 = boost
>>>>>        0.65094686 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>>          *60 = n, number of documents containing term*
>>>>>          *115 = N, total number of documents with field*
>>>>>        0.99376476 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>>          493.0 = freq, occurrences of term within document
>>>>>          1.2 = k1, term saturation parameter
>>>>>          0.75 = b, length normalization parameter
>>>>>          229400.0 = dl, length of field (approximate)
>>>>>          73913.91 = avgdl, average length of field
>>>>>    19.9258 = weight(stemmed_data.timenote.matters:remedi in 0) [SchemaSimilarity], result of:
>>>>>      19.9258 = score(freq=340.0), computed as boost * idf * tf from:
>>>>>        10.0 = boost
>>>>>        2.0024805 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
>>>>>          *13 = n, number of documents containing term*
>>>>>          *99 = N, total number of documents with field*
>>>>>        0.99505585 = tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
>>>>>          340.0 = freq, occurrences of term within document
>>>>>          1.2 = k1, term saturation parameter
>>>>>          0.75 = b, length normalization parameter
>>>>>          147480.0 = dl, length of field (approximate)
>>>>>          95534.95 = avgdl, average length of field
>>>>>  0.0 = FunctionQuery(int(s_integer_search.previews)), product of:
>>>>>    0.0 = int(s_integer_search.previews)=0
>>>>>    1.0 = boost
>>>>>  0.0 = FunctionQuery(int(s_integer_search.downloads)), product of:
>>>>>    0.0 = int(s_integer_search.downloads)=0
>>>>>    1.0 = boost
>>>>> "
>>>>> 
>>>>
