[jira] [Commented] (LUCENE-8216) Better cross-field scoring

Adrien Grand (JIRA) Mon, 19 Nov 2018 05:50:09 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691717#comment-16691717
 ]


Adrien Grand commented on LUCENE-8216:
--------------------------------------

Woohoo. +1 to start with a dedicated query and then look into folding this into 
similarities instead, adding built-in support for this in query parsers, etc. 
This is targeting sandbox, so I have no problem merging it as-is and iterating 
from there. Some thoughts that I had while reviewing the patch:
 - applyWeight casts to an int; should it keep a float instead? The similarity 
API already allows passing term frequencies as a float. That would avoid the 
issue that you could otherwise end up with a term frequency equal to 0, which 
is illegal.
 - Creating a FilterLeafReader feels a bit heavy given that the only thing 
LeafSimScorer does with it is pull norms. Forking LeafSimScorer might make 
things a bit easier (and later maybe refactoring LeafSimScorer)?
 - It looks like removing `field` from CollectionStatistics would help support 
such a change without having to use fake field names.
 - Maybe advanceExact on merged norms should return `value != 0` rather than 
true all the time? I know it shouldn't be an issue in practice since we only 
get the norm on fields that have a value when scoring, but I would like it 
better if it behaved correctly. nextDoc() looks wrong too: I don't think it 
skips over documents that don't have a value, or returns NO_MORE_DOCS when 
maxDoc is reached?
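
To make the last point concrete, here is a minimal sketch of the iteration 
behavior I have in mind. It is plain Java rather than the patch's code: the 
class name MergedNormsSketch and the array-backed storage are made up for 
illustration, and it sums already-decoded lengths, with 0 standing for "no 
value". Only the NO_MORE_DOCS sentinel value mirrors Lucene.

```java
/** Hypothetical stand-in for a merged-norms iterator; not Lucene's API. */
final class MergedNormsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final long[][] lengthsPerField; // [field][doc], 0 means "no value"
    private final int maxDoc;
    private int doc = -1;
    private long value;

    MergedNormsSketch(int maxDoc, long[]... lengthsPerField) {
        this.maxDoc = maxDoc;
        this.lengthsPerField = lengthsPerField;
    }

    /** Positions on target and reports whether any field has a value there. */
    boolean advanceExact(int target) {
        doc = target;
        value = 0;
        for (long[] lengths : lengthsPerField) {
            value += lengths[target];
        }
        return value != 0;
    }

    /** Skips documents without a value and stops once maxDoc is reached. */
    int nextDoc() {
        if (doc != NO_MORE_DOCS) {
            for (int d = doc + 1; d < maxDoc; d++) {
                if (advanceExact(d)) {
                    return d;
                }
            }
            doc = NO_MORE_DOCS;
        }
        return NO_MORE_DOCS;
    }

    /** Merged value for the current document. */
    long longValue() {
        return value;
    }
}
```

With two fields over four documents where only documents 1 and 3 have a value, 
nextDoc() would return 1, then 3, then NO_MORE_DOCS.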

bq. Norms are also summed per document but since they represent the number of 
unique words we could also take the max.

Hmm, I don't think this is correct. If a field value consists of the same term 
twice then the length of the field will be 2. Maybe you got confused because 
the length discards synonyms, so that if two terms occur at the same position 
then this only adds one to the length by default. Summing up lengths sounds 
like a sensible approach to me.
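
A rough sketch of how the length is counted, assuming tokens are described only 
by their position increments (0 meaning "same position as the previous token", 
as for a synonym). FieldLengthSketch and its methods are hypothetical names 
for illustration, not Lucene classes:

```java
/** Hypothetical illustration of field length counting; not Lucene's API. */
final class FieldLengthSketch {

    /**
     * Counts tokens in a field. When discountOverlaps is true, tokens with a
     * position increment of 0 (e.g. synonyms) do not add to the length.
     */
    static int length(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            if (increment > 0 || !discountOverlaps) {
                length++;
            }
        }
        return length;
    }

    /** Combined length across fields: the sum of the per-field lengths. */
    static int combinedLength(int[]... fieldsPositionIncrements) {
        int sum = 0;
        for (int[] increments : fieldsPositionIncrements) {
            sum += length(increments, true);
        }
        return sum;
    }
}
```

A field containing "foo foo" has increments {1, 1} and length 2 even though it 
has a single unique term, while a field with one synonym stacked on its first 
token has increments {1, 0, 1} and also length 2.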

> Better cross-field scoring
> --------------------------
>
>                 Key: LUCENE-8216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8216
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Major
>             Fix For: master (8.0)
>
>         Attachments: LUCENE-8216.patch
>
>
> I'd like Lucene to have better support for scoring across multiple fields. 
> Today we have BlendedTermQuery which tries to help there but it probably 
> tries to do too much on some aspects (handling cross-field term queries AND 
> synonyms) and too little on other ones (it tries to merge index-level 
> statistics, but not per-document statistics like tf and norm).
> Maybe we could implement something like BM25F so that queries across multiple 
> fields would retain the benefits of BM25 like the fact that the impact of the 
> term frequency saturates quickly, which is not the case with BlendedTermQuery 
> if you have occurrences across many fields.
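
The saturation argument in the description above can be illustrated with a 
small sketch. It keeps only the tf part of BM25, tf / (tf + k1), and leaves out 
idf and length normalization entirely; k1 = 1.2 is the usual default, and the 
class name, method names and weights are assumptions made for the example. It 
also shows why a weighted term frequency is better kept as a float.

```java
/** Hypothetical illustration of tf saturation; idf and norms are omitted. */
final class Bm25fSaturationSketch {
    static final float K1 = 1.2f;

    /** BM25-style saturation of a term frequency; always below 1. */
    static float saturate(float tf) {
        return tf / (tf + K1);
    }

    /** Scores each field separately and adds the scores up. */
    static float sumOfPerFieldScores(int[] termFreqs) {
        float score = 0f;
        for (int tf : termFreqs) {
            score += saturate(tf);
        }
        return score;
    }

    /** BM25F-like: combine weighted term frequencies first, then saturate. */
    static float combinedScore(int[] termFreqs, float[] weights) {
        float tf = 0f;
        for (int i = 0; i < termFreqs.length; i++) {
            tf += weights[i] * termFreqs[i]; // kept as a float, never cast to int
        }
        return saturate(tf);
    }
}
```

With one occurrence in each of five fields, the sum of per-field scores keeps 
growing with the number of fields, while the combined score stays below 1. 
With a weight of 0.5 and a single occurrence, the weighted term frequency is 
0.5, which an int cast would truncate to 0.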



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

