Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-13 Thread danield

Hi all,

I have found, much to my dismay, that the documentation on Lucene’s default
similarity formula is dangerously misleading. See it here:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf

Term Frequency (TF) counts are expected to be per-document in the IR
literature, and this documentation doesn’t say any differently. However, it
turns out that for Lucene, TF scores are in fact PER-FIELD.

This furthermore applies to the /coord/ component. I realise that /coord/ is
a ratio of query terms matched over total query terms, but I believe an
effort could be made to make clear that field1:term1 and field2:term1 count
as 2 different query terms. 

As an example, for 2 documents with fields field1 and field2, where 
query1=”field1:term1”
query2=”field1:term1 or field2:term1”

document1={field1:”term1 term1”, field2:””}
document2={field2:”term1”, field2:”term1”}

Coord(query1,document1)= 1/1 = 1
Coord(query2,document1)= 1/2 = 0.5
Coord(query1,document2)= 1/2 = 0.5
Coord(query2,document2)= 2/2 = 1

Now, the TF scores will be normalized with the fieldNorm component which is
computed based on field length at indexing time and stored in a single byte,
with a significant loss of precision. These things together make it
impossible to run Lucene retrieval in such a way that 

*similarity(query2,document1) == similarity(query2,document2)*

which is precisely what I need in my use case.
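
To make the asymmetry concrete, here is a minimal sketch in Python (not Lucene code). It assumes the default tf = sqrt(freq) and lengthNorm = 1/sqrt(number of terms in the field), leaves out idf, queryNorm, boosts and the single-byte norm encoding, and takes document2 to hold term1 once in field1 and once in field2. The function names per_field_score and pooled_score are made up for illustration; pooled_score is only one possible field-agnostic alternative, not something Lucene provides.

```python
import math

def per_field_score(query_clauses, document):
    """Simplified per-field scoring: each (field, term) pair is its own clause.

    Assumption: tf = sqrt(freq), norm = 1/sqrt(field length); idf, queryNorm
    and the lossy byte encoding of the norm are deliberately left out.
    """
    matched = 0
    total = 0.0
    for field, term in query_clauses:
        tokens = document.get(field, "").split()
        freq = tokens.count(term)
        if freq == 0:
            continue
        matched += 1
        total += math.sqrt(freq) * (1.0 / math.sqrt(len(tokens)))
    coord = matched / len(query_clauses)
    return coord * total

def pooled_score(query_clauses, document):
    """Hypothetical field-agnostic variant: pool all fields before counting."""
    terms = {term for _, term in query_clauses}
    tokens = [tok for text in document.values() for tok in text.split()]
    if not tokens:
        return 0.0
    norm = 1.0 / math.sqrt(len(tokens))
    matched = [term for term in terms if term in tokens]
    total = sum(math.sqrt(tokens.count(term)) * norm for term in matched)
    return (len(matched) / len(terms)) * total

query2 = [("field1", "term1"), ("field2", "term1")]
document1 = {"field1": "term1 term1", "field2": ""}
document2 = {"field1": "term1", "field2": "term1"}

print(per_field_score(query2, document1))  # approximately 0.5
print(per_field_score(query2, document2))  # approximately 2.0
print(pooled_score(query2, document1) == pooled_score(query2, document2))
```

Both documents contain term1 twice overall, yet the per-field version scores them differently, while the pooled version treats them identically.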

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do
this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-13 Thread danield

Corrections: 

document2={field1:”term1”, field2:”term1”} 
Coord(query1,document2)= 1/1 = 1

(Doesn't affect the problem/observation)
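
With the corrected document2, the four coord values can be reproduced with a small Python sketch. It is only an illustration of the ratio "clauses matched over total clauses", with each field:term pair counted as a separate clause; the coord function below is a stand-in written for this example, not the Lucene implementation.

```python
def coord(query_clauses, document):
    """Fraction of (field, term) query clauses that match the document."""
    matched = sum(
        1
        for field, term in query_clauses
        if term in document.get(field, "").split()
    )
    return matched / len(query_clauses)

query1 = [("field1", "term1")]
query2 = [("field1", "term1"), ("field2", "term1")]

document1 = {"field1": "term1 term1", "field2": ""}
document2 = {"field1": "term1", "field2": "term1"}

for name, query in (("query1", query1), ("query2", query2)):
    for doc_name, doc in (("document1", document1), ("document2", document2)):
        print(f"Coord({name},{doc_name}) = {coord(query, doc)}")
```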



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307p4179370.html

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield

Hi Mike,

Thank you for your reply. Yes, I had thought of this, but it does not solve
my problem: the Term Frequency, and therefore the results, will still be
wrong, because prepending or appending a string to the term still makes it a
different term.

Similarly, I could use regex queries, but again that doesn't fix the TF
issue. I am not speaking hypothetically here: I have experimental evidence
that this doesn't work (i.e. the precision for my task goes down in my
experiments).

Also, I agree that when your fields are essentially different, as in /title/,
/author/ and /text/, normalizing by field length makes sense. In my case,
however, the fields are many and are all chunks of a larger text (extracted
sentences that have been labelled with a number of different classes). In
the experiments I am running, I am trying to establish whether weighting
sentences in different classes differently leads to increased relevance of
results.

This also doesn't change the fact that the documentation is wrong! Any ideas
on how to fix it?
Daniel



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307p4179834.html

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield

Oh thanks Mike, it did say somewhere. I guess it wouldn't hurt to make that
explanation more prominent, as I clearly missed it.

Never mind, I am working on my own solution for this, through subclassing
QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other
classes.

Cheers,
Daniel




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307p4179851.html

Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-19 Thread danield

Update: I have implemented my own subclasses of QueryParser, BooleanQuery,
BooleanScorer and Similarity to deal with this.

I have been successful in getting the exact behaviour I want... when
calling the .explain() method. However, the scores for some documents often
differ between IndexSearcher.search() and IndexSearcher.explain().

I am a bit confused by this. coord() seems to be one of the things I still
need to change, but it is not the only element of the formula that I have
evidently changed for the .explain() pipeline but not for .search().

The implementation of BulkScorer remains perplexing to me and I suspect it
is something in there I have missed. Any pointers?

Thanks!
Daniel


On 15 January 2015 at 23:00, Jack Krupansky-3 [via Lucene] <
ml-node+s472066n4179925...@n3.nabble.com> wrote:

> File a Jira for this particular doc fix since it is significant and not
> just mere wordsmithing. Better yet, submit a patch since that's Javadoc,
> although the exact form of the doc fix might be debatable, so a general
> description of the problem should be sufficient, unless you feel
> motivated.
>
> -- Jack Krupansky




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307p4180529.html

BulkScorer and .explain() compute scores separately?

2015-02-10 Thread danield

I have subclassed BooleanQuery and changed the BooleanWeight constructor
to alter the way the /coord/ and /idf/ components of the similarity
formula are computed, and my changes work as expected when calling
IndexSearcher.explain().

However, I now find that when just calling IndexSearcher.search(), the
scores reported for each document and resulting ranking are quite different
from what .explain() shows me.

What is going on? Clearly scores are computed somewhere else when done by
BulkScorer and not in BooleanQuery.BooleanWeight(). 

I have been looking at the code but it's mighty confusing and I still
haven't figured out how to make the same changes on this pipeline.

Please help!!
Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/BulkScorer-and-explain-compute-scores-separately-tp4185544.html
