Similarity formula documentation is misleading + how to make field-agnostic queries?
Hi all,

I have found, much to my dismay, that the documentation on Lucene’s default similarity formula is very dangerously misleading. See it here:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf

Term Frequency (TF) counts are expected to be per-document in the IR literature, and this documentation doesn’t say any differently. However, it turns out that for Lucene, TF scores are in fact PER-FIELD.

This furthermore applies to the /coord/ component. I realise that /coord/ is a ratio of query terms matched over total query terms, but I believe an effort could be made to make clear that field1:term1 and field2:term1 count as 2 different query terms.

As an example, for 2 documents with fields field1 and field2, where

query1=”field1:term1”
query2=”field1:term1 or field2:term1”

document1={field1:”term1 term1”, field2:””}
document2={field2:”term1”, field2:”term1”}

Coord(query1,document1)= 1/1 = 1
Coord(query2,document1)= 1/2 = 0.5
Coord(query1,document2)= 1/2 = 0.5
Coord(query2,document2)= 2/2 = 1

Now, the TF scores will be normalized with the fieldNorm component which is computed based on field length at indexing time and stored in a single byte, with a significant loss of precision.

These things together make it impossible to run Lucene retrieval in such a way that
*similarity(query2,document1) == similarity(query2,document2)*
which is precisely what I need in my use case.

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel

--
View this message in context: http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
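To make the precision loss mentioned above concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the norm is 1/sqrt(number of terms in the field), with no boosts, and that it is packed into one byte with 3 mantissa bits and 5 exponent bits, which is my understanding of what the default similarity in Lucene 4.x does. The class and method names (NormPrecisionSketch, encode, decode, storedNorm) are made up for this illustration and are not Lucene API.

```java
// Sketch only: models an 8-bit float (3 mantissa bits, 5 exponent bits),
// assumed here to match how Lucene 4.x's default similarity stores norms.
final class NormPrecisionSketch {

    // Pack a float into one byte by keeping only its top mantissa bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        int zeroPoint = (63 - 15) << 3;
        if (small <= zeroPoint) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative, or underflow
        }
        if (small >= zeroPoint + 0x100) {
            return (byte) -1; // overflow: clamp to the largest code
        }
        return (byte) (small - zeroPoint);
    }

    // Unpack the byte back into a float; the dropped mantissa bits are lost.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Assumed length norm without boosts: 1 / sqrt(terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // The value the scorer would see after the encode/decode round trip.
    static float storedNorm(int numTerms) {
        return decode(encode(lengthNorm(numTerms)));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.printf("terms=%d exact=%.4f stored=%.4f%n",
                    n, lengthNorm(n), storedNorm(n));
        }
    }
}
```

Under these assumptions a 2-term field has an exact norm of about 0.7071 but a stored norm of 0.625, and fields of 3 and 4 terms share the same stored norm of 0.5, so documents whose fields differ only slightly in length can end up with identical or noticeably shifted normalisation.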
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Corrections:

document2={field1:”term1”, field2:”term1”}
Coord(query1,document2)= 1/1 = 1

(Doesn't affect the problem/observation)
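The coord arithmetic, with the corrected document2, can be reproduced with a small plain-Java sketch. It only models the ratio described in the thread (matching clauses over total clauses, where each field:term pair is its own clause) and does not call Lucene. The names CoordSketch, QUERY1, QUERY2, DOC1, DOC2 and the whitespace tokenisation are assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: coord as (matching clauses) / (total clauses), where a clause
// is a {field, term} pair, so field1:term1 and field2:term1 count separately.
final class CoordSketch {

    static final List<String[]> QUERY1 =
            Arrays.asList(new String[][] {{"field1", "term1"}});
    static final List<String[]> QUERY2 =
            Arrays.asList(new String[][] {{"field1", "term1"}, {"field2", "term1"}});

    static final Map<String, String> DOC1 = doc("term1 term1", "");
    static final Map<String, String> DOC2 = doc("term1", "term1");

    // Build a two-field document mapping field name to its raw text.
    static Map<String, String> doc(String field1, String field2) {
        Map<String, String> d = new HashMap<>();
        d.put("field1", field1);
        d.put("field2", field2);
        return d;
    }

    // Count the clauses whose term occurs in the clause's own field.
    static float coord(List<String[]> query, Map<String, String> doc) {
        int overlap = 0;
        for (String[] clause : query) {
            String text = doc.getOrDefault(clause[0], "");
            if (Arrays.asList(text.split("\\s+")).contains(clause[1])) {
                overlap++;
            }
        }
        return overlap / (float) query.size();
    }

    public static void main(String[] args) {
        System.out.println("coord(query1, document1) = " + coord(QUERY1, DOC1));
        System.out.println("coord(query2, document1) = " + coord(QUERY2, DOC1));
        System.out.println("coord(query1, document2) = " + coord(QUERY1, DOC2));
        System.out.println("coord(query2, document2) = " + coord(QUERY2, DOC2));
    }
}
```

The point of interest is query2: document1 matches only one of the two clauses while document2 matches both, even though each document contains term1 twice overall.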
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Hi Mike,

Thank you for your reply. Yes, I had thought of this, but it is not a solution to my problem: the Term Frequency, and therefore the results, will still be wrong, as prepending or appending a string to the term will still make it a different term.

Similarly, I could use regex queries, but again that doesn't fix the TF issue. I am not speaking hypothetically here; I have experimental evidence that this doesn't work (i.e. the precision for my task goes down in my experiments).

Also, I agree that when your fields are essentially different, as in /title/, /author/ and /text/, normalizing by field length makes sense. But in my case my fields are many and are all chunks of a larger text (extracted sentences that have been labelled with a number of different classes), and in the experiments I am running I am trying to establish whether weighting sentences in different classes differently will lead to increased relevance of results.

This also doesn't change the fact that the documentation is wrong! Any ideas how to fix?

Daniel
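The TF issue can be sketched with the two example documents from the first message (document2 as corrected). The snippet below is plain Java and assumes tf(freq) = sqrt(freq), the default TF-IDF term-frequency factor, while ignoring idf, norms, coord and boosts. "Pooled" TF is just one possible reading of a field-agnostic score, and the names TfSketch, perFieldTf and pooledTf are made up for the illustration.

```java
// Sketch only: contrasts per-field TF (summed over fields) with a pooled,
// per-document TF, assuming tf(freq) = sqrt(freq) and nothing else.
final class TfSketch {

    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Occurrences of term in whitespace-tokenised text.
    static int count(String text, String term) {
        int n = 0;
        for (String token : text.split("\\s+")) {
            if (token.equals(term)) {
                n++;
            }
        }
        return n;
    }

    // Per-field behaviour: each field is scored on its own, then summed.
    static float perFieldTf(String term, String... fields) {
        float sum = 0f;
        for (String field : fields) {
            sum += tf(count(field, term));
        }
        return sum;
    }

    // Field-agnostic alternative: pool the counts first, then apply tf once.
    static float pooledTf(String term, String... fields) {
        int total = 0;
        for (String field : fields) {
            total += count(field, term);
        }
        return tf(total);
    }

    public static void main(String[] args) {
        // document1 = {field1: "term1 term1", field2: ""}
        // document2 = {field1: "term1", field2: "term1"}
        System.out.println("per-field doc1: " + perFieldTf("term1", "term1 term1", ""));
        System.out.println("per-field doc2: " + perFieldTf("term1", "term1", "term1"));
        System.out.println("pooled    doc1: " + pooledTf("term1", "term1 term1", ""));
        System.out.println("pooled    doc2: " + pooledTf("term1", "term1", "term1"));
    }
}
```

With per-field scoring the two documents get different TF contributions (sqrt(2) versus 1 + 1), while pooling the counts first gives both documents sqrt(2), which is the equality being asked for.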
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Oh thanks Mike, it did say somewhere. I guess it wouldn't hurt to make that explanation more prominent, as I clearly missed it.

Never mind, I am working on my own solution for this, through subclassing QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other classes.

Cheers,
Daniel
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Update: I have implemented my own subclasses of QueryParser, BooleanQuery, BooleanScorer and Similarity to deal with this. I have been successful in getting the exact behaviour I want... when calling the .explain() method.

However, the scores for some documents often differ when calling IndexSearcher.search() vs IndexSearcher.explain(). I am a bit confused by this. The coord() seems to be one of the things I need to change, but it is not the only element of the formula that I have clearly changed for the .explain() pipeline but not for .search().

The implementation of BulkScorer remains perplexing to me and I suspect it is something in there I have missed. Any pointers?

Thanks!
Daniel

On 15 January 2015 at 23:00, Jack Krupansky-3 [via Lucene] wrote:
> File a Jira for this particular doc fix since it is significant and not
> just mere wordsmithing. Better yet, submit a patch since that's Javadoc,
> although the exact form of the doc fix might be debatable, so a general
> description of the problem should be sufficient, unless you feel
> motivated.
>
> -- Jack Krupansky
BulkScorer and .explain() compute scores separately?
I have subclassed BooleanQuery and changed the BooleanWeight constructor to change the way the /coord/ and /idf/ components of the similarity formula are computed, and my changes work as expected when calling IndexSearcher.explain().

However, I now find that when just calling IndexSearcher.search(), the scores reported for each document and the resulting ranking are quite different from what .explain() shows me. What is going on? Clearly scores are computed somewhere else when done by BulkScorer and not in BooleanQuery.BooleanWeight().

I have been looking at the code but it's mighty confusing and I still haven't figured out how to make the same changes in this pipeline. Please help!!

Thanks!