Hi,

I have a question about payload queries and document boosts. We are using Lucene 3.2 and payload queries, with our own PayloadSimilarity class, which overrides the scorePayload method like so:

{code}
@Override
public float scorePayload(int docId, String fieldName, int start, int end,
    byte[] payload, int offset, int length) {
  if (payload != null) {
    return PayloadHelper.decodeFloat(payload, offset);
  } else {
    return 1.0F;
  }
}
{/code}

We are injecting payloads as ID$SCORE pairs using the DelimitedPayloadTokenFilter, and life was good: when we ran PayloadTermQuery() the scores came back as our score. I have included code below that illustrates the calling pattern; it is this:

{code}
PayloadTermQuery q = new PayloadTermQuery(new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
{/code}

i.e., do not include the span score (the SCORE is calculated as a result of offline processing and we can't change that value).

Now we would like to boost each document differently (at index time, with document.setBoost(boost), based on its content type), and we are running into problems. It looks like the document boost is not applied to the document score during search if includeSpanScore==false. When we set it to true, we see a difference in scores (the original score without document boosts is multiplied by the document boost set), but the original score without boost is no longer the same as SCORE, i.e., it is now affected by the span score.

My question is: is there some method in DefaultSimilarity that I can override so that my score is my original SCORE * document boost? The Similarity documentation does not provide any clues to my problem. I tried modifying the computeNorm() method to return state.getBoost(), but it looks like it is never called.

If not, the other option would be to bake the doc boost into the SCORE value, by multiplying them on their way into Lucene, so that now SCORE *= doc boost.
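To make that second option concrete, here is a rough sketch of what I have in mind. It is plain Java with no Lucene dependency, and the class name PayloadBoostBaker and the method bake are made-up names for illustration only. It assumes the field value is a whitespace-separated list of ID$SCORE tokens, and it rewrites each SCORE as SCORE * boost before the text is handed to the DelimitedPayloadTokenFilter. Tokens without the delimiter are passed through unchanged.

```java
public class PayloadBoostBaker {

    // Hypothetical helper: multiplies the SCORE part of every ID<delimiter>SCORE
    // token in fieldValue by boost and returns the rewritten field value.
    public static String bake(String fieldValue, char delimiter, float boost) {
        StringBuilder out = new StringBuilder();
        for (String token : fieldValue.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            int pos = token.lastIndexOf(delimiter);
            if (pos < 0) {
                // No payload on this token, so leave it as it is.
                out.append(token);
                continue;
            }
            float score = Float.parseFloat(token.substring(pos + 1));
            // Keep "ID<delimiter>" and append the boosted score.
            out.append(token, 0, pos + 1).append(score * boost);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same field value and boost as doc1 in the unit test below.
        System.out.println(bake("2790917$52.01 2790926$53.18", '$', 1.2F));
    }
}
```

The downside of this approach is that the payload in the index would no longer hold the raw offline SCORE, only the boosted value.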

Here is my unit test, which illustrates the issue:

{code}
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;

import com.healthline.query.kb.ConceptAnalyzer;
import com.healthline.solr.HlSolrConstants;
import com.healthline.solr.search.PayloadSimilarity;
import com.healthline.util.Config;

public class DocBoostTest {

  // Splits on whitespace and turns the part after '$' into a float payload.
  private class PayloadAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream tokens = new WhitespaceTokenizer(HlSolrConstants.CURRENT_VERSION, reader);
      tokens = new DelimitedPayloadTokenFilter(tokens, '$', new FloatEncoder());
      return tokens;
    }
  }

  // Uses the payload analyzer for imuids_p and ConceptAnalyzer for everything else.
  private Analyzer getAnalyzer() {
    Map<String,Analyzer> pfas = new HashMap<String,Analyzer>();
    pfas.put("imuids_p", new PayloadAnalyzer());
    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
      new ConceptAnalyzer(), pfas);
    return analyzer;
  }

  // Builds a two-document in-memory index, optionally with index-time doc boosts.
  private IndexSearcher loadTestData(boolean setBoosts) throws Exception {
    RAMDirectory ramdir = new RAMDirectory();
    IndexWriterConfig iwconf = new IndexWriterConfig(
      HlSolrConstants.CURRENT_VERSION, getAnalyzer());
    iwconf.setOpenMode(OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(ramdir, iwconf);
    Document doc1 = new Document();
    doc1.add(new Field("itemtitle", "Cancer and the Nervous System PARANEOPLASTIC DISORDERS", Store.YES, Index.ANALYZED));
    doc1.add(new Field("imuids_p", "2790917$52.01 2790926$53.18", Store.YES, Index.ANALYZED));
    doc1.add(new Field("contenttype", "BK", Store.YES, Index.NOT_ANALYZED));
    if (setBoosts) doc1.setBoost(1.2F);
    writer.addDocument(doc1);
    Document doc2 = new Document();
    doc2.add(new Field("itemtitle", "Esophagogastric cancer: Targeted agents", Store.YES, Index.ANALYZED));
    doc2.add(new Field("imuids_p", "2790926$52.18 2790981$5.19", Store.YES, Index.ANALYZED));
    doc2.add(new Field("contenttype", "JL", Store.YES, Index.NOT_ANALYZED));
    if (setBoosts) doc2.setBoost(1.5F);
    writer.addDocument(doc2);
    writer.commit();
    writer.close();
    return new IndexSearcher(ramdir);
  }

  @Test
  public void testConceptScoringWithoutBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(false);
    searcher.setSimilarity(new PayloadSimilarity());
    // Last argument is includeSpanScore.
    PayloadTermQuery q = new PayloadTermQuery(new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Concept result without boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType + " (" + hits[i].score + ")");
    }
    searcher.close();
  }

  @Test
  public void testConceptScoringWithContentTypeBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(true);
    searcher.setSimilarity(new PayloadSimilarity());
    // Last argument is includeSpanScore.
    PayloadTermQuery q = new PayloadTermQuery(new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Concept result with boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType + " (" + hits[i].score + ")");
    }
    searcher.close();
  }

  @Test
  public void testFulltextScoringWithoutBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(false);
    QueryParser parser = new QueryParser(HlSolrConstants.CURRENT_VERSION, "itemtitle", getAnalyzer());
    Query q = parser.parse("cancer");
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Fulltext result without boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType + " (" + hits[i].score + ")");
    }
    searcher.close();
  }

  @Test
  public void testFulltextScoringWithContentTypeBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(true);
    QueryParser parser = new QueryParser(HlSolrConstants.CURRENT_VERSION, "itemtitle", getAnalyzer());
    Query q = parser.parse("cancer");
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Fulltext result with boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType + " (" + hits[i].score + ")");
    }
    searcher.close();
  }
}
{/code}

With includeSpanScore==false, I get the following results from this unit test.
The scores are the same as what I put in, but the document boost has no effect.

{code}
[junit] ------------- Standard Output ---------------
[junit] Concept result without boosting
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (53.18)
[junit] 1: Esophagogastric cancer: Targeted agents/JL (52.18)
[junit] Concept result with boosting
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (53.18)
[junit] 1: Esophagogastric cancer: Targeted agents/JL (52.18)
[junit] Fulltext result without boosting
[junit] 1: Esophagogastric cancer: Targeted agents/JL (0.2972674)
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.26010898)
[junit] Fulltext result with boosting
[junit] 1: Esophagogastric cancer: Targeted agents/JL (0.4459011)
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.2972674)
[junit] ------------- ---------------- ---------------
{/code}

With includeSpanScore==true, I get the following results. This time the doc boosts do affect the payload query scores, but the original scores (before boosting) are different from the SCORE values I put in.

{code}
[junit] Concept result without boosting
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (13.973032)
[junit] 1: Esophagogastric cancer: Targeted agents/JL (13.710282)
[junit] Concept result with boosting
[junit] 1: Esophagogastric cancer: Targeted agents/JL (21.936451)
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (16.767637)
[junit] Fulltext result without boosting
[junit] 1: Esophagogastric cancer: Targeted agents/JL (0.2972674)
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.26010898)
[junit] Fulltext result with boosting
[junit] 1: Esophagogastric cancer: Targeted agents/JL (0.4459011)
[junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.2972674)
[junit] ------------- ---------------- ---------------
{/code}

TIA for any help you can provide.
-sujit