Yes, this is more or less what I had in mind. However, this approach requires some *prior knowledge* of the vocabulary of the document (or the collection) to produce that score before the document even gets analyzed, doesn't it? And this is the paradox I have been thinking about. If you have that knowledge, that's fine. In addition, for applications that only require a small term window to generate a score (such as a term-in-context score), this can be implemented very easily.
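One way to supply that prior knowledge without hard-wiring it into the analyzer is to hand the analysis chain a small callback that owns the vocabulary knowledge. A minimal sketch of what I mean (all names here are hypothetical, not Lucene API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical callback the analysis chain could consult per term.
// The implementation owns whatever prior vocabulary knowledge exists.
interface TermScoreProvider {
    int scoreFor(String term);
}

// A map-backed provider for the case where per-term scores are known
// up front (e.g. computed in a pre-pass over the document).
class MapTermScoreProvider implements TermScoreProvider {
    private final Map<String, Integer> scores;
    private final int defaultScore;

    MapTermScoreProvider(Map<String, Integer> scores, int defaultScore) {
        this.scores = scores;
        this.defaultScore = defaultScore;
    }

    public int scoreFor(String term) {
        Integer s = scores.get(term);
        return s != null ? s : defaultScore;
    }
}
```

A TokenStream holding such a provider could then ask `scoreFor(term)` for each token instead of carrying the score map itself, which keeps the "where do scores come from" question out of the tokenizer.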
It is possible to inject the document-dependent boost/score generation *logic* (an interface would do) into the Tokenizer/TokenStream. However, I am afraid this may incur an indexing-time penalty. If your window size is the document itself, you will be doing the same job twice (calculating the number of times a term occurs in doc X, index-time weights, etc.). IndexWriter already does these somewhere down deep.

Simply put, I want to add some scores to documents/terms, but I can't generate that score before I observe the document/terms. If I do, I will replicate some of the work that is already being done by IndexWriter. If I remember correctly, there is also some intention to add document payloads functionality, and I have the same concerns about that. So I think we need a clear view on the topic. Where is the payload work moving? How can we generate a score without duplicating some of the work that IndexWriter is doing? I guess Michael Busch is working on document payloads for release 3.0. I would appreciate it if someone could enlighten us on how that would work and could be utilised, particularly during the analysis phase.

Cheers,
Murat

> Thanks, Murat.
> It was very useful - I also tried to override IndexWriter and
> DocumentsWriter instead, but it didn't work well. DocumentsWriter can't be
> overridden.
>
> So, I didn't find a better way to make the changes.
>
> My need is to have different values for every term in different documents.
> So, like you set the boost at the document level, I would like to set the
> boost for different terms within different documents.
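As an aside on how such a per-term score actually ends up in the index: a payload is just raw bytes, so the score has to be serialized explicitly, and a single byte only covers -128..127. A fixed-width big-endian int is a safer encoding. This is only a Lucene-independent sketch of the byte packing, not the payload API itself:

```java
// Encode an int score into 4 big-endian bytes suitable for storing as a
// payload, and decode it back on the search side.
class ScoreCodec {
    static byte[] encode(int score) {
        return new byte[] {
            (byte) (score >>> 24),
            (byte) (score >>> 16),
            (byte) (score >>> 8),
            (byte) score
        };
    }

    static int decode(byte[] b) {
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }
}
```

The 4 bytes from `encode(score)` would be what goes into the payload, and `decode` is what a payload-aware scorer would apply to the bytes it reads back.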
>
> For that matter, I made some changes in the code you sent - (I coloured the
> changes in red):
>
> Below you can find an example for the use of it
>
> **********
> private class PayloadAnalyzer extends Analyzer
> {
>     private PayloadTokenStream payToken = null;
>     private int score;
>     *private Map<String, Integer> scoresMap = new HashMap<String, Integer>();*
>
>     public synchronized void setScore(int s)
>     {
>         score = s;
>     }
>
>     *public synchronized void setMapScores(Map<String, Integer> scoresMap)
>     {
>         this.scoresMap = scoresMap;
>     }*
>
>     public final TokenStream tokenStream(String field, Reader reader)
>     {
>         payToken = new PayloadTokenStream(new WhitespaceTokenizer(reader)); //new LowerCaseTokenizer(reader));
>         payToken.setScore(score);
>         payToken.setMapScores(scoresMap);
>         return payToken;
>     }
> }
>
> private class PayloadTokenStream extends TokenStream
> {
>     private Tokenizer tok = null;
>     private int score;
>     *private Map<String, Integer> scoresMap = new HashMap<String, Integer>();*
>
>     public PayloadTokenStream(Tokenizer tokenizer)
>     {
>         tok = tokenizer;
>     }
>
>     public void setScore(int s)
>     {
>         score = s;
>     }
>
>     *public synchronized void setMapScores(Map<String, Integer> scoresMap)
>     {
>         this.scoresMap = scoresMap;
>     }*
>
>     public Token next(Token t) throws IOException
>     {
>         t = tok.next(t);
>         if (t != null)
>         {
>             //t.setTermBuffer("can change");
>             //Do something with the data
>             byte[] bytes = ("score:" + score).getBytes();
>             // t.setPayload(new Payload(bytes));
>             *String word = String.copyValueOf(t.termBuffer(), 0, t.termLength());
>             int score = scoresMap.get(word);
>             byte payLoad = Byte.parseByte(String.valueOf(score));
>             t.setPayload(new Payload(new byte[] { Byte.valueOf(payLoad) }));*
>         }
>         return t;
>     }
>
>     public void reset(Reader input) throws IOException
>     {
>         tok.reset(input);
>     }
>
>     public void close() throws IOException
>     {
>         tok.close();
>     }
> }
> **********************************
> *Example for the use of payloads:*
>
> PayloadAnalyzer panalyzer = new PayloadAnalyzer();
> File index = new File("" + "TestSearchIndex");
> IndexWriter iwriter = new IndexWriter(index, panalyzer);
> Document d = new Document();
> d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
>     Field.Index.TOKENIZED, Field.TermVector.YES));
> d.add(new Field("id", "1^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
>     Field.TermVector.NO));
> Map<String, Integer> mapScores = new HashMap<String, Integer>();
> mapScores.put("word1", 3);
> mapScores.put("word2", 1);
> mapScores.put("word3", 1);
> panalyzer.setMapScores(mapScores);
> iwriter.addDocument(d, panalyzer);
> d = new Document();
> d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
>     Field.Index.TOKENIZED, Field.TermVector.YES));
> d.add(new Field("id", "2^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
>     Field.TermVector.NO));
> //We set the score for the term of the document that will be analyzed.
> /*I was worried about this part - document dependent score
>   which may be utilized*/
> mapScores = new HashMap<String, Integer>();
> mapScores.put("word1", 1);
> mapScores.put("word2", 3);
> mapScores.put("word3", 1);
> panalyzer.setMapScores(mapScores);
> iwriter.addDocument(d, panalyzer);
> /*-----------------*/
> // iwriter.commit();
> iwriter.optimize();
> iwriter.close();
> BooleanQuery bq = new BooleanQuery();
> BoostingTermQuery tq = new BoostingTermQuery(new Term("text", "word1"));
> tq.setBoost((float) 1.0);
> bq.add(tq, BooleanClause.Occur.MUST);
> tq = new BoostingTermQuery(new Term("text", "word2"));
> tq.setBoost((float) 3);
> bq.add(tq, BooleanClause.Occur.SHOULD);
> tq = new BoostingTermQuery(new Term("text", "word3"));
> tq.setBoost((float) 1);
> bq.add(tq, BooleanClause.Occur.SHOULD);
> IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
> searcher1.setSimilarity(new WordsSimilarity());
> TopDocs topDocs = searcher1.search(bq, null, 3);
> Hits hits1 = searcher1.search(bq);
> for (int j = 0; j < hits1.length(); j++)
> {
>     Explanation explanation = searcher1.explain(bq, j);
>     System.out.println("**** " + hits1.score(j) + " " +
>         hits1.doc(j).getField("id").stringValue() + " *****");
>     System.out.println(explanation.toString());
>     explanation.getValue();
>     System.out.println("********************************************************");
>     System.out.println("Score " + topDocs.scoreDocs[j].score + " doc " +
>         searcher1.doc(topDocs.scoreDocs[j].doc).getField("id").stringValue());
>     System.out.println("********************************************************");
> }
>
> If you try the same query with different boosting, you will get a different
> order for the documents.
>
> Does it look ok?
>
> Thanks again!
> Liat
>
> 2009/4/25 Murat Yakici <murat.yak...@cis.strath.ac.uk>
>
>> Here is what I am doing, not so magical... There are two classes, an
>> analyzer and a TokenStream in which I can inject my document dependent
>> data to be stored as payload.
>>
>> private PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>>
>> private class PayloadAnalyzer extends Analyzer {
>>
>>     private PayloadTokenStream payToken = null;
>>     private int score;
>>
>>     public synchronized void setScore(int s) {
>>         score = s;
>>     }
>>
>>     public final TokenStream tokenStream(String field, Reader reader) {
>>         payToken = new PayloadTokenStream(new LowerCaseTokenizer(reader));
>>         payToken.setScore(score);
>>         return payToken;
>>     }
>> }
>>
>> private class PayloadTokenStream extends TokenStream {
>>
>>     private Tokenizer tok = null;
>>     private int score;
>>
>>     public PayloadTokenStream(Tokenizer tokenizer) {
>>         tok = tokenizer;
>>     }
>>
>>     public void setScore(int s) {
>>         score = s;
>>     }
>>
>>     public Token next(Token t) throws IOException {
>>         t = tok.next(t);
>>         if (t != null) {
>>             //t.setTermBuffer("can change");
>>             //Do something with the data
>>             byte[] bytes = ("score:" + score).getBytes();
>>             t.setPayload(new Payload(bytes));
>>         }
>>         return t;
>>     }
>>
>>     public void reset(Reader input) throws IOException {
>>         tok.reset(input);
>>     }
>>
>>     public void close() throws IOException {
>>         tok.close();
>>     }
>> }
>>
>> public void doIndex() {
>>     try {
>>         File index = new File("./TestPayloadIndex");
>>         IndexWriter iwriter = new IndexWriter(index, panalyzer,
>>             IndexWriter.MaxFieldLength.UNLIMITED);
>>
>>         Document d = new Document();
>>         d.add(new Field("content",
>>             "Everyone, someone, myTerm, yourTerm", Field.Store.YES,
>>             Field.Index.ANALYZED, Field.TermVector.YES));
>>         //We set the score for the term of the document that will be analyzed.
>>         /*I was worried about this part - document dependent score
>>           which may be utilized*/
>>         panalyzer.setScore(5);
>>         iwriter.addDocument(d, panalyzer);
>>         /*-----------------*/
>>         ...
>>         iwriter.commit();
>>         iwriter.optimize();
>>         iwriter.close();
>>
>>         //Now read the index
>>         IndexReader ireader = IndexReader.open(index);
>>         TermPositions tpos = ireader.termPositions(
>>             new Term("content", "myterm")); //Note LowercaseTokenizer
>>         while (tpos.next()) {
>>             int pos;
>>             for (int i = 0; i < tpos.freq(); i++) {
>>                 pos = tpos.nextPosition();
>>                 if (tpos.isPayloadAvailable()) {
>>                     byte[] data = new byte[tpos.getPayloadLength()];
>>                     tpos.getPayload(data, 0);
>>                     //Utilise payloads;
>>                 }
>>             }
>>         }
>>
>>         tpos.close();
>>     } catch (CorruptIndexException ex) {
>>         //
>>     } catch (LockObtainFailedException ex) {
>>         //
>>     } catch (IOException ex) {
>>         //
>>     }
>> }
>>
>> I wish it was designed better... Please let me know if you guys have a
>> better idea.
>>
>> Cheers,
>> Murat
>>
>> > Dear Murat,
>> >
>> > I saw your question and wondered how you implemented these changes?
>> > The requirements below are the same ones as I am trying to code now.
>> > Did you modify the source code itself, or did you only use Lucene's jar
>> > and just override code?
>> >
>> > I would very much appreciate it if you could give me a short explanation
>> > on how it was done.
>> >
>> > Thanks a lot,
>> > Liat
>> >
>> > 2009/4/21 Murat Yakici <murat.yak...@cis.strath.ac.uk>
>> >
>> >> Hi,
>> >>
>> >> I started playing with the experimental payload functionality. I have
>> >> written an analyzer which adds a payload (some sort of a score/boost)
>> >> for each term occurrence. The payload/score for each term is dependent
>> >> on the document that the term comes from (I guess this is the typical
>> >> use case). So say term t1 may have a payload of 5 in doc1 and 34 in
>> >> doc5. The parameter for calculating the payload changes after each
>> >> indexWriter.addDocument(..) method call in a while loop. I am assuming
>> >> that the indexWriter.addDocument(..) methods are thread safe. Can I
>> >> confirm this?
>> >>
>> >> Cheers,
>> >>
>> >> --
>> >> Murat Yakici
>> >> Department of Computer & Information Sciences
>> >> University of Strathclyde
>> >> Glasgow, UK
Murat Yakici
Department of Computer & Information Sciences
University of Strathclyde
Glasgow, UK
-------------------------------------------
The University of Strathclyde is a charitable body, registered in Scotland,
with registration number SC015263.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org