Thanks Eran, I tried it: I added the classes copied below and tried to run the following code.
[I also have a question below about the usage of synonyms and BooleanQuery.]

DoubleMap wordMap = new DoubleMap();
wordMap.insert("1", 1, 5); // word "1" appears in world 1, 5 times
wordMap.insert("1", 2, 2); // word "1" appears in world 2, 2 times
wordMap.insert("1", 3, 7);
wordMap.insert("1", 4, 1);
wordMap.insert("2", 3, 1); // word "2" appears in world 3, 1 time
wordMap.insert("2", 5, 1);
wordMap.insert("2", 6, 1);
wordMap.insert("3", 3, 1);
wordMap.insert("3", 4, 1);
wordMap.insert("3", 8, 1);

ioManager io = new ioManager();
io.index(wordMap, "TestSearchIndex", "", "1");
IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
searcher.setSimilarity(new WordsSimilarity()); // WordsSimilarity is written below
Query btq = new BoostingTermQuery(new Term(WordIndex.FIELD_WORLDS, "3 3 2 1"));
Hits wordsHits = searcher.search(btq);

For some reason the hits size is 0 and none of the methods overridden in WordsSimilarity is called (I put a breakpoint and it didn't get there during search time).

public class WordsAnalyzer extends Analyzer {
    public Map<String, Map<String, Integer>> wordsWorldsFreq = new HashMap<String, Map<String, Integer>>();
    public Map<String, Integer> worldsFreq = new HashMap<String, Integer>();

    public WordsAnalyzer() {
    }

    public WordsAnalyzer(Map<String, Integer> worldsFreq) throws IOException {
        this.worldsFreq = worldsFreq;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new WordsFilter(new StandardTokenizer(reader), worldsFreq);
    }
}

public class WordsFilter extends TokenFilter {
    public Map<String, Integer> worldsFreq;

    public WordsFilter(TokenStream in, Map<String, Integer> worldsFreq) {
        super(in);
        this.worldsFreq = worldsFreq;
    }

    public final Token next(Token result) throws IOException {
        byte payLoad = 1;
        try {
            result = input.next(result);
            if (result != null) {
                String word = String.copyValueOf(result.termBuffer(), 0, result.termLength());
                payLoad = Byte.parseByte(worldsFreq.get(word).toString());
                result.setPayload(new Payload(new byte[] { Byte.valueOf(payLoad) }));
                return result;
            } else {
                return null;
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println(result.termBuffer() + " " + payLoad);
            FileUtil.writeToFile("IndexProblems.txt", "WordsFilter problem for " + result.termBuffer() + " " + payLoad + " : " + e.getStackTrace());
            return null;
        }
    }
}

public class WordsSimilarity extends DefaultSimilarity {
    public WordsSimilarity() {
    }

    public float tf(float freq) {
        return super.tf(freq); // just wanted to check whether it is called
    }

    public float scorePayload(byte[] payload, int offset, int length) {
        // if(length == 1)
        // {
        return payload[offset];
        // }
    }
}

For the synonyms with the weights, I tried the following code:

BooleanQuery bq = new BooleanQuery();
TermQuery tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "3"));
tq.setBoost((float) 1.0);
bq.add(bq, BooleanClause.Occur.MUST);
tq = new TermQuery(new Term(WordIndex.FIELD_WORLDS, "2"));
tq.setBoost((float) 0.5);
bq.add(bq, BooleanClause.Occur.SHOULD);
IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
Hits hits1 = searcher1.search(bq);

And got the following error. Any idea what the problem is?

at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:385)
Process exited.

Thanks,
Liat

2009/4/21 Eran Sevi <erans...@gmail.com>

> Hi,
>
> You might want to take a look at Payloads. If you know the frequency of the
> words in each world in advance, then during tokenization for each world you
> could save the frequency as the payload.
>
> During searches you could use BoostingTermQuery to take the frequency into
> account.
>
> Eran.
>
> On Tue, Apr 21, 2009 at 4:44 PM, liat oren <oren.l...@gmail.com> wrote:
>
> > Hi Doron,
> >
> > Thank you very much for the elaborate answer!
> >
> > About the Synonyms, I can't use Wordnet as I have my own list of synonyms.
> > I will look at contrib/memory and see what it does.
> >
> > You understood correctly the process of using the inverse doc. About the
> > two problems you mentioned, scalability and ignoring the vicinity of words:
> >
> > Scalability - this is the reason I wanted to set the frequencies of the
> > terms. The frequencies will be used at this stage, not at the stage of
> > using the synonyms. When I use the synonyms, I want to use the score as
> > you suggested below.
> > Here, I have, for every word, the worlds in which it appears. Currently
> > every world appears once in a word. However, I would like it to appear as
> > many times as the frequency of the word in the world. To avoid writing the
> > world several times in the world field, I would like to be able to set the
> > freq of the specific world according to the freq of the word in this world
> > without actually writing it x times (for scalability, index size and
> > performance reasons).
> > So if dog appears 10 times in world 1 and 5 times in world 2, and cat
> > appears 5 times in world 1, then I want these frequencies to be taken into
> > account when computing how close the words dog and cat are. BUT I don't
> > want to write world 1 10 times in word dog and 5 times in word cat, but
> > only once, and to update the termVector so that the frequency will be 10
> > and 5 respectively.
> > So the *generation* of the synonyms will take the frequencies into account.
> >
> > The vicinity of words - is there any better way to take it into account?
> >
> > About the suggestion of using term boosting that will use the score of the
> > synonyms - if I want to query "big white dogs" and I have the following
> > synonyms:
> > big - big (1.0), large (0.9), huge (0.6)
> > white - white (1.0), color (0.5), offwhite (0.8)
> > dog - dog (1.0)
> > So is this the way to do it?
> >
> > BooleanQuery bq = new BooleanQuery();
> > TermQuery tq = new TermQuery(new Term("text", "big"));
> > tq.setBoost((float)1.0);
> > bq.add(bq, false, false);
> > tq = new TermQuery(new Term("text", "large"));
> > tq.setBoost((float)0.9);
> > bq.add(bq, false);
> > tq = new TermQuery(new Term("text", "huge"));
> > tq.setBoost((float)0.6);
> > bq.add(bq, false);
> >
> > tq = new TermQuery(new Term("text", "white"));
> > tq.setBoost((float)1.0);
> > bq.add(bq, false);
> > tq = new TermQuery(new Term("text", "color"));
> > tq.setBoost((float)0.5);
> > bq.add(bq, false);
> > // etc
> > IndexSearcher searcher = new IndexSearcher("TestSearchIndex");
> > Hits hits = searcher.search(bq);
> >
> > How will the use of BooleanQuery also look at the position of the words? I
> > remember reading about a score that also takes the position of the term
> > into account, but I didn't see this factor in the score formula.
> >
> > Thanks again, it is very helpful,
> > Liat
> >
> > 2009/4/21 Doron Cohen <cdor...@gmail.com>
> >
> > > Hi Liat, there are two packages under Lucene's contrib that deal with
> > > synonyms - contrib/memory and contrib/wordnet - which you
> > > may find useful. I never used these two but they seem relevant to what
> > > you are trying to achieve.
> > >
> > > Anyhow, it seems that you take the synonyms of a word w to be those
> > > that appear in the same set of documents ('worlds') as w, and you find
> > > this set by (a) indexing an inverse of the collection (docs become words
> > > and words become docs) and (b) using docs(w) as a query to find syns(w).
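
As a rough illustration of the inverse-collection idea described just above, here is a minimal plain-Java sketch. It is not Lucene code and does not reproduce Lucene's scoring. It keeps word -> world -> frequency in a map and measures the closeness of two words as the sum, over the worlds they share, of the product of their frequencies. The class WorldOverlap and its methods are hypothetical names, and the dot-product measure is an assumption chosen for simplicity; the sample data is the wordMap from the test code at the top of this thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a tiny in-memory "inverse" index.
class WorldOverlap {
    // word -> (world id -> frequency of the word in that world)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    void insert(String word, int world, int freq) {
        index.computeIfAbsent(word, k -> new HashMap<>()).put(world, freq);
    }

    // Assumed closeness measure: dot product of the two frequency vectors,
    // so each world is stored once but still counts with its frequency.
    long closeness(String a, String b) {
        Map<Integer, Integer> wa = index.getOrDefault(a, Collections.<Integer, Integer>emptyMap());
        Map<Integer, Integer> wb = index.getOrDefault(b, Collections.<Integer, Integer>emptyMap());
        long sum = 0;
        for (Map.Entry<Integer, Integer> e : wa.entrySet()) {
            Integer other = wb.get(e.getKey());
            if (other != null) {
                sum += (long) e.getValue() * other;
            }
        }
        return sum;
    }

    // Same data as the wordMap test in this thread.
    static WorldOverlap sample() {
        WorldOverlap w = new WorldOverlap();
        w.insert("1", 1, 5);
        w.insert("1", 2, 2);
        w.insert("1", 3, 7);
        w.insert("1", 4, 1);
        w.insert("2", 3, 1);
        w.insert("2", 5, 1);
        w.insert("2", 6, 1);
        w.insert("3", 3, 1);
        w.insert("3", 4, 1);
        w.insert("3", 8, 1);
        return w;
    }
}
```

With this data, word "1" shares only world 3 with word "2" but shares worlds 3 and 4 with word "3", so under this assumed measure "3" comes out slightly closer to "1" than "2" does.
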
> > >
> > > I assume that your 'worlds' are small, each containing only a small
> > > set of a few related words, otherwise I would have two
> > > concerns with this approach: (a) scalability (b) in a large doc (world)
> > > this approach ignores the vicinity of words, which seems to me important
> > > to their likelihood as synonyms.
> > >
> > > Assuming you are okay here, and going back to the original question of
> > > altering the term frequency, perhaps taking the (search) scores of the
> > > returned synonyms (which you find by search) is better than just
> > > using their frequency? If you find this approach valid, then at least for
> > > some queries you should be able to use query boosts. For example,
> > > create a BooleanQuery, add to it a TermQuery for each synonym,
> > > but set the boost of the TermQuery according to the synonym score.
> > > This is also where you could "punish" synonyms compared to the
> > > original word. This will only help with queries whose construction API
> > > takes (sub) queries as input (so it will not help with a PhraseQuery).
> > >
> > > - Doron
> > >
> > > On Tue, Apr 21, 2009 at 12:40 PM, liat oren <oren.l...@gmail.com> wrote:
> > >
> > > > Ok, I will explain the full 'problem' and then explain how I approach it:
> > > >
> > > > Let's divide it into three steps:
> > > >
> > > > 1. I have a 'dictionary' of words - for every word, I have a list of
> > > > worlds, which are ids of text documents that the word appears in.
> > > > So, for example, for the word 'dog', I have '1 1600 36000' in the "worlds"
> > > > field (which is tokenized when indexed) - which means that the word dog
> > > > appears in worlds 1, 1600 and 36000.
> > > >
> > > > 2. This index is used to choose synonyms for the word dog - using the
> > > > "worlds" field - I do a search on this index, giving the query "1 1600
> > > > 36000" as input and thus get the words that are close to the word "dog".
> > > > I take the 10 closest words.
> > > >
> > > > 3. These 10 synonyms are then used to expand the query.
> > > >
> > > > Basically, I have 2 problems in this process:
> > > >
> > > > a. In the process of finding the synonyms, I would like the frequency
> > > > of the word in each of the worlds to be taken into account, so that if
> > > > 'dog' appeared 3 times in world 1, 10 times in world 1600 and 4 times in
> > > > world 36000, then it will be taken into account.
> > > > I wanted to avoid "expanding" the field to be "1 1 1 1600 1600 1600 1600
> > > > 1600 1600 1600 1600 1600 1600 36000 36000 36000 36000". Accordingly I
> > > > wanted to be able to set the freq by myself.
> > > >
> > > > b. In the process of using the synonyms, I wanted to be able to set a
> > > > 'penalty' factor for the synonyms, together with giving different weights
> > > > to different synonyms, according to their score. I looked at an old thread -
> > > > Search for synonyms - implementation for review:
> > > >
> > > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200603.mbox/%3c39b0fb508e5d7540aca5ad57225e150d392...@xmail.me.corp.entopia.com%3e
> > > >
> > > > I don't know if it's part of Lucene now. I didn't quite understand how to
> > > > use it.
> > > > Is there a better way to approach it?
> > > >
> > > > I hope I explained it well.
> > > > Thanks,
> > > > Liat
> > > >
> > > > 2009/4/21 Doron Cohen <cdor...@gmail.com>
> > > >
> > > > > Depending on the problem you are trying to solve there may be other
> > > > > solutions to it, not requiring setting wrong (?) values for term
> > > > > frequencies.
> > > > > If you can explain what you are trying to solve, people on the list may
> > > > > be able to suggest such alternatives.
> > > > > - Doron
> > > > >
> > > > > On Sun, Apr 19, 2009 at 2:39 PM, liat oren <oren.l...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to be able to set the term freq to different values at
> > > > > > index time, or at search time.
> > > > > >
> > > > > > So if a document has the following text: 1 2, the freq of 1 will get 100
> > > > > > and the freq of 2 will get 200. I want to avoid expanding it by writing 1
> > > > > > 100 times.
> > > > > >
> > > > > > I looked at the Similarity class and wanted to override it, but the tf
> > > > > > function gets only freq, so I don't know which term this freq relates to,
> > > > > > thus I can't change the value.
> > > > > >
> > > > > > Thanks,
> > > > > > Liat
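
One practical detail of the payload approach suggested above in this thread: the original question asks for frequencies such as 100 and 200, but a single signed byte holds at most 127, so Byte.parseByte("200") throws NumberFormatException. Below is a minimal sketch, assuming the frequency is packed into a 4-byte big-endian int instead. FreqPayloadCodec and its methods are hypothetical names, and this is plain Java with no Lucene dependency; decode simply takes the same (payload, offset, length) arguments as the scorePayload method quoted above.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: packs a frequency into payload bytes.
class FreqPayloadCodec {

    // Encode the frequency as 4 bytes (ByteBuffer is big-endian by default).
    static byte[] encode(int freq) {
        return ByteBuffer.allocate(4).putInt(freq).array();
    }

    // Decode from a (payload, offset, length) triple. Returning 1.0f when the
    // payload is missing or too short is an assumption made for this sketch.
    static float decode(byte[] payload, int offset, int length) {
        if (payload == null || length < 4) {
            return 1.0f;
        }
        return ByteBuffer.wrap(payload, offset, 4).getInt();
    }

    // True only if the text parses as a signed byte (-128..127).
    static boolean fitsInSignedByte(String text) {
        try {
            Byte.parseByte(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

For example, FreqPayloadCodec.encode(200) yields four bytes that decode back to 200, whereas fitsInSignedByte("200") is false. A one-byte payload like the one in WordsFilter would therefore fail on the larger frequencies from the original question.
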