I think the lucene WhitespaceAnalyzer I am using inside Solr's SynonymFilter is the one that prevents multi-word synonyms like "New York" from getting mapped to the generic synonym name like CONCEPTYcity. It appears to me that an analyzer which recognizes that a white-space is inside a synonym like "New York" will be required. Do I need to implement one like this or is there already an analyzer I can use? Looks like I am missing something here, since Solr's SynonymFilter is supposed to handle this. Can someone tell me what is the correct way to integrate Solr's SynonymFilter within a custom lucene analyzer? Thanks.
On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan <arunrangara...@gmail.com>wrote: > I am trying to have multi-word synonyms work in lucene using Solr's * > SynonymFilter*. > > I need to match synonyms at index time, since many of the synonym lists are > huge. Actually they are really not synonyms, but are words that belong to a > concept. For example, I would like to map {"New York", "Los Angeles", "New > Orleans", "Salt Lake City"...}, a bunch of city names, to the concept called > "city". While searching, the user query for the concept "city" will be > translated to a keyword like, say "CONCEPTcity", which is the synonym for > any city name. > > Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. 131), > all I could match for "CONCEPTcity" is single word city names like > "Chicago", "Seattle", "Boston", etc., It would not match multi-word city > names like "New York", "Los Angeles", etc., > > I tried using Solr's SynonymFilter in tokenStream method in a custom > Analyzer (that extends org.apache.lucene.analysis. > Analyzer - lucene ver. 2.9.3) using: > > * public TokenStream tokenStream(String fieldName, Reader reader) { > TokenStream result = new SynonymFilter( > new WhitespaceTokenizer(reader), > synonymMap); > return result; > } > * > where *synonymMap* is loaded with synonyms using > > *synonymMap.add(conceptTerms, listOfTokens, true, true);* > > where *conceptTerms* is of type *ArrayList<String>* of all the terms in a > concept and *listofTokens* is of type *List<Token> *and contains only the > generic synonym identifier like *CONCEPTcity*. > > When I print synonymMap using synonymMap.toString(), I get the output like > > <{New York=<{Chicago=<{Seattle=<{New > Orleans=....<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}> > > so it looks like all the synonyms are loaded. But if I search for > "CONCEPTcity" then it says no matches found. I am not sure whether I have > loaded the synonyms correctly in the synonymMap. > > Any help will be deeply appreciated. Thanks! >