Thanks, Lance. After exploring for a while, I used lucene's ShingleFilter followed by the SynonymFilter in Lucene in Action book. Then using the type attribute, I removed all the shingles which did not belong to any category.
On Wed, Aug 18, 2010 at 10:28 PM, Lance Norskog <goks...@gmail.com> wrote: > Yes, you need an analyzer that leaves successive words together as one > long term. This might be easier to do with the new CharFilter tool, > which processes text before it goes to the tokenizer. > > What you are doing here is similar to Parts-Of-Speech analysis, where > text analysis software parses a sentence and labels words 'Noun', > 'Verb', etc. One suite stores these labels as payloads on the terms. > This might be a better way to store your categories, rather than using > the synonym filter. > > On Wed, Aug 18, 2010 at 9:55 PM, Arun Rangarajan > <arunrangara...@gmail.com> wrote: > > I think the lucene WhitespaceAnalyzer I am using inside Solr's > SynonymFilter > > is the one that prevents multi-word synonyms like "New York" from getting > > mapped to the generic synonym name like CONCEPTYcity. It appears to me > that > > an analyzer which recognizes that a white-space is inside a synonym like > > "New York" will be required. Do I need to implement one like this or is > > there already an analyzer I can use? Looks like I am missing something > here, > > since Solr's SynonymFilter is supposed to handle this. Can someone tell > me > > what is the correct way to integrate Solr's SynonymFilter within a custom > > lucene analyzer? Thanks. > > > > > > On Tue, Aug 17, 2010 at 4:44 PM, Arun Rangarajan > > <arunrangara...@gmail.com>wrote: > > > >> I am trying to have multi-word synonyms work in lucene using Solr's * > >> SynonymFilter*. > >> > >> I need to match synonyms at index time, since many of the synonym lists > are > >> huge. Actually they are really not synonyms, but are words that belong > to a > >> concept. For example, I would like to map {"New York", "Los Angeles", > "New > >> Orleans", "Salt Lake City"...}, a bunch of city names, to the concept > called > >> "city". While searching, the user query for the concept "city" will be > >> translated to a keyword like, say "CONCEPTcity", which is the synonym > for > >> any city name. > >> > >> Using lucene's SynonymAnalyzer, as explained in Lucene in Action (p. > 131), > >> all I could match for "CONCEPTcity" is single word city names like > >> "Chicago", "Seattle", "Boston", etc., It would not match multi-word city > >> names like "New York", "Los Angeles", etc., > >> > >> I tried using Solr's SynonymFilter in tokenStream method in a custom > >> Analyzer (that extends org.apache.lucene.analysis. > >> Analyzer - lucene ver. 2.9.3) using: > >> > >> * public TokenStream tokenStream(String fieldName, Reader reader) { > >> TokenStream result = new SynonymFilter( > >> new WhitespaceTokenizer(reader), > >> synonymMap); > >> return result; > >> } > >> * > >> where *synonymMap* is loaded with synonyms using > >> > >> *synonymMap.add(conceptTerms, listOfTokens, true, true);* > >> > >> where *conceptTerms* is of type *ArrayList<String>* of all the terms in > a > >> concept and *listofTokens* is of type *List<Token> *and contains only > the > >> generic synonym identifier like *CONCEPTcity*. > >> > >> When I print synonymMap using synonymMap.toString(), I get the output > like > >> > >> <{New York=<{Chicago=<{Seattle=<{New > >> Orleans=....<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>....}> > >> > >> so it looks like all the synonyms are loaded. But if I search for > >> "CONCEPTcity" then it says no matches found. I am not sure whether I > have > >> loaded the synonyms correctly in the synonymMap. > >> > >> Any help will be deeply appreciated. Thanks! > >> > > > > > > -- > Lance Norskog > goks...@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >