LowercaseCharFilter is necessary, as in the MappingCharFilter we need to provide a NormalizeCharMap. We lowercase the stream so as we only provide lowercase maps in the NormalizeCharMap, e.g. we provide map (c++-->cplusplus) instead of (c++-->cplusplus) and (C++-->cplusplus).
C++ is only an example we want to fix, in the future we may add more such special terms the code for LowercaseCharFilter is as follows: package analysis; import java.io.IOException; import java.io.Reader; import org.apache.lucene.analysis.BaseCharFilter; import org.apache.lucene.analysis.CharReader; import org.apache.lucene.analysis.CharStream; public class LowercaseCharFilter extends BaseCharFilter { public LowercaseCharFilter(CharStream in) { super(in); } public LowercaseCharFilter(Reader in) { super(CharReader.get(in)); } @Override public int read() throws IOException { return Character.toLowerCase(input.read()); } @Override public int read(char[] cbuf, int off, int len) throws IOException { int ret = input.read(cbuf, off, len); if(ret!=-1) { for(int i=off; i<off+ret; i++) cbuf[i] = Character.toLowerCase(cbuf[i]); } return ret; } } Currently RosaMappingCharFilter is inherited from MappingCharFilter and nothing is changed(i was planning to override addOffCorrectMap to fix my problem, but it didn't work) 2009/12/13 Uwe Schindler <u...@thetaphi.de> > I think your problem is theLowercaseCharFilter that does not pass > correctOffset() to the underying CharFilter. Does it work better without > your LowerCaseCharFilter (which is duplicate because there is already a > LowerCaseFilter in the Tokenizer chain). > > As you are only looking for "c++", just also add a mapping for "C++" and > you > are done, why lowercasing all because of one char? > > And what's RosaMappingCharFilter? A pink one? *g* > > ----- > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -----Original Message----- > > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > > Sent: Sunday, December 13, 2009 12:23 PM > > To: java-user@lucene.apache.org > > Subject: Re: Recover special terms from StandardTokenizer > > > > thanks, Uwe. > > Maybe i was not very clear. My situation is like this: > > Analyzer: > > NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap(); > > RECOVERY_MAP.add("c++","cplusplus$"); > > CharFilter filter = new LowercaseCharFilter(reader); > > filter = new RosaMappingCharFilter(RECOVERY_MAP,filter); > > StandardTokenizer tokenStream = new > > StandardTokenizer(Version.LUCENE_30, > > filter); > > tokenStream.setMaxTokenLength(maxTokenLength); > > TokenStream result = new StandardFilter(tokenStream); > > result = getStopFilter(result); > > result = new SnowballFilter(result, STEMMER); > > Analyze c++c++, return > > (0,9) [cplusplus] > > (10,19) [cplusplus] > > the two numbers in th**e brackets are offsets. > > > > So in the searching process when i want to hight the search keyword c++ > > with > > the same analyzer, exception will be thrown because the string i stored > > are > > c++c++ not cpluspluscplusplus(actually, i should not change the original > > string when storing them, otherwise it will confuse the users). > > > > I hope the analyzer can give result like this > > (0,3) [cplusplus] > > (3,6) [cplusplus] > > then the Hilighter will works fine. > > > > So how can I achieve this result? > > > > 2009/12/13 Uwe Schindler <u...@thetaphi.de> > > > > > MappingCharFilter preserves the offsets in the stream *before* > > filtering. > > > So > > > if you store the original string (without c++ replaced) in a stored > > field > > > you can highlight using the given offstes. The highlighter must use > > again > > > the same analyzer or use FastVectorHighlighter. > > > > > > ----- > > > Uwe Schindler > > > H.-H.-Meier-Allee 63, D-28213 Bremen > > > http://www.thetaphi.de > > > eMail: u...@thetaphi.de > > > > > > > -----Original Message----- > > > > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > > > > Sent: Sunday, December 13, 2009 11:43 AM > > > > To: java-user@lucene.apache.org > > > > Subject: Re: Recover special terms from StandardTokenizer > > > > > > > > Problem solved. Now another problem comes. > > > > > > > > > > > > As I want to use Highlighter in my system, the token offset is > > incorrect > > > > after the MappingCharFilter is used. > > > > > > > > Koji, do you known how to fix the offset problem? > > > > > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang...@gmail.com> > > > > wrote: > > > > > > > > > I use Luke to check the result and find only c exists as a term, no > > > > > cplusplus found in the index > > > > > > > > > > > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang > > > > <ww.wang...@gmail.com>wrote: > > > > > > > > > >> Thanks, Koji, I followed your advice and change my analyzer as > > shown > > > > >> below: > > > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap(); > > > > >> RECOVERY_MAP.add("c++","cplusplus$"); > > > > >> CharFilter filter = new LowercaseCharFilter(reader); > > > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter); > > > > >> StandardTokenizer tokenStream = new > > > > StandardTokenizer(Version.LUCENE_30, > > > > >> filter); > > > > >> tokenStream.setMaxTokenLength(maxTokenLength); > > > > >> TokenStream result = new StandardFilter(tokenStream); > > > > >> result = new LowerCaseFilter(result); > > > > >> result = new StopFilter(enableStopPositionIncrements, result, > > > stopSet); > > > > >> result = new SnowballFilter(result, STEMMER); > > > > >> > > > > >> I use the same analyzer in the search side. As you know, this > > analyzer > > > > can > > > > >> token c++ as cplusplus, for this reason, it seems I can search c++ > > > with > > > > >> the same analyzer because it is also tokenized as cplusplus. > > > > >> > > > > >> I tested it on as string c++c++, however, when i search c++ on the > > > > built > > > > >> index, nothing is returned. > > > > >> > > > > >> I do not know what's wrong with my code. Waiting for your replay > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang > > > > <ww.wang...@gmail.com>wrote: > > > > >> > > > > >>> Thanks, Koji > > > > >>> > > > > >>> > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi > > > > <k...@r.email.ne.jp>wrote: > > > > >>> > > > > >>>> MappingCharFilter can be used to convert c++ to cplusplus. > > > > >>>> > > > > >>>> Koji > > > > >>>> > > > > >>>> -- > > > > >>>> http://www.rondhuit.com/en/ > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> Anshum wrote: > > > > >>>> > > > > >>>>> How about getting the original token stream and then converting > > c++ > > > > to > > > > >>>>> cplusplus or anyother such transform. Or perhaps you might look > > at > > > > >>>>> using/extending(in the non java sense) some other tokenized! > > > > >>>>> > > > > >>>>> -- > > > > >>>>> Anshum Gupta > > > > >>>>> Naukri Labs! > > > > >>>>> http://ai-cafe.blogspot.com > > > > >>>>> > > > > >>>>> The facts expressed here belong to everybody, the opinions to > > me. > > > > The > > > > >>>>> distinction is yours to draw............ > > > > >>>>> > > > > >>>>> > > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang < > > > ww.wang...@gmail.com> > > > > >>>>> wrote: > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>>> Hi, all, > > > > >>>>>> I designed a ftp search engine based on Lucene. I did a few > > > > >>>>>> modifications to the StandardTokenizer. > > > > >>>>>> My problem is: > > > > >>>>>> C++ is tokenized as c from StandardTokenizer and I want to > > > recover > > > > it > > > > >>>>>> from > > > > >>>>>> the TokenStream from StandardTokenizer > > > > >>>>>> > > > > >>>>>> What should I do? > > > > >>>>>> > > > > >>>>>> -- > > > > >>>>>> Weiwei Wang > > > > >>>>>> Alex Wang > > > > >>>>>> 王巍巍 > > > > >>>>>> Room 403, Mengmin Wei Building > > > > >>>>>> Computer Science Department > > > > >>>>>> Gulou Campus of Nanjing University > > > > >>>>>> Nanjing, P.R.China, 210093 > > > > >>>>>> > > > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang > > > > >>>>>> > > > > >>>>>> > > > > >>>>>> > > > > >>>>> > > > > >>>>> > > > > >>>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> > > > --------------------------------------------------------------------- > > > > >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > > >>>> For additional commands, e-mail: > java-user-h...@lucene.apache.org > > > > >>>> > > > > >>>> > > > > >>> > > > > >>> > > > > >>> -- > > > > >>> Weiwei Wang > > > > >>> Alex Wang > > > > >>> 王巍巍 > > > > >>> Room 403, Mengmin Wei Building > > > > >>> Computer Science Department > > > > >>> Gulou Campus of Nanjing University > > > > >>> Nanjing, P.R.China, 210093 > > > > >>> > > > > >>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang > > > > >>> > > > > >> > > > > >> > > > > >> > > > > >> -- > > > > >> Weiwei Wang > > > > >> Alex Wang > > > > >> 王巍巍 > > > > >> Room 403, Mengmin Wei Building > > > > >> Computer Science Department > > > > >> Gulou Campus of Nanjing University > > > > >> Nanjing, P.R.China, 210093 > > > > >> > > > > >> Homepage: http://cs.nju.edu.cn/rl/weiweiwang > > > > >> > > > > > > > > > > > > > > > > > > > > -- > > > > > Weiwei Wang > > > > > Alex Wang > > > > > 王巍巍 > > > > > Room 403, Mengmin Wei Building > > > > > Computer Science Department > > > > > Gulou Campus of Nanjing University > > > > > Nanjing, P.R.China, 210093 > > > > > > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang > > > > > > > > > > > > > > > > > > > > > -- > > > > Weiwei Wang > > > > Alex Wang > > > > 王巍巍 > > > > Room 403, Mengmin Wei Building > > > > Computer Science Department > > > > Gulou Campus of Nanjing University > > > > Nanjing, P.R.China, 210093 > > > > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > > > > > -- > > Weiwei Wang > > Alex Wang > > 王巍巍 > > Room 403, Mengmin Wei Building > > Computer Science Department > > Gulou Campus of Nanjing University > > Nanjing, P.R.China, 210093 > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Weiwei Wang Alex Wang 王巍巍 Room 403, Mengmin Wei Building Computer Science Department Gulou Campus of Nanjing University Nanjing, P.R.China, 210093 Homepage: http://cs.nju.edu.cn/rl/weiweiwang