I think your problem is the LowercaseCharFilter, which does not pass correctOffset() to the underlying CharFilter. Does it work better without your LowercaseCharFilter? (It is redundant anyway, because there is already a LowerCaseFilter in the Tokenizer chain.)
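
To make the delegation point concrete, here is a plain-Java model of the idea. It is not Lucene code and not the LowercaseCharFilter from this thread; it only assumes that each filter should undo its own offset shift and then hand the result to the stream it wraps. The names OffsetChainSketch, OffsetSource, raw, lowercase and mapping are made up for this illustration, and the mapping target is "cplusplus" without the trailing "$" to keep the numbers simple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of offset correction through stacked char filters (not Lucene code). */
public class OffsetChainSketch {

    /** Hypothetical stand-in for a char stream that can map offsets back to the original text. */
    public interface OffsetSource {
        String text();               // the (possibly filtered) characters
        int correctOffset(int off);  // offset in text() -> offset in the original input
    }

    /** The unfiltered input: its offsets are already original offsets. */
    public static OffsetSource raw(final String s) {
        return new OffsetSource() {
            public String text() { return s; }
            public int correctOffset(int off) { return off; }
        };
    }

    /** Length-preserving filter: nothing to correct itself, but it must still delegate. */
    public static OffsetSource lowercase(final OffsetSource in) {
        final String out = in.text().toLowerCase(Locale.ROOT);
        return new OffsetSource() {
            public String text() { return out; }
            public int correctOffset(int off) { return in.correctOffset(off); }
        };
    }

    /** Replaces each occurrence of from with to and records where every output offset came from. */
    public static OffsetSource mapping(final OffsetSource in, String from, String to) {
        if (from.isEmpty()) throw new IllegalArgumentException("from must not be empty");
        String src = in.text();
        StringBuilder out = new StringBuilder();
        final List<Integer> origin = new ArrayList<Integer>(); // origin.get(outOff) = offset in src
        int i = 0;
        while (i < src.length()) {
            if (src.startsWith(from, i)) {
                for (int k = 0; k < to.length(); k++) {
                    // offsets inside the replacement are clamped to the matched span
                    origin.add(i + Math.min(k, from.length()));
                }
                out.append(to);
                i += from.length();
            } else {
                origin.add(i);
                out.append(src.charAt(i));
                i++;
            }
        }
        origin.add(src.length()); // offset just past the last character
        final String text = out.toString();
        return new OffsetSource() {
            public String text() { return text; }
            // first undo this filter's own shift, then let the wrapped stream undo its shift
            public int correctOffset(int off) { return in.correctOffset(origin.get(off)); }
        };
    }

    public static void main(String[] args) {
        OffsetSource s = mapping(lowercase(raw("C++c++")), "c++", "cplusplus");
        System.out.println(s.text());
        System.out.println("(" + s.correctOffset(0) + "," + s.correctOffset(9) + ")");
        System.out.println("(" + s.correctOffset(9) + "," + s.correctOffset(18) + ")");
    }
}
```

Running main prints cpluspluscplusplus, then (0,3) and (3,6). In this model, if lowercase() returned off instead of calling in.correctOffset(off), the corrections of any length-changing filter stacked beneath it would be silently dropped.
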
As you are only looking for "c++", just add a second mapping for "C++" and you are done. Why lowercase everything because of one character? And what's RosaMappingCharFilter? A pink one? *g*

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang...@gmail.com]
> Sent: Sunday, December 13, 2009 12:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
>
> Thanks, Uwe.
> Maybe I was not very clear. My situation is like this:
> Analyzer:
>   NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>   RECOVERY_MAP.add("c++","cplusplus$");
>   CharFilter filter = new LowercaseCharFilter(reader);
>   filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
>   StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
>   tokenStream.setMaxTokenLength(maxTokenLength);
>   TokenStream result = new StandardFilter(tokenStream);
>   result = getStopFilter(result);
>   result = new SnowballFilter(result, STEMMER);
> Analyzing c++c++ returns:
> (0,9) [cplusplus]
> (10,19) [cplusplus]
> The two numbers in the brackets are offsets.
>
> So in the search process, when I want to highlight the search keyword c++
> with the same analyzer, an exception is thrown, because the string I stored
> is c++c++, not cpluspluscplusplus (I should not change the original
> string when storing it, otherwise it will confuse the users).
>
> I hope the analyzer can give a result like this:
> (0,3) [cplusplus]
> (3,6) [cplusplus]
> Then the Highlighter will work fine.
>
> So how can I achieve this result?
>
> 2009/12/13 Uwe Schindler <u...@thetaphi.de>
>
> > MappingCharFilter preserves the offsets in the stream *before* filtering.
> > So if you store the original string (without c++ replaced) in a stored field,
> > you can highlight using the given offsets.
> > The highlighter must again use the same analyzer, or use FastVectorHighlighter.
> >
> > > -----Original Message-----
> > > From: Weiwei Wang [mailto:ww.wang...@gmail.com]
> > > Sent: Sunday, December 13, 2009 11:43 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Recover special terms from StandardTokenizer
> > >
> > > Problem solved. Now another problem comes up.
> > >
> > > As I want to use Highlighter in my system, the token offsets are
> > > incorrect after the MappingCharFilter is used.
> > >
> > > Koji, do you know how to fix the offset problem?
> > >
> > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > >
> > > > I used Luke to check the result and found that only c exists as a term;
> > > > no cplusplus is found in the index.
> > > >
> > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > > >
> > > >> Thanks, Koji. I followed your advice and changed my analyzer as shown
> > > >> below:
> > > >>   NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > >>   RECOVERY_MAP.add("c++","cplusplus$");
> > > >>   CharFilter filter = new LowercaseCharFilter(reader);
> > > >>   filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > > >>   StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > > >>   tokenStream.setMaxTokenLength(maxTokenLength);
> > > >>   TokenStream result = new StandardFilter(tokenStream);
> > > >>   result = new LowerCaseFilter(result);
> > > >>   result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > > >>   result = new SnowballFilter(result, STEMMER);
> > > >>
> > > >> I use the same analyzer on the search side.
> > > >> As you know, this analyzer tokenizes c++ as cplusplus, so it seems
> > > >> I should be able to search for c++ with the same analyzer, because
> > > >> the query is also tokenized as cplusplus.
> > > >>
> > > >> I tested it on the string c++c++. However, when I search for c++ on the
> > > >> built index, nothing is returned.
> > > >>
> > > >> I do not know what's wrong with my code. Waiting for your reply.
> > > >>
> > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > > >>
> > > >>> Thanks, Koji
> > > >>>
> > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
> > > >>>
> > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > > >>>>
> > > >>>> Koji
> > > >>>>
> > > >>>> --
> > > >>>> http://www.rondhuit.com/en/
> > > >>>>
> > > >>>> Anshum wrote:
> > > >>>>
> > > >>>>> How about getting the original token stream and then converting c++
> > > >>>>> to cplusplus or any other such transform? Or perhaps you might look at
> > > >>>>> using/extending (in the non-Java sense) some other tokenizer!
> > > >>>>>
> > > >>>>> --
> > > >>>>> Anshum Gupta
> > > >>>>> Naukri Labs!
> > > >>>>> http://ai-cafe.blogspot.com
> > > >>>>>
> > > >>>>> The facts expressed here belong to everybody, the opinions to me.
> > > >>>>> The distinction is yours to draw............
> > > >>>>>
> > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>> Hi, all,
> > > >>>>>> I designed an FTP search engine based on Lucene. I made a few
> > > >>>>>> modifications to the StandardTokenizer.
> > > >>>>>> My problem is:
> > > >>>>>> C++ is tokenized as c by StandardTokenizer, and I want to recover it
> > > >>>>>> from the TokenStream produced by StandardTokenizer.
> > > >>>>>>
> > > >>>>>> What should I do?
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Weiwei Wang
> > > >>>>>> Alex Wang
> > > >>>>>> 王巍巍
> > > >>>>>> Room 403, Mengmin Wei Building
> > > >>>>>> Computer Science Department
> > > >>>>>> Gulou Campus of Nanjing University
> > > >>>>>> Nanjing, P.R.China, 210093
> > > >>>>>>
> > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > >>>>
> > > >>>> ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org