Thanks, Uwe. Maybe I was not very clear. My situation is like this:

Analyzer:

    NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
    RECOVERY_MAP.add("c++", "cplusplus$");
    CharFilter filter = new LowercaseCharFilter(reader);
    filter = new RosaMappingCharFilter(RECOVERY_MAP, filter);
    StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = getStopFilter(result);
    result = new SnowballFilter(result, STEMMER);

Analyzing c++c++ returns:

    (0,9) [cplusplus]
    (10,19) [cplusplus]

The two numbers in parentheses are each token's start and end offsets.
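To make the offset question concrete, here is a minimal, Lucene-free sketch of the bookkeeping a mapping char filter needs in order to report offsets against the original text. Everything in it is hypothetical and for illustration only: the class `OffsetMappingSketch`, its methods, and the letter-run "tokenizer" are not part of Lucene and are not the code of MappingCharFilter or RosaMappingCharFilter. It only shows the idea of recording, for every position in the filtered text, which position in the original text it came from.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for a mapping char filter. This is NOT Lucene code;
 * it only illustrates the bookkeeping needed to correct offsets.
 */
final class OffsetMappingSketch {
    /** Text after replacement, i.e. what the tokenizer would see. */
    final String filtered;
    /** originalOffset[i] is the offset in the original text for filtered offset i. */
    private final int[] originalOffset;

    OffsetMappingSketch(String original, String from, String to) {
        if (from.isEmpty()) {
            throw new IllegalArgumentException("from must not be empty");
        }
        StringBuilder out = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        int i = 0;
        while (i < original.length()) {
            if (original.startsWith(from, i)) {
                // Every char of the replacement maps into the matched span;
                // chars beyond the match length clamp to the end of the match.
                for (int k = 0; k < to.length(); k++) {
                    map.add(i + Math.min(k, from.length()));
                }
                out.append(to);
                i += from.length();
            } else {
                map.add(i);
                out.append(original.charAt(i));
                i++;
            }
        }
        map.add(original.length()); // offset just past the last char
        this.filtered = out.toString();
        this.originalOffset = new int[map.size()];
        for (int j = 0; j < originalOffset.length; j++) {
            originalOffset[j] = map.get(j);
        }
    }

    /** Maps an offset in the filtered text back to the original text. */
    int correctOffset(int filteredOffset) {
        return originalOffset[filteredOffset];
    }

    /** Splits the filtered text on non-letters and reports corrected offsets. */
    List<String> tokensWithOriginalOffsets() {
        List<String> result = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= filtered.length(); i++) {
            boolean letter = i < filtered.length() && Character.isLetter(filtered.charAt(i));
            if (letter && start < 0) {
                start = i;
            } else if (!letter && start >= 0) {
                result.add("(" + correctOffset(start) + "," + correctOffset(i) + ") ["
                        + filtered.substring(start, i) + "]");
                start = -1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        OffsetMappingSketch sketch = new OffsetMappingSketch("c++c++", "c++", "cplusplus$");
        // Prints:
        // (0,3) [cplusplus]
        // (3,6) [cplusplus]
        for (String token : sketch.tokensWithOriginalOffsets()) {
            System.out.println(token);
        }
    }
}
```

Without the correction step, the same two tokens would carry the filtered-text offsets (0,9) and (10,19). I believe Lucene's BaseCharFilter keeps a comparable table (addOffCorrectMap) and exposes it through correctOffset, but that is from memory, so please check the javadoc for your version. If that is right, every char filter in the chain, including custom ones, would have to pass corrections through for the tokenizer to see original offsets.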
So during search, when I want to highlight the keyword c++ using the same analyzer, an exception is thrown, because the string I stored is c++c++, not cpluspluscplusplus. (I should not change the original string when storing it; otherwise it would confuse users.)

I hope the analyzer can give a result like this:

    (0,3) [cplusplus]
    (3,6) [cplusplus]

Then the Highlighter will work fine. How can I achieve this result?

2009/12/13 Uwe Schindler <u...@thetaphi.de>

> MappingCharFilter preserves the offsets in the stream *before* filtering. So
> if you store the original string (without c++ replaced) in a stored field,
> you can highlight using the given offsets. The highlighter must again use
> the same analyzer, or use FastVectorHighlighter.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Weiwei Wang [mailto:ww.wang...@gmail.com]
> > Sent: Sunday, December 13, 2009 11:43 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Recover special terms from StandardTokenizer
> >
> > Problem solved. Now another problem comes.
> >
> > As I want to use Highlighter in my system, the token offset is incorrect
> > after the MappingCharFilter is used.
> >
> > Koji, do you know how to fix the offset problem?
> >
> > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> >
> > > I used Luke to check the result and found that only c exists as a term;
> > > no cplusplus is found in the index.
> > >
> > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > >
> > >> Thanks, Koji, I followed your advice and changed my analyzer as shown
> > >> below:
> > >>
> > >>     NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > >>     RECOVERY_MAP.add("c++", "cplusplus$");
> > >>     CharFilter filter = new LowercaseCharFilter(reader);
> > >>     filter = new MappingCharFilter(RECOVERY_MAP, filter);
> > >>     StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > >>     tokenStream.setMaxTokenLength(maxTokenLength);
> > >>     TokenStream result = new StandardFilter(tokenStream);
> > >>     result = new LowerCaseFilter(result);
> > >>     result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > >>     result = new SnowballFilter(result, STEMMER);
> > >>
> > >> I use the same analyzer on the search side. As you know, this analyzer
> > >> tokenizes c++ as cplusplus, so it seems I should be able to search for
> > >> c++ with the same analyzer, because the query is also tokenized as
> > >> cplusplus.
> > >>
> > >> I tested it on the string c++c++; however, when I search for c++ on the
> > >> built index, nothing is returned.
> > >>
> > >> I do not know what's wrong with my code. Waiting for your reply.
> > >>
> > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > >>
> > >>> Thanks, Koji
> > >>>
> > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
> > >>>
> > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > >>>>
> > >>>> Koji
> > >>>>
> > >>>> --
> > >>>> http://www.rondhuit.com/en/
> > >>>>
> > >>>> Anshum wrote:
> > >>>>
> > >>>>> How about getting the original token stream and then converting c++ to
> > >>>>> cplusplus, or any other such transform? Or perhaps you might look at
> > >>>>> using/extending (in the non-Java sense) some other tokenizer!
> > >>>>>
> > >>>>> --
> > >>>>> Anshum Gupta
> > >>>>> Naukri Labs!
> > >>>>> http://ai-cafe.blogspot.com
> > >>>>>
> > >>>>> The facts expressed here belong to everybody, the opinions to me. The
> > >>>>> distinction is yours to draw............
> > >>>>>
> > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi, all,
> > >>>>>> I designed an FTP search engine based on Lucene. I made a few
> > >>>>>> modifications to the StandardTokenizer.
> > >>>>>> My problem is:
> > >>>>>> C++ is tokenized as c by StandardTokenizer, and I want to recover it
> > >>>>>> from the TokenStream produced by StandardTokenizer.
> > >>>>>>
> > >>>>>> What should I do?
> > >>>>>>
> > >>>>>> --
> > >>>>>> Weiwei Wang
> > >>>>>> Alex Wang
> > >>>>>> 王巍巍
> > >>>>>> Room 403, Mengmin Wei Building
> > >>>>>> Computer Science Department
> > >>>>>> Gulou Campus of Nanjing University
> > >>>>>> Nanjing, P.R.China, 210093
> > >>>>>>
> > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > >>>>
> > >>>> ---------------------------------------------------------------------
> > >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang