MappingCharFilter definitely preserves the offsets from the original reader. You can verify that for your case with Lucene's test case TestMappingCharFilter, found in the source distribution at /src/test/org/apache/lucene/analysis/TestMappingCharFilter.java:

    public void test2to4() throws Exception {
      CharStream cs = new MappingCharFilter( normMap, new StringReader( "ll" ) );
      TokenStream ts = new WhitespaceTokenizer( cs );
      // expected term "llll", start offset 0, end offset 2 (offsets into the original "ll")
      assertTokenStreamContents(ts, new String[]{"llll"}, new int[]{0}, new int[]{2});
    }
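As a rough illustration of what "preserving the offsets" means, here is a self-contained sketch in plain Java (no Lucene on the classpath). It is not MappingCharFilter's actual implementation: the class OffsetMapSketch, its Mapped holder and the map() method are hypothetical names made up for this example, and the simple per-character offset table is an assumption chosen for readability. It uses the c++c++ input and the "c++" to "cplusplus$" mapping quoted further down in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustration only: a tiny stand-in for the offset bookkeeping that a
 * mapping char filter has to do. Not Lucene code; all names are hypothetical.
 */
final class OffsetMapSketch {

    /** Filtered text plus, for each position in it, the offset in the original text. */
    static final class Mapped {
        final String text;
        private final int[] origOffset; // one entry per position, plus one for end-of-text

        Mapped(String text, int[] origOffset) {
            this.text = text;
            this.origOffset = origOffset;
        }

        /** Translates an offset in the filtered text back to the original text. */
        int correctOffset(int filteredOffset) {
            return origOffset[filteredOffset];
        }
    }

    /** Replaces every occurrence of match and records where each output char came from. */
    static Mapped map(String input, String match, String replacement) {
        if (match.isEmpty()) {
            throw new IllegalArgumentException("match must not be empty");
        }
        StringBuilder out = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(match, i)) {
                for (int k = 0; k < replacement.length(); k++) {
                    out.append(replacement.charAt(k));
                    // Output chars beyond the match length are pinned to the end of the match.
                    offsets.add(i + Math.min(k, match.length()));
                }
                i += match.length();
            } else {
                out.append(input.charAt(i));
                offsets.add(i);
                i++;
            }
        }
        offsets.add(input.length()); // position just past the last char
        int[] table = new int[offsets.size()];
        for (int k = 0; k < table.length; k++) {
            table[k] = offsets.get(k);
        }
        return new Mapped(out.toString(), table);
    }

    public static void main(String[] args) {
        Mapped m = map("c++c++", "c++", "cplusplus$");
        System.out.println(m.text);
        // Token boundaries as a tokenizer would see them in the filtered text.
        int[][] tokens = { {0, 9}, {10, 19} };
        for (int[] t : tokens) {
            System.out.println("(" + m.correctOffset(t[0]) + "," + m.correctOffset(t[1]) + ")");
        }
    }
}
```

Run as-is, this prints the filtered text followed by (0,3) and (3,6), i.e. offsets that point into the stored string c++c++ rather than into the replaced text. With the input "ll" and a mapping from "ll" to "llll", the same sketch maps the token (0,4) back to (0,2), in line with what test2to4 asserts.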
So everything is correct on the MappingCharFilter side. I also tried this test with StandardTokenizer instead of WhitespaceTokenizer: it works and asserts the correct offsets. You should debug through the incrementToken()/CharFilter calls and find out where your offsets change. I cannot help more.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Weiwei Wang [mailto:ww.wang...@gmail.com]
> Sent: Sunday, December 13, 2009 12:51 PM
> To: java-user@lucene.apache.org
> Subject: Re: Recover special terms from StandardTokenizer
>
> LowercaseCharFilter is necessary because the MappingCharFilter needs a NormalizeCharMap. We lowercase the stream so that we only have to provide lowercase mappings in the NormalizeCharMap, e.g. (c++-->cplusplus) alone instead of both (c++-->cplusplus) and (C++-->cplusplus).
>
> C++ is only one example we want to fix; in the future we may add more such special terms.
>
> The code for LowercaseCharFilter is as follows:
>
> package analysis;
>
> import java.io.IOException;
> import java.io.Reader;
>
> import org.apache.lucene.analysis.BaseCharFilter;
> import org.apache.lucene.analysis.CharReader;
> import org.apache.lucene.analysis.CharStream;
>
> public class LowercaseCharFilter extends BaseCharFilter
> {
>     public LowercaseCharFilter(CharStream in)
>     {
>         super(in);
>     }
>
>     public LowercaseCharFilter(Reader in)
>     {
>         super(CharReader.get(in));
>     }
>
>     @Override
>     public int read() throws IOException
>     {
>         return Character.toLowerCase(input.read());
>     }
>
>     @Override
>     public int read(char[] cbuf, int off, int len) throws IOException {
>         int ret = input.read(cbuf, off, len);
>         if(ret!=-1)
>         {
>             for(int i=off; i<off+ret; i++)
>                 cbuf[i] = Character.toLowerCase(cbuf[i]);
>         }
>         return ret;
>     }
> }
>
> Currently RosaMappingCharFilter inherits from MappingCharFilter and changes nothing (I was planning to override addOffCorrectMap to fix my problem, but it didn't work).
>
> 2009/12/13 Uwe Schindler <u...@thetaphi.de>
>
> > I think your problem is the LowercaseCharFilter, which does not pass correctOffset() to the underlying CharFilter. Does it work better without your LowercaseCharFilter (which is redundant because there is already a LowerCaseFilter in the Tokenizer chain)?
> >
> > As you are only looking for "c++", just add a mapping for "C++" as well and you are done. Why lowercase everything because of one char?
> >
> > And what's RosaMappingCharFilter? A pink one? *g*
> >
> > > -----Original Message-----
> > > From: Weiwei Wang [mailto:ww.wang...@gmail.com]
> > > Sent: Sunday, December 13, 2009 12:23 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Recover special terms from StandardTokenizer
> > >
> > > Thanks, Uwe.
> > > Maybe I was not very clear. My situation is like this:
> > > Analyzer:
> > >     NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > >     RECOVERY_MAP.add("c++","cplusplus$");
> > >     CharFilter filter = new LowercaseCharFilter(reader);
> > >     filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
> > >     StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > >     tokenStream.setMaxTokenLength(maxTokenLength);
> > >     TokenStream result = new StandardFilter(tokenStream);
> > >     result = getStopFilter(result);
> > >     result = new SnowballFilter(result, STEMMER);
> > > Analyzing c++c++ returns
> > >     (0,9) [cplusplus]
> > >     (10,19) [cplusplus]
> > > The two numbers in the brackets are offsets.
> > >
> > > So in the search process, when I want to highlight the search keyword c++ with the same analyzer, an exception is thrown, because the string I stored is c++c++, not cpluspluscplusplus (I should not change the original strings when storing them; otherwise it would confuse the users).
> > >
> > > I hope the analyzer can give a result like this:
> > >     (0,3) [cplusplus]
> > >     (3,6) [cplusplus]
> > > Then the Highlighter will work fine.
> > >
> > > So how can I achieve this result?
> > >
> > > 2009/12/13 Uwe Schindler <u...@thetaphi.de>
> > >
> > > > MappingCharFilter preserves the offsets in the stream *before* filtering. So if you store the original string (without c++ replaced) in a stored field, you can highlight using the given offsets. The highlighter must again use the same analyzer, or you can use FastVectorHighlighter.
> > > >
> > > > > -----Original Message-----
> > > > > From: Weiwei Wang [mailto:ww.wang...@gmail.com]
> > > > > Sent: Sunday, December 13, 2009 11:43 AM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Re: Recover special terms from StandardTokenizer
> > > > >
> > > > > Problem solved. Now another problem comes up.
> > > > >
> > > > > As I want to use Highlighter in my system, the token offset is incorrect after the MappingCharFilter is used.
> > > > >
> > > > > Koji, do you know how to fix the offset problem?
> > > > >
> > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > > > >
> > > > > > I used Luke to check the result and found that only c exists as a term; no cplusplus is found in the index.
> > > > > >
> > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > > > > >
> > > > > >> Thanks, Koji. I followed your advice and changed my analyzer as shown below:
> > > > > >>     NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > > > >>     RECOVERY_MAP.add("c++","cplusplus$");
> > > > > >>     CharFilter filter = new LowercaseCharFilter(reader);
> > > > > >>     filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > > > > >>     StandardTokenizer tokenStream = new StandardTokenizer(Version.LUCENE_30, filter);
> > > > > >>     tokenStream.setMaxTokenLength(maxTokenLength);
> > > > > >>     TokenStream result = new StandardFilter(tokenStream);
> > > > > >>     result = new LowerCaseFilter(result);
> > > > > >>     result = new StopFilter(enableStopPositionIncrements, result, stopSet);
> > > > > >>     result = new SnowballFilter(result, STEMMER);
> > > > > >>
> > > > > >> I use the same analyzer on the search side. As you know, this analyzer tokenizes c++ as cplusplus, so it seems I should be able to search for c++ with the same analyzer, because the query is also tokenized as cplusplus.
> > > > > >>
> > > > > >> I tested it on the string c++c++; however, when I search for c++ on the built index, nothing is returned.
> > > > > >>
> > > > > >> I do not know what's wrong with my code. Waiting for your reply.
> > > > > >>
> > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > > > > >>
> > > > > >>> Thanks, Koji
> > > > > >>>
> > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi <k...@r.email.ne.jp> wrote:
> > > > > >>>
> > > > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
> > > > > >>>>
> > > > > >>>> Koji
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> http://www.rondhuit.com/en/
> > > > > >>>>
> > > > > >>>> Anshum wrote:
> > > > > >>>>
> > > > > >>>>> How about getting the original token stream and then converting c++ to cplusplus, or any other such transform? Or perhaps you might look at using/extending (in the non-Java sense) some other tokenizer!
> > > > > >>>>>
> > > > > >>>>> --
> > > > > >>>>> Anshum Gupta
> > > > > >>>>> Naukri Labs!
> > > > > >>>>> http://ai-cafe.blogspot.com
> > > > > >>>>>
> > > > > >>>>> The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw............
> > > > > >>>>>
> > > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
> > > > > >>>>>
> > > > > >>>>>> Hi, all,
> > > > > >>>>>> I designed an FTP search engine based on Lucene. I made a few modifications to the StandardTokenizer.
> > > > > >>>>>> My problem is:
> > > > > >>>>>> C++ is tokenized as c by StandardTokenizer, and I want to recover it from the TokenStream produced by StandardTokenizer.
> > > > > >>>>>>
> > > > > >>>>>> What should I do?
> > > > > >>>>>>
> > > > > >>>>>> --
> > > > > >>>>>> Weiwei Wang
> > > > > >>>>>> Alex Wang
> > > > > >>>>>> 王巍巍
> > > > > >>>>>> Room 403, Mengmin Wei Building
> > > > > >>>>>> Computer Science Department
> > > > > >>>>>> Gulou Campus of Nanjing University
> > > > > >>>>>> Nanjing, P.R.China, 210093
> > > > > >>>>>>
> > > > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > >>>>
> > > > > >>>> ---------------------------------------------------------------------
> > > > > >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org