Another problem how to deal with c++c++ or c++abc using MappingCharFilter
i use a NormalizeMap("c++","cplusplus"), the analyzed result will be cpluspluscplusplus or cplusplusabc wich is not what i want if i use a NormalizeMap("c++","cplusplus$"), the offset will not be correct. On Sun, Dec 13, 2009 at 9:38 PM, Weiwei Wang <ww.wang...@gmail.com> wrote: > Thank you very much, Uwe, I found the problem. > > > 2009/12/13 Uwe Schindler <u...@thetaphi.de> > >> MappingCharFilter definitely preserves the offsets from the original >> reader. >> Yo can verify that for your case with Lucene’s testcase >> TestMappingCharFilter in the source distribution @ >> /src/test/org/apache/lucene/analysis/TestMappingCharFilter.java: >> >> public void test2to4() throws Exception { >> CharStream cs = new MappingCharFilter( normMap, new StringReader( "ll" ) >> ); >> TokenStream ts = new WhitespaceTokenizer( cs ); >> assertTokenStreamContents(ts, new String[]{"llll"}, new int[]{0}, new >> int[] >> {2}); >> } >> >> So there is everything correct. I tried this test also with >> StandrdTokenizer >> instead of WhiteSpaceTokenizer - it works and asserts the correct offsets. >> You should debug through the incrementToken()/CharFilter calls and verify >> where your offsets change. >> >> I cannot help more. >> >> Uwe >> >> ----- >> Uwe Schindler >> H.-H.-Meier-Allee 63, D-28213 Bremen >> http://www.thetaphi.de >> eMail: u...@thetaphi.de >> >> > -----Original Message----- >> > From: Weiwei Wang [mailto:ww.wang...@gmail.com] >> > Sent: Sunday, December 13, 2009 12:51 PM >> > To: java-user@lucene.apache.org >> > Subject: Re: Recover special terms from StandardTokenizer >> > >> > LowercaseCharFilter is necessary, as in the MappingCharFilter we need to >> > provide a NormalizeCharMap. We lowercase the stream so as we only >> provide >> > lowercase maps in the NormalizeCharMap, e.g. we provide map >> > (c++-->cplusplus) instead of (c++-->cplusplus) and (C++-->cplusplus). >> > >> > C++ is only an example we want to fix, in the future we may add more >> such >> > special terms >> > >> > the code for LowercaseCharFilter is as follows: >> > package analysis; >> > >> > import java.io.IOException; >> > import java.io.Reader; >> > >> > import org.apache.lucene.analysis.BaseCharFilter; >> > import org.apache.lucene.analysis.CharReader; >> > import org.apache.lucene.analysis.CharStream; >> > >> > >> > public class LowercaseCharFilter extends BaseCharFilter >> > { >> > >> > public LowercaseCharFilter(CharStream in) >> > { >> > super(in); >> > } >> > >> > public LowercaseCharFilter(Reader in) >> > { >> > super(CharReader.get(in)); >> > } >> > @Override >> > public int read() throws IOException >> > { >> > return Character.toLowerCase(input.read()); >> > } >> > @Override >> > public int read(char[] cbuf, int off, int len) throws IOException { >> > int ret = input.read(cbuf, off, len); >> > if(ret!=-1) >> > { >> > for(int i=off; i<off+ret; i++) >> > cbuf[i] = Character.toLowerCase(cbuf[i]); >> > } >> > return ret; >> > } >> > } >> > >> > >> > Currently RosaMappingCharFilter is inherited from MappingCharFilter and >> > nothing is changed(i was planning to override addOffCorrectMap to fix my >> > problem, but it didn't work) >> > >> > >> > 2009/12/13 Uwe Schindler <u...@thetaphi.de> >> > >> > > I think your problem is theLowercaseCharFilter that does not pass >> > > correctOffset() to the underying CharFilter. Does it work better >> without >> > > your LowerCaseCharFilter (which is duplicate because there is already >> a >> > > LowerCaseFilter in the Tokenizer chain). >> > > >> > > As you are only looking for "c++", just also add a mapping for "C++" >> and >> > > you >> > > are done, why lowercasing all because of one char? >> > > >> > > And what's RosaMappingCharFilter? A pink one? *g* >> > > >> > > ----- >> > > Uwe Schindler >> > > H.-H.-Meier-Allee 63, D-28213 Bremen >> > > http://www.thetaphi.de >> > > eMail: u...@thetaphi.de >> > > >> > > > -----Original Message----- >> > > > From: Weiwei Wang [mailto:ww.wang...@gmail.com] >> > > > Sent: Sunday, December 13, 2009 12:23 PM >> > > > To: java-user@lucene.apache.org >> > > > Subject: Re: Recover special terms from StandardTokenizer >> > > > >> > > > thanks, Uwe. >> > > > Maybe i was not very clear. My situation is like this: >> > > > Analyzer: >> > > > NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap(); >> > > > RECOVERY_MAP.add("c++","cplusplus$"); >> > > > CharFilter filter = new LowercaseCharFilter(reader); >> > > > filter = new RosaMappingCharFilter(RECOVERY_MAP,filter); >> > > > StandardTokenizer tokenStream = new >> > > > StandardTokenizer(Version.LUCENE_30, >> > > > filter); >> > > > tokenStream.setMaxTokenLength(maxTokenLength); >> > > > TokenStream result = new StandardFilter(tokenStream); >> > > > result = getStopFilter(result); >> > > > result = new SnowballFilter(result, STEMMER); >> > > > Analyze c++c++, return >> > > > (0,9) [cplusplus] >> > > > (10,19) [cplusplus] >> > > > the two numbers in th**e brackets are offsets. >> > > > >> > > > So in the searching process when i want to hight the search keyword >> > c++ >> > > > with >> > > > the same analyzer, exception will be thrown because the string i >> > stored >> > > > are >> > > > c++c++ not cpluspluscplusplus(actually, i should not change the >> > original >> > > > string when storing them, otherwise it will confuse the users). >> > > > >> > > > I hope the analyzer can give result like this >> > > > (0,3) [cplusplus] >> > > > (3,6) [cplusplus] >> > > > then the Hilighter will works fine. >> > > > >> > > > So how can I achieve this result? >> > > > >> > > > 2009/12/13 Uwe Schindler <u...@thetaphi.de> >> > > > >> > > > > MappingCharFilter preserves the offsets in the stream *before* >> > > > filtering. >> > > > > So >> > > > > if you store the original string (without c++ replaced) in a >> stored >> > > > field >> > > > > you can highlight using the given offstes. The highlighter must >> use >> > > > again >> > > > > the same analyzer or use FastVectorHighlighter. >> > > > > >> > > > > ----- >> > > > > Uwe Schindler >> > > > > H.-H.-Meier-Allee 63, D-28213 Bremen >> > > > > http://www.thetaphi.de >> > > > > eMail: u...@thetaphi.de >> > > > > >> > > > > > -----Original Message----- >> > > > > > From: Weiwei Wang [mailto:ww.wang...@gmail.com] >> > > > > > Sent: Sunday, December 13, 2009 11:43 AM >> > > > > > To: java-user@lucene.apache.org >> > > > > > Subject: Re: Recover special terms from StandardTokenizer >> > > > > > >> > > > > > Problem solved. Now another problem comes. >> > > > > > >> > > > > > >> > > > > > As I want to use Highlighter in my system, the token offset is >> > > > incorrect >> > > > > > after the MappingCharFilter is used. >> > > > > > >> > > > > > Koji, do you known how to fix the offset problem? >> > > > > > >> > > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang >> > <ww.wang...@gmail.com> >> > > > > > wrote: >> > > > > > >> > > > > > > I use Luke to check the result and find only c exists as a >> term, >> > no >> > > > > > > cplusplus found in the index >> > > > > > > >> > > > > > > >> > > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang >> > > > > > <ww.wang...@gmail.com>wrote: >> > > > > > > >> > > > > > >> Thanks, Koji, I followed your advice and change my analyzer >> as >> > > > shown >> > > > > > >> below: >> > > > > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap(); >> > > > > > >> RECOVERY_MAP.add("c++","cplusplus$"); >> > > > > > >> CharFilter filter = new LowercaseCharFilter(reader); >> > > > > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter); >> > > > > > >> StandardTokenizer tokenStream = new >> > > > > > StandardTokenizer(Version.LUCENE_30, >> > > > > > >> filter); >> > > > > > >> tokenStream.setMaxTokenLength(maxTokenLength); >> > > > > > >> TokenStream result = new StandardFilter(tokenStream); >> > > > > > >> result = new LowerCaseFilter(result); >> > > > > > >> result = new StopFilter(enableStopPositionIncrements, result, >> > > > > stopSet); >> > > > > > >> result = new SnowballFilter(result, STEMMER); >> > > > > > >> >> > > > > > >> I use the same analyzer in the search side. As you know, this >> > > > analyzer >> > > > > > can >> > > > > > >> token c++ as cplusplus, for this reason, it seems I can >> search >> > c++ >> > > > > with >> > > > > > >> the same analyzer because it is also tokenized as cplusplus. >> > > > > > >> >> > > > > > >> I tested it on as string c++c++, however, when i search c++ >> on >> > the >> > > > > > built >> > > > > > >> index, nothing is returned. >> > > > > > >> >> > > > > > >> I do not know what's wrong with my code. Waiting for your >> > replay >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang >> > > > > > <ww.wang...@gmail.com>wrote: >> > > > > > >> >> > > > > > >>> Thanks, Koji >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi >> > > > > > <k...@r.email.ne.jp>wrote: >> > > > > > >>> >> > > > > > >>>> MappingCharFilter can be used to convert c++ to cplusplus. >> > > > > > >>>> >> > > > > > >>>> Koji >> > > > > > >>>> >> > > > > > >>>> -- >> > > > > > >>>> http://www.rondhuit.com/en/ >> > > > > > >>>> >> > > > > > >>>> >> > > > > > >>>> >> > > > > > >>>> Anshum wrote: >> > > > > > >>>> >> > > > > > >>>>> How about getting the original token stream and then >> > converting >> > > > c++ >> > > > > > to >> > > > > > >>>>> cplusplus or anyother such transform. Or perhaps you might >> > look >> > > > at >> > > > > > >>>>> using/extending(in the non java sense) some other >> tokenized! >> > > > > > >>>>> >> > > > > > >>>>> -- >> > > > > > >>>>> Anshum Gupta >> > > > > > >>>>> Naukri Labs! >> > > > > > >>>>> http://ai-cafe.blogspot.com >> > > > > > >>>>> >> > > > > > >>>>> The facts expressed here belong to everybody, the opinions >> > to >> > > > me. >> > > > > > The >> > > > > > >>>>> distinction is yours to draw............ >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang < >> > > > > ww.wang...@gmail.com> >> > > > > > >>>>> wrote: >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>>>> Hi, all, >> > > > > > >>>>>> I designed a ftp search engine based on Lucene. I did >> a >> > few >> > > > > > >>>>>> modifications to the StandardTokenizer. >> > > > > > >>>>>> My problem is: >> > > > > > >>>>>> C++ is tokenized as c from StandardTokenizer and I want >> to >> > > > > recover >> > > > > > it >> > > > > > >>>>>> from >> > > > > > >>>>>> the TokenStream from StandardTokenizer >> > > > > > >>>>>> >> > > > > > >>>>>> What should I do? >> > > > > > >>>>>> >> > > > > > >>>>>> -- >> > > > > > >>>>>> Weiwei Wang >> > > > > > >>>>>> Alex Wang >> > > > > > >>>>>> 王巍巍 >> > > > > > >>>>>> Room 403, Mengmin Wei Building >> > > > > > >>>>>> Computer Science Department >> > > > > > >>>>>> Gulou Campus of Nanjing University >> > > > > > >>>>>> Nanjing, P.R.China, 210093 >> > > > > > >>>>>> >> > > > > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang >> > > > > > >>>>>> >> > > > > > >>>>>> >> > > > > > >>>>>> >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>>> >> > > > > > >>>> >> > > > > > >>>> >> > > > > > >>>> >> > > > > > >>>> >> > > > > >> -------------------------------------------------------------------- >> > - >> > > > > > >>>> To unsubscribe, e-mail: java-user- >> > unsubscr...@lucene.apache.org >> > > > > > >>>> For additional commands, e-mail: >> > > java-user-h...@lucene.apache.org >> > > > > > >>>> >> > > > > > >>>> >> > > > > > >>> >> > > > > > >>> >> > > > > > >>> -- >> > > > > > >>> Weiwei Wang >> > > > > > >>> Alex Wang >> > > > > > >>> 王巍巍 >> > > > > > >>> Room 403, Mengmin Wei Building >> > > > > > >>> Computer Science Department >> > > > > > >>> Gulou Campus of Nanjing University >> > > > > > >>> Nanjing, P.R.China, 210093 >> > > > > > >>> >> > > > > > >>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang >> > > > > > >>> >> > > > > > >> >> > > > > > >> >> > > > > > >> >> > > > > > >> -- >> > > > > > >> Weiwei Wang >> > > > > > >> Alex Wang >> > > > > > >> 王巍巍 >> > > > > > >> Room 403, Mengmin Wei Building >> > > > > > >> Computer Science Department >> > > > > > >> Gulou Campus of Nanjing University >> > > > > > >> Nanjing, P.R.China, 210093 >> > > > > > >> >> > > > > > >> Homepage: http://cs.nju.edu.cn/rl/weiweiwang >> > > > > > >> >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > -- >> > > > > > > Weiwei Wang >> > > > > > > Alex Wang >> > > > > > > 王巍巍 >> > > > > > > Room 403, Mengmin Wei Building >> > > > > > > Computer Science Department >> > > > > > > Gulou Campus of Nanjing University >> > > > > > > Nanjing, P.R.China, 210093 >> > > > > > > >> > > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang >> > > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > -- >> > > > > > Weiwei Wang >> > > > > > Alex Wang >> > > > > > 王巍巍 >> > > > > > Room 403, Mengmin Wei Building >> > > > > > Computer Science Department >> > > > > > Gulou Campus of Nanjing University >> > > > > > Nanjing, P.R.China, 210093 >> > > > > > >> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang >> > > > > >> > > > > >> > > > > >> -------------------------------------------------------------------- >> > - >> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > > > >> > > > > >> > > > >> > > > >> > > > -- >> > > > Weiwei Wang >> > > > Alex Wang >> > > > 王巍巍 >> > > > Room 403, Mengmin Wei Building >> > > > Computer Science Department >> > > > Gulou Campus of Nanjing University >> > > > Nanjing, P.R.China, 210093 >> > > > >> > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang >> > > >> > > >> > > --------------------------------------------------------------------- >> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> > > For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > >> > > >> > >> > >> > -- >> > Weiwei Wang >> > Alex Wang >> > 王巍巍 >> > Room 403, Mengmin Wei Building >> > Computer Science Department >> > Gulou Campus of Nanjing University >> > Nanjing, P.R.China, 210093 >> > >> > Homepage: http://cs.nju.edu.cn/rl/weiweiwang >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > -- > Weiwei Wang > Alex Wang > 王巍巍 > Room 403, Mengmin Wei Building > Computer Science Department > Gulou Campus of Nanjing University > Nanjing, P.R.China, 210093 > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang > -- Weiwei Wang Alex Wang 王巍巍 Room 403, Mengmin Wei Building Computer Science Department Gulou Campus of Nanjing University Nanjing, P.R.China, 210093 Homepage: http://cs.nju.edu.cn/rl/weiweiwang