Re: Recover special terms from StandardTokenizer

Weiwei Wang Sun, 13 Dec 2009 05:46:56 -0800

Another problem

how to deal with c++c++ or c++abc using MappingCharFilter


i use a NormalizeMap("c++","cplusplus"), the analyzed result will be
cpluspluscplusplus or cplusplusabc wich is not what i want

if i use a NormalizeMap("c++","cplusplus$"), the offset will not be correct.



On Sun, Dec 13, 2009 at 9:38 PM, Weiwei Wang <ww.wang...@gmail.com> wrote:

> Thank you very much, Uwe, I found the problem.
>
>
> 2009/12/13 Uwe Schindler <u...@thetaphi.de>
>
>> MappingCharFilter definitely preserves the offsets from the original
>> reader.
>> Yo can verify that for your case with Lucene’s testcase
>> TestMappingCharFilter in the source distribution @
>> /src/test/org/apache/lucene/analysis/TestMappingCharFilter.java:
>>
>> public void test2to4() throws Exception {
>>  CharStream cs = new MappingCharFilter( normMap, new StringReader( "ll" )
>> );
>>  TokenStream ts = new WhitespaceTokenizer( cs );
>>  assertTokenStreamContents(ts, new String[]{"llll"}, new int[]{0}, new
>> int[]
>> {2});
>> }
>>
>> So there is everything correct. I tried this test also with
>> StandrdTokenizer
>> instead of WhiteSpaceTokenizer - it works and asserts the correct offsets.
>> You should debug through the incrementToken()/CharFilter calls and verify
>> where your offsets change.
>>
>> I cannot help more.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Weiwei Wang [mailto:ww.wang...@gmail.com]
>> > Sent: Sunday, December 13, 2009 12:51 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: Recover special terms from StandardTokenizer
>> >
>> > LowercaseCharFilter is necessary, as in the MappingCharFilter we need to
>> > provide a NormalizeCharMap. We lowercase the stream so as we only
>> provide
>> > lowercase maps in the NormalizeCharMap, e.g. we provide map
>> > (c++-->cplusplus) instead of (c++-->cplusplus) and (C++-->cplusplus).
>> >
>> > C++ is only an example we want to fix, in the future we may add more
>> such
>> > special terms
>> >
>> > the code for LowercaseCharFilter is as follows:
>> > package analysis;
>> >
>> > import java.io.IOException;
>> > import java.io.Reader;
>> >
>> > import org.apache.lucene.analysis.BaseCharFilter;
>> > import org.apache.lucene.analysis.CharReader;
>> > import org.apache.lucene.analysis.CharStream;
>> >
>> >
>> > public class LowercaseCharFilter extends BaseCharFilter
>> > {
>> >
>> >     public LowercaseCharFilter(CharStream in)
>> >     {
>> >     super(in);
>> >     }
>> >
>> >     public LowercaseCharFilter(Reader in)
>> >     {
>> >     super(CharReader.get(in));
>> >     }
>> >     @Override
>> >     public int read() throws IOException
>> >     {
>> >     return Character.toLowerCase(input.read());
>> >     }
>> >     @Override
>> >     public int read(char[] cbuf, int off, int len) throws IOException {
>> >     int ret = input.read(cbuf, off, len);
>> >     if(ret!=-1)
>> >     {
>> >         for(int i=off; i<off+ret; i++)
>> >         cbuf[i] = Character.toLowerCase(cbuf[i]);
>> >     }
>> >     return ret;
>> >     }
>> > }
>> >
>> >
>> > Currently RosaMappingCharFilter is inherited from MappingCharFilter and
>> > nothing is changed(i was planning to override addOffCorrectMap to fix my
>> > problem, but it didn't work)
>> >
>> >
>> > 2009/12/13 Uwe Schindler <u...@thetaphi.de>
>> >
>> > > I think your problem is theLowercaseCharFilter that does not pass
>> > > correctOffset() to the underying CharFilter. Does it work better
>> without
>> > > your LowerCaseCharFilter (which is duplicate because there is already
>> a
>> > > LowerCaseFilter in the Tokenizer chain).
>> > >
>> > > As you are only looking for "c++", just also add a mapping for "C++"
>> and
>> > > you
>> > > are done, why lowercasing all because of one char?
>> > >
>> > > And what's RosaMappingCharFilter? A pink one? *g*
>> > >
>> > > -----
>> > > Uwe Schindler
>> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > http://www.thetaphi.de
>> > > eMail: u...@thetaphi.de
>> > >
>> > > > -----Original Message-----
>> > > > From: Weiwei Wang [mailto:ww.wang...@gmail.com]
>> > > > Sent: Sunday, December 13, 2009 12:23 PM
>> > > > To: java-user@lucene.apache.org
>> > > > Subject: Re: Recover special terms from StandardTokenizer
>> > > >
>> > > > thanks, Uwe.
>> > > > Maybe i was not very clear. My situation is like this:
>> > > > Analyzer:
>> > > >    NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>> > > >    RECOVERY_MAP.add("c++","cplusplus$");
>> > > >     CharFilter filter = new LowercaseCharFilter(reader);
>> > > >     filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
>> > > >     StandardTokenizer tokenStream = new
>> > > > StandardTokenizer(Version.LUCENE_30,
>> > > > filter);
>> > > >     tokenStream.setMaxTokenLength(maxTokenLength);
>> > > >     TokenStream result = new StandardFilter(tokenStream);
>> > > >     result = getStopFilter(result);
>> > > >     result = new SnowballFilter(result, STEMMER);
>> > > > Analyze c++c++, return
>> > > > (0,9)  [cplusplus]
>> > > > (10,19)  [cplusplus]
>> > > > the two numbers in th**e brackets are offsets.
>> > > >
>> > > > So in the searching process when i want to hight the search keyword
>> > c++
>> > > > with
>> > > > the same analyzer, exception will be thrown because the string i
>> > stored
>> > > > are
>> > > > c++c++ not cpluspluscplusplus(actually, i should not change the
>> > original
>> > > > string when storing them, otherwise it will confuse the users).
>> > > >
>> > > > I hope the analyzer can give result like this
>> > > > (0,3) [cplusplus]
>> > > > (3,6) [cplusplus]
>> > > > then the Hilighter will works fine.
>> > > >
>> > > > So how can I achieve this result?
>> > > >
>> > > > 2009/12/13 Uwe Schindler <u...@thetaphi.de>
>> > > >
>> > > > > MappingCharFilter preserves the offsets in the stream *before*
>> > > > filtering.
>> > > > > So
>> > > > > if you store the original string (without c++ replaced) in a
>> stored
>> > > > field
>> > > > > you can highlight using the given offstes. The highlighter must
>> use
>> > > > again
>> > > > > the same analyzer or use FastVectorHighlighter.
>> > > > >
>> > > > > -----
>> > > > > Uwe Schindler
>> > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > > > http://www.thetaphi.de
>> > > > > eMail: u...@thetaphi.de
>> > > > >
>> > > > > > -----Original Message-----
>> > > > > > From: Weiwei Wang [mailto:ww.wang...@gmail.com]
>> > > > > > Sent: Sunday, December 13, 2009 11:43 AM
>> > > > > > To: java-user@lucene.apache.org
>> > > > > > Subject: Re: Recover special terms from StandardTokenizer
>> > > > > >
>> > > > > > Problem solved. Now another problem comes.
>> > > > > >
>> > > > > >
>> > > > > > As I want to use Highlighter in my system, the token offset is
>> > > > incorrect
>> > > > > > after the MappingCharFilter is used.
>> > > > > >
>> > > > > > Koji, do you known how to fix the offset problem?
>> > > > > >
>> > > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang
>> > <ww.wang...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > I use Luke to check the result and find only c exists as a
>> term,
>> > no
>> > > > > > > cplusplus found in the index
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang
>> > > > > > <ww.wang...@gmail.com>wrote:
>> > > > > > >
>> > > > > > >> Thanks, Koji, I followed your advice and change my analyzer
>> as
>> > > > shown
>> > > > > > >> below:
>> > > > > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
>> > > > > > >> RECOVERY_MAP.add("c++","cplusplus$");
>> > > > > > >> CharFilter filter = new LowercaseCharFilter(reader);
>> > > > > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
>> > > > > > >> StandardTokenizer tokenStream = new
>> > > > > > StandardTokenizer(Version.LUCENE_30,
>> > > > > > >> filter);
>> > > > > > >> tokenStream.setMaxTokenLength(maxTokenLength);
>> > > > > > >> TokenStream result = new StandardFilter(tokenStream);
>> > > > > > >> result = new LowerCaseFilter(result);
>> > > > > > >> result = new StopFilter(enableStopPositionIncrements, result,
>> > > > > stopSet);
>> > > > > > >> result = new SnowballFilter(result, STEMMER);
>> > > > > > >>
>> > > > > > >> I use the same analyzer in the search side. As you know, this
>> > > > analyzer
>> > > > > > can
>> > > > > > >> token c++ as cplusplus, for this reason, it seems I can
>> search
>> > c++
>> > > > > with
>> > > > > > >> the same analyzer because it is also tokenized as cplusplus.
>> > > > > > >>
>> > > > > > >> I tested it on as string c++c++, however, when i search c++
>> on
>> > the
>> > > > > > built
>> > > > > > >> index, nothing is returned.
>> > > > > > >>
>> > > > > > >>  I do not know what's wrong with my code. Waiting for your
>> > replay
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang
>> > > > > > <ww.wang...@gmail.com>wrote:
>> > > > > > >>
>> > > > > > >>> Thanks, Koji
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi
>> > > > > > <k...@r.email.ne.jp>wrote:
>> > > > > > >>>
>> > > > > > >>>> MappingCharFilter can be used to convert c++ to cplusplus.
>> > > > > > >>>>
>> > > > > > >>>> Koji
>> > > > > > >>>>
>> > > > > > >>>> --
>> > > > > > >>>> http://www.rondhuit.com/en/
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>> Anshum wrote:
>> > > > > > >>>>
>> > > > > > >>>>> How about getting the original token stream and then
>> > converting
>> > > > c++
>> > > > > > to
>> > > > > > >>>>> cplusplus or anyother such transform. Or perhaps you might
>> > look
>> > > > at
>> > > > > > >>>>> using/extending(in the non java sense) some other
>> tokenized!
>> > > > > > >>>>>
>> > > > > > >>>>> --
>> > > > > > >>>>> Anshum Gupta
>> > > > > > >>>>> Naukri Labs!
>> > > > > > >>>>> http://ai-cafe.blogspot.com
>> > > > > > >>>>>
>> > > > > > >>>>> The facts expressed here belong to everybody, the opinions
>> > to
>> > > > me.
>> > > > > > The
>> > > > > > >>>>> distinction is yours to draw............
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei Wang <
>> > > > > ww.wang...@gmail.com>
>> > > > > > >>>>> wrote:
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>>>> Hi, all,
>> > > > > > >>>>>>    I designed a ftp search engine based on Lucene. I did
>> a
>> > few
>> > > > > > >>>>>> modifications to the StandardTokenizer.
>> > > > > > >>>>>> My problem is:
>> > > > > > >>>>>>  C++ is tokenized as c from StandardTokenizer and I want
>> to
>> > > > > recover
>> > > > > > it
>> > > > > > >>>>>> from
>> > > > > > >>>>>> the TokenStream from StandardTokenizer
>> > > > > > >>>>>>
>> > > > > > >>>>>> What should I do?
>> > > > > > >>>>>>
>> > > > > > >>>>>> --
>> > > > > > >>>>>> Weiwei Wang
>> > > > > > >>>>>> Alex Wang
>> > > > > > >>>>>> 王巍巍
>> > > > > > >>>>>> Room 403, Mengmin Wei Building
>> > > > > > >>>>>> Computer Science Department
>> > > > > > >>>>>> Gulou Campus of Nanjing University
>> > > > > > >>>>>> Nanjing, P.R.China, 210093
>> > > > > > >>>>>>
>> > > > > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > > > >>>>>>
>> > > > > > >>>>>>
>> > > > > > >>>>>>
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > >
>> --------------------------------------------------------------------
>> > -
>> > > > > > >>>> To unsubscribe, e-mail: java-user-
>> > unsubscr...@lucene.apache.org
>> > > > > > >>>> For additional commands, e-mail:
>> > > java-user-h...@lucene.apache.org
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> --
>> > > > > > >>> Weiwei Wang
>> > > > > > >>> Alex Wang
>> > > > > > >>> 王巍巍
>> > > > > > >>> Room 403, Mengmin Wei Building
>> > > > > > >>> Computer Science Department
>> > > > > > >>> Gulou Campus of Nanjing University
>> > > > > > >>> Nanjing, P.R.China, 210093
>> > > > > > >>>
>> > > > > > >>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > > > >>>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> --
>> > > > > > >> Weiwei Wang
>> > > > > > >> Alex Wang
>> > > > > > >> 王巍巍
>> > > > > > >> Room 403, Mengmin Wei Building
>> > > > > > >> Computer Science Department
>> > > > > > >> Gulou Campus of Nanjing University
>> > > > > > >> Nanjing, P.R.China, 210093
>> > > > > > >>
>> > > > > > >> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > > > >>
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Weiwei Wang
>> > > > > > > Alex Wang
>> > > > > > > 王巍巍
>> > > > > > > Room 403, Mengmin Wei Building
>> > > > > > > Computer Science Department
>> > > > > > > Gulou Campus of Nanjing University
>> > > > > > > Nanjing, P.R.China, 210093
>> > > > > > >
>> > > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Weiwei Wang
>> > > > > > Alex Wang
>> > > > > > 王巍巍
>> > > > > > Room 403, Mengmin Wei Building
>> > > > > > Computer Science Department
>> > > > > > Gulou Campus of Nanjing University
>> > > > > > Nanjing, P.R.China, 210093
>> > > > > >
>> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > >
>> > > > >
>> > > > >
>> --------------------------------------------------------------------
>> > -
>> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Weiwei Wang
>> > > > Alex Wang
>> > > > 王巍巍
>> > > > Room 403, Mengmin Wei Building
>> > > > Computer Science Department
>> > > > Gulou Campus of Nanjing University
>> > > > Nanjing, P.R.China, 210093
>> > > >
>> > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > >
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> > >
>> > >
>> >
>> >
>> > --
>> > Weiwei Wang
>> > Alex Wang
>> > 王巍巍
>> > Room 403, Mengmin Wei Building
>> > Computer Science Department
>> > Gulou Campus of Nanjing University
>> > Nanjing, P.R.China, 210093
>> >
>> > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Weiwei Wang
> Alex Wang
> 王巍巍
> Room 403, Mengmin Wei Building
> Computer Science Department
> Gulou Campus of Nanjing University
> Nanjing, P.R.China, 210093
>
> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>



-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Re: Recover special terms from StandardTokenizer

Reply via email to