KK, can you give me an example of some Indian text for which it is doing this?
Thanks!

On Mon, Jun 8, 2009 at 1:03 AM, KK <dioxide.softw...@gmail.com> wrote:
> Hi Robert,
> The problem is that WordDelimiterFilter is doing its job for English
> content, but for non-English Indian content, which is Unicode, it
> highlights the searched word and, along with that, it also highlights the
> individual characters of that word, which was not happening without
> WordDelimiterFilter; that's my concern. Say, for example, I searched for a
> Hindi word, say "xyz ab" [assume these are in Hindi]; then in the search
> results it highlights these words, but it also highlights x/y/z/a/b
> wherever these letters appear, which obviously looks bad. It should only
> highlight words, not the letters therein. I hope I made it clear. What
> could be the reason for this? Any ideas on fixing the same?
>
> Thanks,
> KK
>
> On Sat, Jun 6, 2009 at 9:45 PM, Robert Muir <rcm...@gmail.com> wrote:
>> KK, I haven't had that experience with WordDelimiterFilter on Indian
>> languages. Is it possible you could provide me an example of how it's
>> creating a nuisance?
>>
>> On Sat, Jun 6, 2009 at 9:42 AM, KK <dioxide.softw...@gmail.com> wrote:
>>> Robert, I tried to use WordDelimiterFilter from solr-nightly by putting
>>> it in my working directory for this project. I set the parameters as
>>> you told me. I must accept that it's splitting words around those chars
>>> [like . @ etc.], but along with that it's messing with other
>>> non-English/Unicode content, and that's creating a nuisance. I don't
>>> want WordDelimiterFilter to fiddle around with my non-English content.
>>> This is what I'm doing:
>>>
>>> /**
>>>  * Analyzer for Indian language.
>>>  */
>>> public class IndicAnalyzer extends Analyzer {
>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>     TokenStream ts = new WhitespaceTokenizer(reader);
>>>     ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>     ts = new LowerCaseFilter(ts);
>>>     ts = new PorterStemFilter(ts);
>>>     return ts;
>>>   }
>>> }
>>>
>>> I have to use the deprecated API for setting the 5 values; that's fine,
>>> but somehow it's messing with Unicode content. How do I get rid of
>>> that? Any thoughts? It seems setting those values in some proper way
>>> might fix the problem; I'm not sure, though.
>>>
>>> Thanks,
>>> KK.
>>>
>>> On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>> KK, an easier solution to your first problem is to use
>>>> WordDelimiterFilterFactory if possible... you can get an instance of
>>>> WordDelimiterFilter from that.
>>>>
>>>> Thanks,
>>>> Robert
>>>>
>>>> On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir <rcm...@gmail.com> wrote:
>>>>> KK, as for your first issue, that WordDelimiterFilter is package
>>>>> protected; one option is to make a copy of the code and change the
>>>>> class declaration to public. The other option is to put your entire
>>>>> analyzer in the 'org.apache.solr.analysis' package so that you can
>>>>> access it...
>>>>>
>>>>> For the 2nd issue, yes, you need to supply some options to it. The
>>>>> default options Solr applies to type 'text' seemed to work well for
>>>>> me with Indic:
>>>>>
>>>>> {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
>>>>> generateWordParts=1, catenateAll=0, catenateNumbers=1}
>>>>>
>>>>> On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>> Thanks Robert. There is one problem, though, with plugging in the
>>>>>> WordDelimiterFilter from the solr-nightly jar file.
>>>>>> When I tried to do something like:
>>>>>>
>>>>>> TokenStream ts = new WhitespaceTokenizer(reader);
>>>>>> ts = new WordDelimiterFilter(ts);
>>>>>> ts = new PorterStemmerFilter(ts);
>>>>>> ...rest as in the last mail...
>>>>>>
>>>>>> it gave me an error saying:
>>>>>>
>>>>>> org.apache.solr.analysis.WordDelimiterFilter is not public in
>>>>>> org.apache.solr.analysis; cannot be accessed from outside package
>>>>>> import org.apache.solr.analysis.WordDelimiterFilter;
>>>>>> ^
>>>>>> solrSearch/IndicAnalyzer.java:38: cannot find symbol
>>>>>> symbol  : class WordDelimiterFilter
>>>>>> location: class solrSearch.IndicAnalyzer
>>>>>> ts = new WordDelimiterFilter(ts);
>>>>>> ^
>>>>>> 2 errors
>>>>>>
>>>>>> Then I tried to look at the code for WordDelimiterFilter in the
>>>>>> solr-nightly src and found that there are many deprecated
>>>>>> constructors, though they require a lot of parameters along with the
>>>>>> TokenStream. I went through the Solr wiki for
>>>>>> WordDelimiterFilterFactory here:
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
>>>>>> and saw that there, too, it's specified that we have to mention the
>>>>>> parameters, and they are different for indexing and querying.
>>>>>> I'm kind of stuck here: how do I make use of WordDelimiterFilter in
>>>>>> my custom analyzer? I have to use it anyway. In my code I have to
>>>>>> make use of WordDelimiterFilter and not WordDelimiterFilterFactory,
>>>>>> right? I don't know what the use of the other one is. Anyway, can
>>>>>> you guide me in getting rid of the above error? And yes, I'll change
>>>>>> the order of applying the filters as you said.
>>>>>>
>>>>>> Thanks,
>>>>>> KK.
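A plain-Java sketch of the subword splitting being discussed (a simplified, hypothetical stand-in for WordDelimiterFilter, not the real Solr code) shows where single-character behavior can come from: if combining marks such as the Devanagari matras are not treated as word characters, an Indic token shatters into single letters.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSketch {
    // Hypothetical, simplified stand-in for WordDelimiterFilter's subword
    // splitting (NOT the real Solr code). A token is cut wherever a
    // non-word character appears. Whether Devanagari matras (combining
    // marks) count as word characters decides whether an Indic token
    // survives intact or shatters into single letters.
    static List<String> split(String token, boolean keepCombiningMarks) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : token.toCharArray()) {
            int type = Character.getType(c);
            boolean mark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK;
            boolean wordChar = Character.isLetterOrDigit(c)
                    || (keepCombiningMarks && mark);
            if (wordChar) {
                cur.append(c);
            } else if (cur.length() > 0) {
                parts.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }

    public static void main(String[] args) {
        String hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"; // a Hindi word in Devanagari
        System.out.println(split("mail@host.com", true)); // [mail, host, com]
        System.out.println(split(hindi, true));  // one whole token
        System.out.println(split(hindi, false)); // three single letters: the reported symptom
    }
}
```

If the real analysis chain ends up indexing such single-letter subword tokens, the highlighter will happily mark them wherever they occur, which matches the symptom KK describes.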
>>>>>> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>> KK, you got the right idea.
>>>>>>>
>>>>>>> Though I think you might want to change the order: move the
>>>>>>> StopFilter before the PorterStemFilter... otherwise it might not
>>>>>>> work correctly.
>>>>>>>
>>>>>>> On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>> Thanks Robert. This is exactly what I did, and it's working, but
>>>>>>>> the delimiter is missing; I'm going to add that from
>>>>>>>> solr-nightly.jar.
>>>>>>>>
>>>>>>>> /**
>>>>>>>>  * Analyzer for Indian language.
>>>>>>>>  */
>>>>>>>> public class IndicAnalyzer extends Analyzer {
>>>>>>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>     TokenStream ts = new WhitespaceTokenizer(reader);
>>>>>>>>     ts = new PorterStemFilter(ts);
>>>>>>>>     ts = new LowerCaseFilter(ts);
>>>>>>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>>>>>>     return ts;
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> It's able to do stemming/case-folding and supports search for both
>>>>>>>> English and Indic texts. Let me try out the delimiter; will update
>>>>>>>> you on that.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> KK
>>>>>>>>
>>>>>>>> On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>> I think you are on the right track... once you build your
>>>>>>>>> analyzer, put it in your classpath and play around with it in
>>>>>>>>> Luke and see if it does what you want.
>>>>>>>>>
>>>>>>>>> On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>> Hi Robert,
>>>>>>>>>> This is what I copied from the ThaiAnalyzer @ lucene contrib:
>>>>>>>>>>
>>>>>>>>>> public class ThaiAnalyzer extends Analyzer {
>>>>>>>>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>>>     TokenStream ts = new StandardTokenizer(reader);
>>>>>>>>>>     ts = new StandardFilter(ts);
>>>>>>>>>>     ts = new ThaiWordFilter(ts);
>>>>>>>>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>>>>>>>>     return ts;
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Now, as you said, I have to use WhitespaceTokenizer with
>>>>>>>>>> WordDelimiterFilter [solr-nightly.jar], stop-word removal, the
>>>>>>>>>> Porter stemmer, etc., so it is something like this:
>>>>>>>>>>
>>>>>>>>>> public class IndicAnalyzer extends Analyzer {
>>>>>>>>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>>>     TokenStream ts = new WhitespaceTokenizer(reader);
>>>>>>>>>>     ts = new WordDelimiterFilter(ts);
>>>>>>>>>>     ts = new LowerCaseFilter(ts);
>>>>>>>>>>     // English stop filter -- is this the default one?
>>>>>>>>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>>>>>>>>     ts = new PorterStemFilter(ts);
>>>>>>>>>>     return ts;
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Does this sound OK? I think it will do the job... let me try it
>>>>>>>>>> out. I don't need a custom filter as per my requirements, at
>>>>>>>>>> least not for these basic things I'm doing? I think so...
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> KK.
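The ordering advice (LowerCaseFilter and StopFilter before PorterStemFilter) can be demonstrated with a toy pipeline in plain Java. The `stem` method below is a hypothetical stand-in that only strips a trailing "s"; the real Porter stemmer does at least that much (it reduces "was" to "wa" and "this" to "thi"), which is exactly why a stop filter placed after it stops matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FilterOrder {
    // A few entries from an English stop list; "this" and "was" are
    // among StopAnalyzer.ENGLISH_STOP_WORDS.
    static final Set<String> STOP = Set.of("this", "was", "the");

    // Toy stand-in for the Porter stemmer: strip a trailing "s".
    static String stem(String t) {
        return t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }

    // LowerCase -> Stop -> Stem: stop words are removed before the
    // stemmer can mangle them.
    static List<String> goodOrder(List<String> tokens) {
        return tokens.stream()
                .map(String::toLowerCase)
                .filter(t -> !STOP.contains(t))
                .map(FilterOrder::stem)
                .collect(Collectors.toList());
    }

    // Stem first: "was" has already become "wa" by the time the stop
    // filter runs, so it slips through into the index.
    static List<String> badOrder(List<String> tokens) {
        return tokens.stream()
                .map(FilterOrder::stem)
                .map(String::toLowerCase)
                .filter(t -> !STOP.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("This", "was", "Detection");
        System.out.println(goodOrder(in)); // [detection]
        System.out.println(badOrder(in));  // [thi, wa, detection]
    }
}
```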
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>> KK, well, you can always get some good examples from the lucene
>>>>>>>>>>> contrib codebase. For example, look at the DutchAnalyzer,
>>>>>>>>>>> especially:
>>>>>>>>>>>
>>>>>>>>>>> TokenStream tokenStream(String fieldName, Reader reader)
>>>>>>>>>>>
>>>>>>>>>>> See how it combines a specified tokenizer with various filters?
>>>>>>>>>>> This is what you want to do, except of course you want to use a
>>>>>>>>>>> different tokenizer and filters.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>> Thanks Muir.
>>>>>>>>>>>> Thanks for letting me know that I don't need language
>>>>>>>>>>>> identifiers. I'll have a look and will try to write the
>>>>>>>>>>>> analyzer. For my case I think it won't be that difficult.
>>>>>>>>>>>> BTW, can you point me to some sample code/tutorials on writing
>>>>>>>>>>>> custom analyzers? I could not find anything in LIA 2nd Edn. Is
>>>>>>>>>>>> something there? Do let me know.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> KK.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>>>> KK, for your case, you don't really need to go to the effort
>>>>>>>>>>>>> of detecting whether fragments are English or not.
>>>>>>>>>>>>> Because the English stemmers in lucene will not modify your
>>>>>>>>>>>>> Indic text, and neither will the LowerCaseFilter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What you want to do is create a custom analyzer that works
>>>>>>>>>>>>> like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> WhitespaceTokenizer with WordDelimiterFilter [from Solr
>>>>>>>>>>>>> nightly jar], LowerCaseFilter, StopFilter, and
>>>>>>>>>>>>> PorterStemFilter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Robert
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>>>> Thank you all.
>>>>>>>>>>>>>> To be frank, I was using Solr in the beginning, half a month
>>>>>>>>>>>>>> ago. The problem [rather, bug] with Solr was creation of a
>>>>>>>>>>>>>> new index on the fly. Though they have a RESTful method for
>>>>>>>>>>>>>> the same, it was not working. If I remember properly, one of
>>>>>>>>>>>>>> the Solr committers, "Noble Paul" [I don't know his real
>>>>>>>>>>>>>> name], was trying to help me. I tried many nightly builds,
>>>>>>>>>>>>>> and spending a couple of days stuck at that made me think of
>>>>>>>>>>>>>> lucene, and I switched to it.
>>>>>>>>>>>>>> Now, after working with lucene, which gives you full control
>>>>>>>>>>>>>> of everything, I don't want to switch to Solr. [LOL, to me
>>>>>>>>>>>>>> Solr:Lucene is similar to Window$:Linux; it's my view only,
>>>>>>>>>>>>>> though.] Coming back to the point: as Uwe mentioned, we can
>>>>>>>>>>>>>> do the same thing in lucene as well that is available in
>>>>>>>>>>>>>> Solr; Solr is based on Lucene only, right?
>>>>>>>>>>>>>> I request Uwe to give me some more ideas on using the
>>>>>>>>>>>>>> analyzers from Solr that will do the job for me, handling a
>>>>>>>>>>>>>> mix of both English and non-English content.
>>>>>>>>>>>>>> Muir, can you give me a bit more detailed description of how
>>>>>>>>>>>>>> to use the WordDelimiterFilter to do my job?
>>>>>>>>>>>>>> On a side note, I was thinking of writing a simple analyzer
>>>>>>>>>>>>>> that will do the following:
>>>>>>>>>>>>>> # If the webpage fragment is non-English [for me it's some
>>>>>>>>>>>>>> Indian language], then index it as such; no stemming/
>>>>>>>>>>>>>> stop-word removal to begin with. As I know, it's in UCN
>>>>>>>>>>>>>> Unicode, something like \u0021\u0012\u34ae\u0031 [just a
>>>>>>>>>>>>>> sample].
>>>>>>>>>>>>>> # If the fragment is English, then apply the standard
>>>>>>>>>>>>>> analyzing process for English content. I've not thought of
>>>>>>>>>>>>>> querying in the same way as of now, i.e. a mix of
>>>>>>>>>>>>>> non-English and English words.
>>>>>>>>>>>>>> Now, to get all this:
>>>>>>>>>>>>>> #1. I need some sort of way which will let me know if the
>>>>>>>>>>>>>> content is English or not. If not English, just add the
>>>>>>>>>>>>>> tokens to the document. Do we really need language
>>>>>>>>>>>>>> identifiers, as I don't have any other content that uses the
>>>>>>>>>>>>>> same script as English, other than those \u1234 things for
>>>>>>>>>>>>>> my Indian-language content? Any smart hack/trick for the
>>>>>>>>>>>>>> same?
>>>>>>>>>>>>>> #2. If it's English, apply all the normal processing and add
>>>>>>>>>>>>>> the stemmed tokens to the document.
>>>>>>>>>>>>>> For all this I was thinking of iterating over each word of
>>>>>>>>>>>>>> the web page, applying the above procedure, and finally
>>>>>>>>>>>>>> adding the newly created document to the index.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like someone to guide me in this direction.
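The "smart hack" asked for in #1 can be script inspection rather than full language identification, since only two scripts are in play here (Latin and Devanagari). A hedged sketch using only the JDK (`Character.UnicodeScript`, available since Java 7):

```java
public class ScriptDetect {
    // Script-based "language detection" for the two-script case: if a
    // token contains any Devanagari code point, treat it as Indic and
    // skip the English stemming/lowercasing path. No external language
    // identifier is needed.
    static boolean isDevanagari(String token) {
        return token.codePoints().anyMatch(cp ->
                Character.UnicodeScript.of(cp) == Character.UnicodeScript.DEVANAGARI);
    }

    public static void main(String[] args) {
        System.out.println(isDevanagari("Detection")); // false -> stem/lowercase it
        System.out.println(isDevanagari("\u0939\u093f\u0928\u094d\u0926\u0940")); // true -> index as-is
    }
}
```

As Robert notes elsewhere in the thread, the English stemmer and LowerCaseFilter leave Indic text alone anyway, so this per-token check is an optimization and a routing aid rather than a correctness requirement.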
>>>>>>>>>>>>>> I'm pretty sure people must have done similar/same things
>>>>>>>>>>>>>> earlier; I request them to guide me / point me to some
>>>>>>>>>>>>>> tutorials for the same. Else, help me out writing a custom
>>>>>>>>>>>>>> analyzer, only if that's not going to be too complex. LOL,
>>>>>>>>>>>>>> I'm a new user to lucene and know the basics of Java coding.
>>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --KK.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>>>>>> Yes, this is true. For starters, KK, it might be good to
>>>>>>>>>>>>>>> start up Solr and look at
>>>>>>>>>>>>>>> http://localhost:8983/solr/admin/analysis.jsp?highlight=on
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you want to stick with lucene, the WordDelimiterFilter
>>>>>>>>>>>>>>> is the piece you will want for your text, mainly for
>>>>>>>>>>>>>>> punctuation but also for format characters such as
>>>>>>>>>>>>>>> ZWJ/ZWNJ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>>>>>>>>>>>>>>>> You can also re-use the Solr analyzers, as far as I found
>>>>>>>>>>>>>>>> out. There is an issue in JIRA / a discussion on java-dev
>>>>>>>>>>>>>>>> to merge them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> Uwe Schindler
>>>>>>>>>>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>>>>>>>>>>> http://www.thetaphi.de
>>>>>>>>>>>>>>>> eMail: u...@thetaphi.de
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>>>>>>>>>>>>>>>> Sent: Thursday, June 04, 2009 1:18 PM
>>>>>>>>>>>>>>>>> To: java-user@lucene.apache.org
>>>>>>>>>>>>>>>>> Subject: Re: How to support stemming and case folding
>>>>>>>>>>>>>>>>> for english content mixed with non-english content?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> KK, OK, so you only really want to stem the English.
>>>>>>>>>>>>>>>>> This is good.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is it possible for you to consider using Solr? Solr's
>>>>>>>>>>>>>>>>> default analyzer for type 'text' will be good for your
>>>>>>>>>>>>>>>>> case. It will do the following:
>>>>>>>>>>>>>>>>> 1. tokenize on whitespace
>>>>>>>>>>>>>>>>> 2. handle both Indian-language and English punctuation
>>>>>>>>>>>>>>>>> 3. lowercase the English
>>>>>>>>>>>>>>>>> 4. stem the English
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Try a nightly build:
>>>>>>>>>>>>>>>>> http://people.apache.org/builds/lucene/solr/nightly/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> Muir, thanks for your response.
>>>>>>>>>>>>>>>>>> I'm indexing Indian-language web pages which have got a
>>>>>>>>>>>>>>>>>> decent amount of English content mixed in therein. For
>>>>>>>>>>>>>>>>>> the time being I'm not going to use any stemmers, as we
>>>>>>>>>>>>>>>>>> don't have standard stemmers for Indian languages. So
>>>>>>>>>>>>>>>>>> what I want to do is like this:
>>>>>>>>>>>>>>>>>> Say I've a web page having Hindi content with 5%
>>>>>>>>>>>>>>>>>> English content. Then for Hindi I want to use the basic
>>>>>>>>>>>>>>>>>> whitespace analyzer, as we don't have stemmers for
>>>>>>>>>>>>>>>>>> this, as I mentioned earlier, and wherever English
>>>>>>>>>>>>>>>>>> appears I want it to be stemmed, tokenized, etc. [the
>>>>>>>>>>>>>>>>>> standard process used for English content].
>>>>>>>>>>>>>>>>>> As of now I'm using the whitespace analyzer for the
>>>>>>>>>>>>>>>>>> full content, which doesn't support case folding,
>>>>>>>>>>>>>>>>>> stemming, etc. So if there is an English word, say
>>>>>>>>>>>>>>>>>> "Detection", indexed as such, then searching for
>>>>>>>>>>>>>>>>>> "detection" or "detect" is not giving any results,
>>>>>>>>>>>>>>>>>> which is the expected behavior, but I want these kinds
>>>>>>>>>>>>>>>>>> of queries to give results.
>>>>>>>>>>>>>>>>>> I hope I made it clear. Let me know any ideas on doing
>>>>>>>>>>>>>>>>>> the same. And one more thing: I'm storing the full
>>>>>>>>>>>>>>>>>> webpage content under a single field; I hope this will
>>>>>>>>>>>>>>>>>> not make any difference, right?
>>>>>>>>>>>>>>>>>> It seems I have to use language identifiers, but do we
>>>>>>>>>>>>>>>>>> really need that? Because we've only non-English
>>>>>>>>>>>>>>>>>> content mixed with English [and not French or Russian,
>>>>>>>>>>>>>>>>>> etc.].
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What is the best way of approaching the problem? Any
>>>>>>>>>>>>>>>>>> thoughts!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> KK.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> KK, is all of your Latin-script text actually English?
>>>>>>>>>>>>>>>>>>> Is there stuff like German or French mixed in?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And for your non-English content (your examples have
>>>>>>>>>>>>>>>>>>> been Indian writing systems), is it generally true
>>>>>>>>>>>>>>>>>>> that if you have Devanagari, you can assume it's
>>>>>>>>>>>>>>>>>>> Hindi? Or is there stuff like Marathi mixed in?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The reason I ask is that to invoke the right stemmers
>>>>>>>>>>>>>>>>>>> you really need some language detection, but perhaps
>>>>>>>>>>>>>>>>>>> in your case you can cheat and detect this based on
>>>>>>>>>>>>>>>>>>> scripts...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Robert
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>> I'm indexing some non-English content. But the page
>>>>>>>>>>>>>>>>>>>> also contains English content. As of now I'm using
>>>>>>>>>>>>>>>>>>>> WhitespaceAnalyzer for all content, and I'm storing
>>>>>>>>>>>>>>>>>>>> the full webpage content under a single field. Now we
>>>>>>>>>>>>>>>>>>>> require support for case folding and stemming for the
>>>>>>>>>>>>>>>>>>>> English content intermingled with the non-English
>>>>>>>>>>>>>>>>>>>> content. I must mention that we don't have stemming
>>>>>>>>>>>>>>>>>>>> and case folding for this non-English content. I'm
>>>>>>>>>>>>>>>>>>>> stuck with this. Someone do let me know how to
>>>>>>>>>>>>>>>>>>>> proceed for fixing this issue.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> KK.

--
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
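As a postscript, the chain recommended throughout the thread (whitespace tokenization, delimiter handling, lowercasing, stop-word removal, stemming) can be mimicked end-to-end in a few lines of plain Java. The punctuation-trimming regex and the trailing-"s" stemmer are toy stand-ins for WordDelimiterFilter and PorterStemFilter, but the sketch shows why the chain is safe for mixed content: none of the English-oriented steps alter a Devanagari token.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class PipelineSketch {
    static final Set<String> STOP = Set.of("the", "is", "of"); // tiny sample stop list

    // Toy stemmer stand-in for PorterStemFilter (illustration only).
    static String stem(String t) {
        return t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }

    // The recommended chain in miniature:
    //   whitespace tokenize -> trim punctuation (delimiter stand-in)
    //   -> lowercase -> stop-word removal -> stem.
    // A Devanagari token passes through untouched: \p{Punct} matches
    // only ASCII punctuation, toLowerCase is a no-op for Devanagari,
    // and the toy stemmer never matches.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(t -> t.replaceAll("^\\p{Punct}+|\\p{Punct}+$", ""))
                .filter(t -> !t.isEmpty())
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(t -> !STOP.contains(t))
                .map(PipelineSketch::stem)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // English gets lowercased, stop-filtered, and stemmed;
        // the Hindi word comes out exactly as it went in.
        System.out.println(analyze("Detection of \u0939\u093f\u0928\u094d\u0926\u0940 pages."));
    }
}
```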