Re: How to support stemming and case folding for english content mixed with non-english content?

Robert Muir Fri, 05 Jun 2009 05:19:06 -0700

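KK, you've got the right idea.

Though I think you should change the order: move the StopFilter before the
PorterStemFilter (and keep the LowerCaseFilter ahead of both), otherwise it
might not work correctly. The stop word list holds lowercase, unstemmed
words, so a stop word that the stemmer has already altered will no longer
match.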
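
To see why the order matters, here is a minimal, self-contained sketch in
plain Java. It does not use Lucene: toyStem is a hypothetical stand-in that
only strips a trailing "s" (it is not the Porter algorithm), and the stop
list and class name are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterOrderDemo {
    // Made-up sample stop list: lowercase, unstemmed words.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "was", "the"));

    // The word "Hindi" written in Devanagari, spelled with Unicode escapes.
    public static final String HINDI = "\u0939\u093f\u0928\u094d\u0926\u0940";

    // Toy stand-in for a stemmer (NOT Porter): strips one trailing "s"
    // from tokens made only of ASCII letters; other tokens pass through.
    static String toyStem(String token) {
        boolean asciiWord = token.matches("[A-Za-z]+");
        if (asciiWord && token.length() > 2
                && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Order from the quoted analyzer: stem, then lowercase, then stop.
    public static List<String> stemThenStop(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String t = toyStem(token).toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Suggested order: lowercase, then stop, then stem.
    public static List<String> stopThenStem(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String t = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(t)) {
                out.add(toyStem(t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This was THE Detections " + HINDI;
        System.out.println("stem first: " + stemThenStop(text));
        System.out.println("stop first: " + stopThenStem(text));
    }
}
```

With this toy input, the stem-first order lets "thi" and "wa" slip past the
stop list, while the stop-first order removes them. The Devanagari token
passes through both orders unchanged.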


On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.softw...@gmail.com> wrote:

> Thanks Robert. This is exactly what I did and it's working, but the delimiter
> is missing. I'm going to add that from solr-nightly.jar.
>
> /**
>  * Analyzer for Indian language.
>  */
> public class IndicAnalyzer extends Analyzer {
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream ts = new WhitespaceTokenizer(reader);
>     ts = new PorterStemFilter(ts);
>     ts = new LowerCaseFilter(ts);
>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>     return ts;
>   }
> }
>
> It's able to do stemming/case-folding and supports search for both English
> and Indic texts. Let me try out the delimiter. Will update you on that.
>
> Thanks a lot.
> KK
>
> On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
>
> > I think you are on the right track... once you build your analyzer, put it
> > in your classpath and play around with it in Luke and see if it does what
> > you want.
> >
> > On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.softw...@gmail.com> wrote:
> >
> > > Hi Robert,
> > > This is what I copied from ThaiAnalyzer @ lucene contrib
> > >
> > > public class ThaiAnalyzer extends Analyzer {
> > >   public TokenStream tokenStream(String fieldName, Reader reader) {
> > >     TokenStream ts = new StandardTokenizer(reader);
> > >     ts = new StandardFilter(ts);
> > >     ts = new ThaiWordFilter(ts);
> > >     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > >     return ts;
> > >   }
> > > }
> > >
> > > Now as you said, I have to use WhitespaceTokenizer with
> > > WordDelimiterFilter [from the Solr nightly jar], stop word removal,
> > > Porter stemmer etc., so it is something like this:
> > > public class IndicAnalyzer extends Analyzer {
> > >   public TokenStream tokenStream(String fieldName, Reader reader) {
> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> > >     ts = new WordDelimiterFilter(ts);
> > >     ts = new LowerCaseFilter(ts);
> > >     // english stop filter, is this the default one?
> > >     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > >     ts = new PorterStemFilter(ts);
> > >     return ts;
> > >   }
> > > }
> > >
> > > Does this sound OK? I think it will do the job... let me try it out.
> > > I don't think I need a custom filter for my requirements, at least not
> > > for these basic things I'm doing.
> > >
> > > Thanks,
> > > KK.
> > >
> > >
> > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> > >
> > > > KK, well, you can always get some good examples from the Lucene contrib
> > > > codebase.
> > > > For example, look at the DutchAnalyzer, especially:
> > > >
> > > > TokenStream tokenStream(String fieldName, Reader reader)
> > > >
> > > > See how it combines a specified tokenizer with various filters? This is
> > > > what you want to do, except of course you want to use a different
> > > > tokenizer and filters.
> > > >
> > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > >
> > > > > Thanks Muir.
> > > > > Thanks for letting me know that I don't need language identifiers.
> > > > > I'll have a look and will try to write the analyzer. For my case I
> > > > > think it won't be that difficult.
> > > > > BTW, can you point me to some sample code/tutorials on writing custom
> > > > > analyzers? I could not find anything in LIA 2nd Edn. Is there
> > > > > something there? Do let me know.
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcm...@gmail.com> wrote:
> > > > >
> > > > > > KK, for your case, you don't really need to go to the effort of
> > > > > > detecting whether fragments are English or not, because the English
> > > > > > stemmers in Lucene will not modify your Indic text, and neither will
> > > > > > the LowerCaseFilter.
> > > > > >
> > > > > > What you want to do is create a custom analyzer that works like this:
> > > > > >
> > > > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly jar],
> > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > > > >
> > > > > > Thanks,
> > > > > > Robert
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > > > >
> > > > > > > Thank you all.
> > > > > > > To be frank, I was using Solr in the beginning, half a month ago.
> > > > > > > The problem [rather, bug] with Solr was creation of a new index on
> > > > > > > the fly. They have a RESTful method for this, but it was not
> > > > > > > working. If I remember properly, one of the Solr committers,
> > > > > > > "Noble Paul" [I don't know his real name], was trying to help me.
> > > > > > > I tried many nightly builds, and spending a couple of days stuck
> > > > > > > at that made me think of Lucene, so I switched to it. Now, after
> > > > > > > working with Lucene, which gives you full control of everything,
> > > > > > > I don't want to switch to Solr. [LOL, to me Solr:Lucene is similar
> > > > > > > to Window$:Linux; it's my view only, though.] Coming back to the
> > > > > > > point: as Uwe mentioned, we can do in Lucene whatever is available
> > > > > > > in Solr, since Solr is based on Lucene only, right?
> > > > > > > I request Uwe to give me some more ideas on using the analyzers
> > > > > > > from Solr that will do the job for me, handling a mix of both
> > > > > > > English and non-English content.
> > > > > > > Muir, can you give me a bit more detail on how to use the
> > > > > > > WordDelimiterFilter to do my job?
> > > > > > > On a side note, I was thinking of writing a simple analyzer that
> > > > > > > will do the following:
> > > > > > > # If the webpage fragment is non-English [for me it's some Indian
> > > > > > > language], then index it as such, with no stemming/stop word
> > > > > > > removal to begin with. As far as I know, it's in UCN Unicode form,
> > > > > > > something like \u0021\u0012\u34ae\u0031 [just a sample].
> > > > > > > # If the fragment is English, then apply the standard analysis
> > > > > > > process for English content. I've not thought about querying in
> > > > > > > the same way as of now, i.e. with a mix of non-English and English
> > > > > > > words.
> > > > > > > Now to get all this,
> > > > > > >  #1. I need some way to know whether the content is English or
> > > > > > > not. If it is not English, just add the tokens to the document. Do
> > > > > > > we really need language identifiers, as I don't have any other
> > > > > > > content that uses the same script as English, other than those
> > > > > > > \u1234 things for my Indian language content? Any smart hack/trick
> > > > > > > for the same?
> > > > > > >  #2. If it's English, apply all the normal processing and add the
> > > > > > > stemmed tokens to the document.
> > > > > > > For all this I was thinking of iterating over each word of the web
> > > > > > > page and applying the above procedure, and finally adding the newly
> > > > > > > created document to the index.
> > > > > > >
> > > > > > > I would like someone to guide me in this direction. I'm pretty
> > > > > > > sure people must have done similar/same things earlier; I request
> > > > > > > them to guide me/point me to some tutorials for the same.
> > > > > > > Else help me out writing a custom analyzer, only if that's not
> > > > > > > going to be too complex. LOL, I'm a new user of Lucene and know
> > > > > > > the basics of Java coding.
> > > > > > > Thank you very much.
> > > > > > >
> > > > > > > --KK.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Yes, this is true. For starters, KK, it might be good to start
> > > > > > > > up Solr and look at
> > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > > >
> > > > > > > > If you want to stick with Lucene, the WordDelimiterFilter is the
> > > > > > > > piece you will want for your text, mainly for punctuation but
> > > > > > > > also for format characters such as ZWJ/ZWNJ.
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > > > > >
> > > > > > > > > You can also re-use the Solr analyzers, as far as I found out.
> > > > > > > > > There is an issue in JIRA/discussion on java-dev to merge them.
> > > > > > > > >
> > > > > > > > > -----
> > > > > > > > > Uwe Schindler
> > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > http://www.thetaphi.de
> > > > > > > > > eMail: u...@thetaphi.de
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Robert Muir [mailto:rcm...@gmail.com]
> > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > Subject: Re: How to support stemming and case folding for english content mixed with non-english content?
> > > > > > > > > >
> > > > > > > > > > KK, OK, so you only really want to stem the English. This is
> > > > > > > > > > good.
> > > > > > > > > >
> > > > > > > > > > Is it possible for you to consider using Solr? Solr's default
> > > > > > > > > > analyzer for type 'text' will be good for your case. It will
> > > > > > > > > > do the following:
> > > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > > 2. handle both Indian language and English punctuation
> > > > > > > > > > 3. lowercase the English
> > > > > > > > > > 4. stem the English
> > > > > > > > > >
> > > > > > > > > > Try a nightly build:
> > > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > > I'm indexing Indian language web pages which have a decent
> > > > > > > > > > > amount of English content mixed in. For the time being I'm
> > > > > > > > > > > not going to use any stemmers for the Indian languages, as
> > > > > > > > > > > we don't have standard stemmers for them. So what I want
> > > > > > > > > > > to do is this: say I have a web page with Hindi content
> > > > > > > > > > > and 5% English content. For the Hindi I want to use the
> > > > > > > > > > > basic whitespace analyzer, as we don't have stemmers for
> > > > > > > > > > > it as I mentioned earlier, and wherever English appears I
> > > > > > > > > > > want it to be tokenized, stemmed etc. [the standard
> > > > > > > > > > > process used for English content]. As of now I'm using the
> > > > > > > > > > > whitespace analyzer for the full content, which does not
> > > > > > > > > > > support case folding, stemming etc. So if an English word,
> > > > > > > > > > > say "Detection", is indexed as such, then searching for
> > > > > > > > > > > detection or detect does not give any results, which is
> > > > > > > > > > > the expected behavior, but I want these kinds of queries
> > > > > > > > > > > to give results.
> > > > > > > > > > > I hope I made it clear. Let me know any ideas on doing the
> > > > > > > > > > > same. And one more thing: I'm storing the full webpage
> > > > > > > > > > > content under a single field; I hope this will not make
> > > > > > > > > > > any difference, right?
> > > > > > > > > > > It seems I have to use language identifiers, but do we
> > > > > > > > > > > really need that? Because we have only non-English content
> > > > > > > > > > > mixed with English [and not French or Russian etc.].
> > > > > > > > > > >
> > > > > > > > > > > What is the best way of approaching the problem? Any
> > > > > > > > > > > thoughts?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > KK.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcm...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > KK, is all of your Latin script text actually English?
> > > > > > > > > > > > Is there stuff like German or French mixed in?
> > > > > > > > > > > >
> > > > > > > > > > > > And for your non-English content (your examples have
> > > > > > > > > > > > been Indian writing systems), is it generally true that
> > > > > > > > > > > > if you had Devanagari, you can assume it's Hindi? Or is
> > > > > > > > > > > > there stuff like Marathi mixed in?
> > > > > > > > > > > >
> > > > > > > > > > > > The reason I ask is that to invoke the right stemmers,
> > > > > > > > > > > > you really need some language detection, but perhaps in
> > > > > > > > > > > > your case you can cheat and detect this based on
> > > > > > > > > > > > scripts...
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Robert
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > I'm indexing some non-English content, but the pages
> > > > > > > > > > > > > also contain English content. As of now I'm using
> > > > > > > > > > > > > WhitespaceAnalyzer for all content, and I'm storing
> > > > > > > > > > > > > the full webpage content under a single field. Now we
> > > > > > > > > > > > > need to support case folding and stemming for the
> > > > > > > > > > > > > English content intermingled with the non-English
> > > > > > > > > > > > > content. I must mention that we don't have stemming
> > > > > > > > > > > > > and case folding for the non-English content. I'm
> > > > > > > > > > > > > stuck with this. Could someone let me know how to
> > > > > > > > > > > > > proceed in fixing this issue?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > KK.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Robert Muir
> > > > > > > > > > > > rcm...@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Robert Muir
> > > > > > > > > > rcm...@gmail.com
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail:
> > > java-user-unsubscr...@lucene.apache.org
> > > > > > > > > For additional commands, e-mail:
> > > > java-user-h...@lucene.apache.org
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Robert Muir
> > > > > > > > rcm...@gmail.com
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcm...@gmail.com
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Robert Muir
> > > > rcm...@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>



-- 
Robert Muir
rcm...@gmail.com
