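KK, you've got the right idea, though I think you'll want to change the order: move the StopFilter before the PorterStemFilter... otherwise it might not work correctly, since the stemmer can turn a stop word into a form the stop list no longer matches (e.g. "this" becomes "thi").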
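To make the ordering point concrete, here is a small plain-Java sketch that runs without Lucene on the classpath. Everything in it is made up for illustration: the class FilterOrderDemo, the five-word stop list, and toyStem, which only strips a trailing "s" and is a stand-in for the Porter stemmer, not an implementation of it. stemFirst mirrors the order in the analyzer quoted below (stem, lowercase, stop); stopFirst is the order I am suggesting (lowercase, stop, stem).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene classes.
final class FilterOrderDemo {

    // Illustrative subset of an English stop list (an assumption, not Lucene's list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "was", "the", "is", "of"));

    // Splits on runs of whitespace, the way a whitespace tokenizer would.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Lowercases every token; text in scripts without case is left unchanged.
    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // Drops tokens that exactly match an entry in STOP_WORDS.
    static List<String> removeStopWords(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // Toy stand-in for a stemmer: strips one trailing "s" (but not "ss").
    static String toyStem(String t) {
        if (t.length() > 2 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    static List<String> stem(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(toyStem(t));
        return out;
    }

    // Order used in the quoted analyzer: stem, then lowercase, then stop.
    static List<String> stemFirst(String text) {
        return removeStopWords(lowerCase(stem(tokenize(text))));
    }

    // Suggested order: lowercase, then stop, then stem.
    static List<String> stopFirst(String text) {
        return stem(removeStopWords(lowerCase(tokenize(text))));
    }

    public static void main(String[] args) {
        // The last token is a Devanagari word written with Unicode escapes.
        String text = "This was the Detection of errors \u0916\u094b\u091c";
        // "This" and "was" slip past the stop list as "thi" and "wa".
        System.out.println(stemFirst(text));
        // Stop words are removed before stemming; the Devanagari token is untouched.
        System.out.println(stopFirst(text));
    }
}
```

In the real analyzer the same idea would mean chaining LowerCaseFilter, then StopFilter, then PorterStemFilter after the tokenizer.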
On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.softw...@gmail.com> wrote:

> Thanks Robert. This is exactly what I did and it's working, but the delimiter
> is missing. I'm going to add that from solr-nightly.jar.
>
> /**
>  * Analyzer for Indian language.
>  */
> public class IndicAnalyzer extends Analyzer {
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream ts = new WhitespaceTokenizer(reader);
>     ts = new PorterStemFilter(ts);
>     ts = new LowerCaseFilter(ts);
>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>     return ts;
>   }
> }
>
> It's able to do stemming/case-folding and supports search for both english
> and indic texts. Let me try out the delimiter. Will update you on that.
>
> Thanks a lot.
> KK
>
> On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
>
> > i think you are on the right track... once you build your analyzer, put it
> > in your classpath and play around with it in luke and see if it does what
> > you want.
> >
> > On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.softw...@gmail.com> wrote:
> >
> > > Hi Robert,
> > > This is what I copied from ThaiAnalyzer @ lucene contrib
> > >
> > > public class ThaiAnalyzer extends Analyzer {
> > >   public TokenStream tokenStream(String fieldName, Reader reader) {
> > >     TokenStream ts = new StandardTokenizer(reader);
> > >     ts = new StandardFilter(ts);
> > >     ts = new ThaiWordFilter(ts);
> > >     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > >     return ts;
> > >   }
> > > }
> > >
> > > Now as you said, I've to use WhitespaceTokenizer with WordDelimiterFilter
> > > [solr nightly.jar], stop word removal, porter stemmer etc., so it is
> > > something like this,
> > >
> > > public class IndicAnalyzer extends Analyzer {
> > >   public TokenStream tokenStream(String fieldName, Reader reader) {
> > >     TokenStream ts = new WhitespaceTokenizer(reader);
> > >     ts = new WordDelimiterFilter(ts);
> > >     ts = new LowerCaseFilter(ts);
> > >     // english stop filter, is this the default one?
> > >     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
> > >     ts = new PorterStemFilter(ts);
> > >     return ts;
> > >   }
> > > }
> > >
> > > Does this sound OK? I think it will do the job... let me try it out.
> > > I don't need a custom filter as per my requirement, at least not for these
> > > basic things I'm doing? I think so...
> > >
> > > Thanks,
> > > KK.
> > >
> > > On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> > >
> > > > KK well you can always get some good examples from the lucene contrib
> > > > codebase.
> > > > For example, look at the DutchAnalyzer, especially:
> > > >
> > > > TokenStream tokenStream(String fieldName, Reader reader)
> > > >
> > > > See how it combines a specified tokenizer with various filters? this is
> > > > what you want to do, except of course you want to use different
> > > > tokenizer and filters.
> > > >
> > > > On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > >
> > > > > Thanks Muir.
> > > > > Thanks for letting me know that I don't need language identifiers.
> > > > > I'll have a look and will try to write the analyzer. For my case I
> > > > > think it won't be that difficult.
> > > > > BTW, can you point me to some sample code/tutorials on writing custom
> > > > > analyzers? I could not find something in LIA2ndEdn. Is something
> > > > > there? Do let me know.
> > > > >
> > > > > Thanks,
> > > > > KK.
> > > > >
> > > > > On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcm...@gmail.com> wrote:
> > > > >
> > > > > > KK, for your case, you don't really need to go to the effort of
> > > > > > detecting whether fragments are english or not.
> > > > > > Because the English stemmers in lucene will not modify your Indic
> > > > > > text, and neither will the LowerCaseFilter.
> > > > > >
> > > > > > what you want to do is create a custom analyzer that works like this
> > > > > >
> > > > > > -WhitespaceTokenizer with WordDelimiterFilter [from Solr nightly jar],
> > > > > > LowerCaseFilter, StopFilter, and PorterStemFilter-
> > > > > >
> > > > > > Thanks,
> > > > > > Robert
> > > > > >
> > > > > > On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > > > >
> > > > > > > Thank you all.
> > > > > > > To be frank I was using Solr in the beginning, half a month ago. The
> > > > > > > problem [rather bug] with solr was creation of a new index on the fly.
> > > > > > > Though they have a restful method for the same, it was not working. If I
> > > > > > > remember properly one of the Solr committers, "Noble Paul" [I don't know
> > > > > > > his real name], was trying to help me. I tried many nightly builds, and
> > > > > > > spending a couple of days stuck at that made me think of lucene and I
> > > > > > > switched to it. Now after working with lucene, which gives you full
> > > > > > > control of everything, I don't want to switch to Solr. [LOL, to me
> > > > > > > Solr:Lucene is similar to Window$:Linux, it's my view only, though].
> > > > > > > Coming back to the point: as Uwe mentioned, we can do the same thing in
> > > > > > > lucene as well, what is available in Solr. Solr is based on Lucene only,
> > > > > > > right?
> > > > > > > I request Uwe to give me some more ideas on using the analyzers from
> > > > > > > solr that will do the job for me, handling a mix of both english and
> > > > > > > non-english content.
> > > > > > > Muir, can you give me a bit more detailed description of how to use the
> > > > > > > WordDelimiterFilter to do my job.
> > > > > > > On a side note, I was thinking of writing a simple analyzer that will
> > > > > > > do the following,
> > > > > > > #. If the webpage fragment is non-english [for me it's some indian
> > > > > > > language] then index it as such, no stemming/stop word removal to begin
> > > > > > > with. As far as I know it's in UCN unicode, something like
> > > > > > > \u0021\u0012\u34ae\u0031 [just a sample]
> > > > > > > #. If the fragment is english then apply the standard analyzing process
> > > > > > > for english content. I've not thought of querying in the same way as of
> > > > > > > now, i.e. a mix of non-english and english words.
> > > > > > > Now to get all this,
> > > > > > > #1. I need some sort of way which will let me know if the content is
> > > > > > > english or not. If not english, just add the tokens to the document. Do
> > > > > > > we really need language identifiers, as I don't have any other content
> > > > > > > that uses the same script as english other than those \u1234 things for
> > > > > > > my indian language content? Any smart hack/trick for the same?
> > > > > > > #2. If it's english, apply all the normal processing and add the stemmed
> > > > > > > token to the document.
> > > > > > > For all this I was thinking of iterating over each word of the web page
> > > > > > > and applying the above procedure, and finally adding the newly created
> > > > > > > document to the index.
> > > > > > >
> > > > > > > I would like someone to guide me in this direction. I'm pretty sure
> > > > > > > people must have done a similar/same thing earlier. I request them to
> > > > > > > guide me/point me to some tutorials for the same.
> > > > > > > Else help me out writing a custom analyzer, only if that's not going to
> > > > > > > be too complex. LOL, I'm a new user to lucene and know the basics of
> > > > > > > Java coding.
> > > > > > > Thank you very much.
> > > > > > >
> > > > > > > --KK.
> > > > > > >
> > > > > > > On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
> > > > > > >
> > > > > > > > yes this is true. for starters KK, might be good to start up solr and
> > > > > > > > look at
> > > > > > > > http://localhost:8983/solr/admin/analysis.jsp?highlight=on
> > > > > > > >
> > > > > > > > if you want to stick with lucene, the WordDelimiterFilter is the piece
> > > > > > > > you will want for your text, mainly for punctuation but also for format
> > > > > > > > characters such as ZWJ/ZWNJ.
> > > > > > > >
> > > > > > > > On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > > > > >
> > > > > > > > > You can also re-use the solr analyzers, as far as I found out. There
> > > > > > > > > is an issue in JIRA/discussion on java-dev to merge them.
> > > > > > > > >
> > > > > > > > > -----
> > > > > > > > > Uwe Schindler
> > > > > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > > > > http://www.thetaphi.de
> > > > > > > > > eMail: u...@thetaphi.de
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Robert Muir [mailto:rcm...@gmail.com]
> > > > > > > > > > Sent: Thursday, June 04, 2009 1:18 PM
> > > > > > > > > > To: java-user@lucene.apache.org
> > > > > > > > > > Subject: Re: How to support stemming and case folding for english
> > > > > > > > > > content mixed with non-english content?
> > > > > > > > > >
> > > > > > > > > > KK, ok, so you only really want to stem the english. This is good.
> > > > > > > > > >
> > > > > > > > > > Is it possible for you to consider using solr? solr's default
> > > > > > > > > > analyzer for type 'text' will be good for your case. it will do the
> > > > > > > > > > following
> > > > > > > > > > 1. tokenize on whitespace
> > > > > > > > > > 2. handle both indian language and english punctuation
> > > > > > > > > > 3. lowercase the english.
> > > > > > > > > > 4. stem the english.
> > > > > > > > > >
> > > > > > > > > > try a nightly build,
> > > > > > > > > > http://people.apache.org/builds/lucene/solr/nightly/
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Muir, thanks for your response.
> > > > > > > > > > > I'm indexing indian language web pages which have a decent amount
> > > > > > > > > > > of english content mixed in. For the time being I'm not going to
> > > > > > > > > > > use any stemmers, as we don't have standard stemmers for indian
> > > > > > > > > > > languages. So what I want to do is like this,
> > > > > > > > > > > Say I've a web page having hindi content with 5% english content.
> > > > > > > > > > > Then for hindi I want to use the basic whitespace analyzer, as we
> > > > > > > > > > > don't have stemmers for this as I mentioned earlier, and wherever
> > > > > > > > > > > english appears I want it to be stemmed, tokenized etc. [the
> > > > > > > > > > > standard process used for english content]. As of now I'm using
> > > > > > > > > > > the whitespace analyzer for the full content, which does not
> > > > > > > > > > > support case folding, stemming etc. for the content. So if there
> > > > > > > > > > > is an english word, say "Detection", indexed as such, then
> > > > > > > > > > > searching for detection or detect is not giving any results, which
> > > > > > > > > > > is the expected behavior, but I want this kind of query to give
> > > > > > > > > > > results.
> > > > > > > > > > > I hope I made it clear. Let me know any ideas on doing the same.
> > > > > > > > > > > And one more thing, I'm storing the full webpage content under a
> > > > > > > > > > > single field, I hope this will not make any difference, right?
> > > > > > > > > > > It seems I've to use language identifiers, but do we really need
> > > > > > > > > > > that? Because we've only non-english content mixed with english
> > > > > > > > > > > [and not french or russian etc.].
> > > > > > > > > > >
> > > > > > > > > > > What is the best way of approaching the problem? Any thoughts!
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > KK.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcm...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > KK, is all of your latin script text actually english? Is there
> > > > > > > > > > > > stuff like german or french mixed in?
> > > > > > > > > > > >
> > > > > > > > > > > > And for your non-english content (your examples have been indian
> > > > > > > > > > > > writing systems), is it generally true that if you had
> > > > > > > > > > > > devanagari, you can assume it's hindi? or is there stuff like
> > > > > > > > > > > > marathi mixed in?
> > > > > > > > > > > >
> > > > > > > > > > > > Reason I say this is that to invoke the right stemmers, you
> > > > > > > > > > > > really need some language detection, but perhaps in your case
> > > > > > > > > > > > you can cheat and detect this based on scripts...
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Robert
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.softw...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > I'm indexing some non-english content. But the page also
> > > > > > > > > > > > > contains english content. As of now I'm using
> > > > > > > > > > > > > WhitespaceAnalyzer for all content and I'm storing the full
> > > > > > > > > > > > > webpage content under a single field. Now we need to support
> > > > > > > > > > > > > case folding and stemming for the english content
> > > > > > > > > > > > > intermingled with non-english content. I must mention that we
> > > > > > > > > > > > > don't have stemming and case folding for the non-english
> > > > > > > > > > > > > content. I'm stuck with this. Someone do let me know how to
> > > > > > > > > > > > > proceed for fixing this issue.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > KK.

--
Robert Muir
rcm...@gmail.com