KK, can you give me an example of some Indian text for which it is doing this?
Thanks!

On Mon, Jun 8, 2009 at 1:03 AM, KK <dioxide.softw...@gmail.com> wrote:
> Hi Robert,
> The problem is that WordDelimiterFilter is doing its job for English
> content, but for non-English Indian content, which is Unicode, it
> highlights the searched word and, along with that, it also highlights the
> individual characters of that word, which was not happening without
> WordDelimiterFilter; that's my concern. Say, for example, I searched for a
> Hindi word, say "xyz ab" [assume these are in Hindi]; then in the search
> results it highlights these words, but it also highlights x/y/z/a/b
> wherever these letters appear, which obviously looks bad. It should only
> highlight words, not the letters therein. I hope I made it clear. What
> could be the reason for this? Any ideas on fixing the same?
>
> Thanks,
> KK
>
> On Sat, Jun 6, 2009 at 9:45 PM, Robert Muir <rcm...@gmail.com> wrote:
>> KK, I haven't had that experience with WordDelimiterFilter on Indian
>> languages. Is it possible you could provide me an example of how it's
>> creating a nuisance?
>>
>> On Sat, Jun 6, 2009 at 9:42 AM, KK <dioxide.softw...@gmail.com> wrote:
>>> Robert, I tried to use WordDelimiterFilter from solr-nightly by putting
>>> it in my working directory for this project. I set the parameters as
>>> you told me. I must accept that it's splitting words around those chars
>>> [like . @ etc.], but along with that it's messing with other
>>> non-English/Unicode content, and that's creating a nuisance. I don't
>>> want WordDelimiterFilter to fiddle around with my non-English content.
>>> This is what I'm doing:
>>>
>>> /**
>>>  * Analyzer for Indian language.
>>>  */
>>> public class IndicAnalyzer extends Analyzer {
>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>     TokenStream ts = new WhitespaceTokenizer(reader);
>>>     ts = new WordDelimiterFilter(ts, 1, 1, 1, 1, 0);
>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>     ts = new LowerCaseFilter(ts);
>>>     ts = new PorterStemFilter(ts);
>>>     return ts;
>>>   }
>>> }
>>>
>>> I have to use the deprecated API for setting the 5 values; that's fine,
>>> but somehow it's messing with Unicode content. How do I get rid of
>>> that? Any thoughts? It seems setting those values in some proper way
>>> might fix the problem; I'm not sure, though.
>>>
>>> Thanks,
>>> KK.
>>>
>>> On Fri, Jun 5, 2009 at 7:37 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>> KK, an easier solution to your first problem is to use
>>>> WordDelimiterFilterFactory if possible... you can get an instance of
>>>> WordDelimiterFilter from that.
>>>>
>>>> Thanks,
>>>> Robert
>>>>
>>>> On Fri, Jun 5, 2009 at 10:06 AM, Robert Muir <rcm...@gmail.com> wrote:
>>>>> KK, as for your first issue, that WordDelimiterFilter is package
>>>>> protected; one option is to make a copy of the code and change the
>>>>> class declaration to public. The other option is to put your entire
>>>>> analyzer in the 'org.apache.solr.analysis' package so that you can
>>>>> access it...
>>>>>
>>>>> For the 2nd issue, yes, you need to supply some options to it. The
>>>>> default options Solr applies to type 'text' seemed to work well for
>>>>> me with Indic:
>>>>>
>>>>> {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
>>>>> generateWordParts=1, catenateAll=0, catenateNumbers=1}
>>>>>
>>>>> On Fri, Jun 5, 2009 at 9:12 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>> Thanks Robert. There is one problem, though, with plugging in the
>>>>>> WordDelimiterFilter from the solr-nightly jar file.
>>>>>> When I tried to do something like:
>>>>>>
>>>>>> TokenStream ts = new WhitespaceTokenizer(reader);
>>>>>> ts = new WordDelimiterFilter(ts);
>>>>>> ts = new PorterStemmerFilter(ts);
>>>>>> ...rest as in the last mail...
>>>>>>
>>>>>> it gave me an error saying:
>>>>>>
>>>>>> org.apache.solr.analysis.WordDelimiterFilter is not public in
>>>>>> org.apache.solr.analysis; cannot be accessed from outside package
>>>>>> import org.apache.solr.analysis.WordDelimiterFilter;
>>>>>> ^
>>>>>> solrSearch/IndicAnalyzer.java:38: cannot find symbol
>>>>>> symbol  : class WordDelimiterFilter
>>>>>> location: class solrSearch.IndicAnalyzer
>>>>>> ts = new WordDelimiterFilter(ts);
>>>>>> ^
>>>>>> 2 errors
>>>>>>
>>>>>> Then I tried to look at the code for WordDelimiterFilter in the
>>>>>> solr-nightly src and found that there are many deprecated
>>>>>> constructors, though they require a lot of parameters along with the
>>>>>> TokenStream. I went through the Solr wiki for
>>>>>> WordDelimiterFilterFactory here:
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
>>>>>> and saw that there, too, it's specified that we have to mention the
>>>>>> parameters, and they are different for indexing and querying.
>>>>>> I'm kind of stuck here: how do I make use of WordDelimiterFilter in
>>>>>> my custom analyzer? I have to use it anyway. In my code I have to
>>>>>> make use of WordDelimiterFilter and not WordDelimiterFilterFactory,
>>>>>> right? I don't know what the use of the other one is. Anyway, can
>>>>>> you guide me in getting rid of the above error? And yes, I'll change
>>>>>> the order of applying the filters as you said.
>>>>>>
>>>>>> Thanks,
>>>>>> KK.
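A plain-Java sketch of the subword splitting being discussed (a simplified, hypothetical stand-in for WordDelimiterFilter, not the real Solr code) shows where single-character behavior can come from: if combining marks such as the Devanagari matras are not treated as word characters, an Indic token shatters into single letters.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSketch {
    // Hypothetical, simplified stand-in for WordDelimiterFilter's subword
    // splitting (NOT the real Solr code). A token is cut wherever a
    // non-word character appears. Whether Devanagari matras (combining
    // marks) count as word characters decides whether an Indic token
    // survives intact or shatters into single letters.
    static List<String> split(String token, boolean keepCombiningMarks) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : token.toCharArray()) {
            int type = Character.getType(c);
            boolean mark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK;
            boolean wordChar = Character.isLetterOrDigit(c)
                    || (keepCombiningMarks && mark);
            if (wordChar) {
                cur.append(c);
            } else if (cur.length() > 0) {
                parts.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }

    public static void main(String[] args) {
        String hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"; // a Hindi word in Devanagari
        System.out.println(split("mail@host.com", true)); // [mail, host, com]
        System.out.println(split(hindi, true));  // one whole token
        System.out.println(split(hindi, false)); // three single letters: the reported symptom
    }
}
```

If the real analysis chain ends up indexing such single-letter subword tokens, the highlighter will happily mark them wherever they occur, which matches the symptom KK describes.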
>>>>>> On Fri, Jun 5, 2009 at 5:48 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>> KK, you got the right idea.
>>>>>>>
>>>>>>> Though I think you might want to change the order: move the
>>>>>>> StopFilter before the PorterStemFilter... otherwise it might not
>>>>>>> work correctly.
>>>>>>>
>>>>>>> On Fri, Jun 5, 2009 at 8:05 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>> Thanks Robert. This is exactly what I did, and it's working, but
>>>>>>>> the delimiter is missing; I'm going to add that from
>>>>>>>> solr-nightly.jar.
>>>>>>>>
>>>>>>>> /**
>>>>>>>>  * Analyzer for Indian language.
>>>>>>>>  */
>>>>>>>> public class IndicAnalyzer extends Analyzer {
>>>>>>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>     TokenStream ts = new WhitespaceTokenizer(reader);
>>>>>>>>     ts = new PorterStemFilter(ts);
>>>>>>>>     ts = new LowerCaseFilter(ts);
>>>>>>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>>>>>>     return ts;
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> It's able to do stemming/case-folding and supports search for both
>>>>>>>> English and Indic texts. Let me try out the delimiter; will update
>>>>>>>> you on that.
>>>>>>>>
>>>>>>>> Thanks a lot,
>>>>>>>> KK
>>>>>>>>
>>>>>>>> On Fri, Jun 5, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>> I think you are on the right track... once you build your
>>>>>>>>> analyzer, put it in your classpath and play around with it in
>>>>>>>>> Luke and see if it does what you want.
>>>>>>>>>
>>>>>>>>> On Fri, Jun 5, 2009 at 3:19 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>> Hi Robert,
>>>>>>>>>> This is what I copied from the ThaiAnalyzer @ lucene contrib:
>>>>>>>>>>
>>>>>>>>>> public class ThaiAnalyzer extends Analyzer {
>>>>>>>>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>>>     TokenStream ts = new StandardTokenizer(reader);
>>>>>>>>>>     ts = new StandardFilter(ts);
>>>>>>>>>>     ts = new ThaiWordFilter(ts);
>>>>>>>>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>>>>>>>>     return ts;
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Now, as you said, I have to use WhitespaceTokenizer with
>>>>>>>>>> WordDelimiterFilter [solr-nightly.jar], stop-word removal, the
>>>>>>>>>> Porter stemmer, etc., so it is something like this:
>>>>>>>>>>
>>>>>>>>>> public class IndicAnalyzer extends Analyzer {
>>>>>>>>>>   public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>>>     TokenStream ts = new WhitespaceTokenizer(reader);
>>>>>>>>>>     ts = new WordDelimiterFilter(ts);
>>>>>>>>>>     ts = new LowerCaseFilter(ts);
>>>>>>>>>>     // English stop filter -- is this the default one?
>>>>>>>>>>     ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
>>>>>>>>>>     ts = new PorterStemFilter(ts);
>>>>>>>>>>     return ts;
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Does this sound OK? I think it will do the job... let me try it
>>>>>>>>>> out. I don't need a custom filter as per my requirements, at
>>>>>>>>>> least not for these basic things I'm doing? I think so...
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> KK.
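The ordering advice (LowerCaseFilter and StopFilter before PorterStemFilter) can be demonstrated with a toy pipeline in plain Java. The `stem` method below is a hypothetical stand-in that only strips a trailing "s"; the real Porter stemmer does at least that much (it reduces "was" to "wa" and "this" to "thi"), which is exactly why a stop filter placed after it stops matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class FilterOrder {
    // A few entries from an English stop list; "this" and "was" are
    // among StopAnalyzer.ENGLISH_STOP_WORDS.
    static final Set<String> STOP = Set.of("this", "was", "the");

    // Toy stand-in for the Porter stemmer: strip a trailing "s".
    static String stem(String t) {
        return t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }

    // LowerCase -> Stop -> Stem: stop words are removed before the
    // stemmer can mangle them.
    static List<String> goodOrder(List<String> tokens) {
        return tokens.stream()
                .map(String::toLowerCase)
                .filter(t -> !STOP.contains(t))
                .map(FilterOrder::stem)
                .collect(Collectors.toList());
    }

    // Stem first: "was" has already become "wa" by the time the stop
    // filter runs, so it slips through into the index.
    static List<String> badOrder(List<String> tokens) {
        return tokens.stream()
                .map(FilterOrder::stem)
                .map(String::toLowerCase)
                .filter(t -> !STOP.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("This", "was", "Detection");
        System.out.println(goodOrder(in)); // [detection]
        System.out.println(badOrder(in));  // [thi, wa, detection]
    }
}
```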
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 4, 2009 at 6:36 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>> KK, well, you can always get some good examples from the lucene
>>>>>>>>>>> contrib codebase. For example, look at the DutchAnalyzer,
>>>>>>>>>>> especially:
>>>>>>>>>>>
>>>>>>>>>>> TokenStream tokenStream(String fieldName, Reader reader)
>>>>>>>>>>>
>>>>>>>>>>> See how it combines a specified tokenizer with various filters?
>>>>>>>>>>> This is what you want to do, except of course you want to use a
>>>>>>>>>>> different tokenizer and filters.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 4, 2009 at 8:53 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>> Thanks Muir.
>>>>>>>>>>>> Thanks for letting me know that I don't need language
>>>>>>>>>>>> identifiers. I'll have a look and will try to write the
>>>>>>>>>>>> analyzer. For my case I think it won't be that difficult.
>>>>>>>>>>>> BTW, can you point me to some sample code/tutorials on writing
>>>>>>>>>>>> custom analyzers? I could not find anything in LIA 2nd Edn. Is
>>>>>>>>>>>> something there? Do let me know.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> KK.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 4, 2009 at 6:19 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>>>> KK, for your case, you don't really need to go to the effort
>>>>>>>>>>>>> of detecting whether fragments are English or not.
>>>>>>>>>>>>> Because the English stemmers in lucene will not modify your
>>>>>>>>>>>>> Indic text, and neither will the LowerCaseFilter.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What you want to do is create a custom analyzer that works
>>>>>>>>>>>>> like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> WhitespaceTokenizer with WordDelimiterFilter [from Solr
>>>>>>>>>>>>> nightly jar], LowerCaseFilter, StopFilter, and
>>>>>>>>>>>>> PorterStemFilter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Robert
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 8:28 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>>>> Thank you all.
>>>>>>>>>>>>>> To be frank, I was using Solr in the beginning, half a month
>>>>>>>>>>>>>> ago. The problem [rather, bug] with Solr was creation of a
>>>>>>>>>>>>>> new index on the fly. Though they have a RESTful method for
>>>>>>>>>>>>>> the same, it was not working. If I remember properly, one of
>>>>>>>>>>>>>> the Solr committers, "Noble Paul" [I don't know his real
>>>>>>>>>>>>>> name], was trying to help me. I tried many nightly builds,
>>>>>>>>>>>>>> and spending a couple of days stuck at that made me think of
>>>>>>>>>>>>>> lucene, and I switched to it.
>>>>>>>>>>>>>> Now, after working with lucene, which gives you full control
>>>>>>>>>>>>>> of everything, I don't want to switch to Solr. [LOL, to me
>>>>>>>>>>>>>> Solr:Lucene is similar to Window$:Linux; it's my view only,
>>>>>>>>>>>>>> though.] Coming back to the point: as Uwe mentioned, we can
>>>>>>>>>>>>>> do the same thing in lucene as well that is available in
>>>>>>>>>>>>>> Solr; Solr is based on Lucene only, right?
>>>>>>>>>>>>>> I request Uwe to give me some more ideas on using the
>>>>>>>>>>>>>> analyzers from Solr that will do the job for me, handling a
>>>>>>>>>>>>>> mix of both English and non-English content.
>>>>>>>>>>>>>> Muir, can you give me a bit more detailed description of how
>>>>>>>>>>>>>> to use the WordDelimiterFilter to do my job?
>>>>>>>>>>>>>> On a side note, I was thinking of writing a simple analyzer
>>>>>>>>>>>>>> that will do the following:
>>>>>>>>>>>>>> # If the webpage fragment is non-English [for me it's some
>>>>>>>>>>>>>> Indian language], then index it as such; no stemming/
>>>>>>>>>>>>>> stop-word removal to begin with. As I know, it's in UCN
>>>>>>>>>>>>>> Unicode, something like \u0021\u0012\u34ae\u0031 [just a
>>>>>>>>>>>>>> sample].
>>>>>>>>>>>>>> # If the fragment is English, then apply the standard
>>>>>>>>>>>>>> analyzing process for English content. I've not thought of
>>>>>>>>>>>>>> querying in the same way as of now, i.e. a mix of
>>>>>>>>>>>>>> non-English and English words.
>>>>>>>>>>>>>> Now, to get all this:
>>>>>>>>>>>>>> #1. I need some sort of way which will let me know if the
>>>>>>>>>>>>>> content is English or not. If not English, just add the
>>>>>>>>>>>>>> tokens to the document. Do we really need language
>>>>>>>>>>>>>> identifiers, as I don't have any other content that uses the
>>>>>>>>>>>>>> same script as English, other than those \u1234 things for
>>>>>>>>>>>>>> my Indian-language content? Any smart hack/trick for the
>>>>>>>>>>>>>> same?
>>>>>>>>>>>>>> #2. If it's English, apply all the normal processing and add
>>>>>>>>>>>>>> the stemmed tokens to the document.
>>>>>>>>>>>>>> For all this I was thinking of iterating over each word of
>>>>>>>>>>>>>> the web page, applying the above procedure, and finally
>>>>>>>>>>>>>> adding the newly created document to the index.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like someone to guide me in this direction.
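The "smart hack" asked for in #1 can be script inspection rather than full language identification, since only two scripts are in play here (Latin and Devanagari). A hedged sketch using only the JDK (`Character.UnicodeScript`, available since Java 7):

```java
public class ScriptDetect {
    // Script-based "language detection" for the two-script case: if a
    // token contains any Devanagari code point, treat it as Indic and
    // skip the English stemming/lowercasing path. No external language
    // identifier is needed.
    static boolean isDevanagari(String token) {
        return token.codePoints().anyMatch(cp ->
                Character.UnicodeScript.of(cp) == Character.UnicodeScript.DEVANAGARI);
    }

    public static void main(String[] args) {
        System.out.println(isDevanagari("Detection")); // false -> stem/lowercase it
        System.out.println(isDevanagari("\u0939\u093f\u0928\u094d\u0926\u0940")); // true -> index as-is
    }
}
```

As Robert notes elsewhere in the thread, the English stemmer and LowerCaseFilter leave Indic text alone anyway, so this per-token check is an optimization and a routing aid rather than a correctness requirement.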
>>>>>>>>>>>>>> I'm pretty sure people must have done similar/same things
>>>>>>>>>>>>>> earlier; I request them to guide me / point me to some
>>>>>>>>>>>>>> tutorials for the same. Else, help me out writing a custom
>>>>>>>>>>>>>> analyzer, only if that's not going to be too complex. LOL,
>>>>>>>>>>>>>> I'm a new user to lucene and know the basics of Java coding.
>>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --KK.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 5:30 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>>>>>> Yes, this is true. For starters, KK, it might be good to
>>>>>>>>>>>>>>> start up Solr and look at
>>>>>>>>>>>>>>> http://localhost:8983/solr/admin/analysis.jsp?highlight=on
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you want to stick with lucene, the WordDelimiterFilter
>>>>>>>>>>>>>>> is the piece you will want for your text, mainly for
>>>>>>>>>>>>>>> punctuation but also for format characters such as
>>>>>>>>>>>>>>> ZWJ/ZWNJ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 7:51 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>>>>>>>>>>>>>>>> You can also re-use the Solr analyzers, as far as I found
>>>>>>>>>>>>>>>> out. There is an issue in JIRA / a discussion on java-dev
>>>>>>>>>>>>>>>> to merge them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> Uwe Schindler
>>>>>>>>>>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>>>>>>>>>>> http://www.thetaphi.de
>>>>>>>>>>>>>>>> eMail: u...@thetaphi.de
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>>>>>>>>>>>>>>>> Sent: Thursday, June 04, 2009 1:18 PM
>>>>>>>>>>>>>>>>> To: java-user@lucene.apache.org
>>>>>>>>>>>>>>>>> Subject: Re: How to support stemming and case folding
>>>>>>>>>>>>>>>>> for english content mixed with non-english content?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> KK, OK, so you only really want to stem the English.
>>>>>>>>>>>>>>>>> This is good.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is it possible for you to consider using Solr? Solr's
>>>>>>>>>>>>>>>>> default analyzer for type 'text' will be good for your
>>>>>>>>>>>>>>>>> case. It will do the following:
>>>>>>>>>>>>>>>>> 1. tokenize on whitespace
>>>>>>>>>>>>>>>>> 2. handle both Indian-language and English punctuation
>>>>>>>>>>>>>>>>> 3. lowercase the English
>>>>>>>>>>>>>>>>> 4. stem the English
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Try a nightly build:
>>>>>>>>>>>>>>>>> http://people.apache.org/builds/lucene/solr/nightly/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Jun 4, 2009 at 1:12 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> Muir, thanks for your response.
>>>>>>>>>>>>>>>>>> I'm indexing Indian-language web pages which have got a
>>>>>>>>>>>>>>>>>> decent amount of English content mixed in therein. For
>>>>>>>>>>>>>>>>>> the time being I'm not going to use any stemmers, as we
>>>>>>>>>>>>>>>>>> don't have standard stemmers for Indian languages. So
>>>>>>>>>>>>>>>>>> what I want to do is like this:
>>>>>>>>>>>>>>>>>> Say I've a web page having Hindi content with 5%
>>>>>>>>>>>>>>>>>> English content. Then for Hindi I want to use the basic
>>>>>>>>>>>>>>>>>> whitespace analyzer, as we don't have stemmers for
>>>>>>>>>>>>>>>>>> this, as I mentioned earlier, and wherever English
>>>>>>>>>>>>>>>>>> appears I want it to be stemmed, tokenized, etc. [the
>>>>>>>>>>>>>>>>>> standard process used for English content].
>>>>>>>>>>>>>>>>>> As of now I'm using the whitespace analyzer for the
>>>>>>>>>>>>>>>>>> full content, which doesn't support case folding,
>>>>>>>>>>>>>>>>>> stemming, etc. So if there is an English word, say
>>>>>>>>>>>>>>>>>> "Detection", indexed as such, then searching for
>>>>>>>>>>>>>>>>>> "detection" or "detect" is not giving any results,
>>>>>>>>>>>>>>>>>> which is the expected behavior, but I want these kinds
>>>>>>>>>>>>>>>>>> of queries to give results.
>>>>>>>>>>>>>>>>>> I hope I made it clear. Let me know any ideas on doing
>>>>>>>>>>>>>>>>>> the same. And one more thing: I'm storing the full
>>>>>>>>>>>>>>>>>> webpage content under a single field; I hope this will
>>>>>>>>>>>>>>>>>> not make any difference, right?
>>>>>>>>>>>>>>>>>> It seems I have to use language identifiers, but do we
>>>>>>>>>>>>>>>>>> really need that? Because we've only non-English
>>>>>>>>>>>>>>>>>> content mixed with English [and not French or Russian,
>>>>>>>>>>>>>>>>>> etc.].
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What is the best way of approaching the problem? Any
>>>>>>>>>>>>>>>>>> thoughts!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> KK.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Jun 3, 2009 at 9:42 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> KK, is all of your Latin-script text actually English?
>>>>>>>>>>>>>>>>>>> Is there stuff like German or French mixed in?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And for your non-English content (your examples have
>>>>>>>>>>>>>>>>>>> been Indian writing systems), is it generally true
>>>>>>>>>>>>>>>>>>> that if you have Devanagari, you can assume it's
>>>>>>>>>>>>>>>>>>> Hindi? Or is there stuff like Marathi mixed in?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The reason I ask is that to invoke the right stemmers
>>>>>>>>>>>>>>>>>>> you really need some language detection, but perhaps
>>>>>>>>>>>>>>>>>>> in your case you can cheat and detect this based on
>>>>>>>>>>>>>>>>>>> scripts...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Robert
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jun 3, 2009 at 10:15 AM, KK <dioxide.softw...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>>>> I'm indexing some non-English content. But the page
>>>>>>>>>>>>>>>>>>>> also contains English content. As of now I'm using
>>>>>>>>>>>>>>>>>>>> WhitespaceAnalyzer for all content, and I'm storing
>>>>>>>>>>>>>>>>>>>> the full webpage content under a single field. Now we
>>>>>>>>>>>>>>>>>>>> require support for case folding and stemming for the
>>>>>>>>>>>>>>>>>>>> English content intermingled with the non-English
>>>>>>>>>>>>>>>>>>>> content. I must mention that we don't have stemming
>>>>>>>>>>>>>>>>>>>> and case folding for this non-English content. I'm
>>>>>>>>>>>>>>>>>>>> stuck with this. Someone do let me know how to
>>>>>>>>>>>>>>>>>>>> proceed for fixing this issue.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> KK.

--
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
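As a postscript, the chain recommended throughout the thread (whitespace tokenization, delimiter handling, lowercasing, stop-word removal, stemming) can be mimicked end-to-end in a few lines of plain Java. The punctuation-trimming regex and the trailing-"s" stemmer are toy stand-ins for WordDelimiterFilter and PorterStemFilter, but the sketch shows why the chain is safe for mixed content: none of the English-oriented steps alter a Devanagari token.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class PipelineSketch {
    static final Set<String> STOP = Set.of("the", "is", "of"); // tiny sample stop list

    // Toy stemmer stand-in for PorterStemFilter (illustration only).
    static String stem(String t) {
        return t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
    }

    // The recommended chain in miniature:
    //   whitespace tokenize -> trim punctuation (delimiter stand-in)
    //   -> lowercase -> stop-word removal -> stem.
    // A Devanagari token passes through untouched: \p{Punct} matches
    // only ASCII punctuation, toLowerCase is a no-op for Devanagari,
    // and the toy stemmer never matches.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(t -> t.replaceAll("^\\p{Punct}+|\\p{Punct}+$", ""))
                .filter(t -> !t.isEmpty())
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(t -> !STOP.contains(t))
                .map(PipelineSketch::stem)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // English gets lowercased, stop-filtered, and stemmed;
        // the Hindi word comes out exactly as it went in.
        System.out.println(analyze("Detection of \u0939\u093f\u0928\u094d\u0926\u0940 pages."));
    }
}
```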