Does anyone know what approach does Nutch uses?
-----Original Message----- From: Hacking Bear [mailto:[EMAIL PROTECTED] Sent: 06 September 2005 12:15 To: java-user@lucene.apache.org Subject: Re: Multiple Language Indexing and Searching On 9/6/05, Olivier Jaquemet <[EMAIL PROTECTED]> wrote: > > As far as your usage is concerned, it seems to be the right approach, > and I think the StandardAnalyzer does the job pretty right when it has > to deal with whatever language you want. I should look into exactly what it does. Does this StandardAnalyzer handle non-European languages like Chinese? Though, note that it won't deal with all languages' stop words but the > English ones, unless specified at index time But then if you change the > stop words at index time, what should you use at query time, some query > it won't work well. I think we can easily create our own super stop-word lists by copying from whatever other language's stop word lists we can find. But as far as I am concerned, each content (content in the sense of a > CMS) is known to have multiple language, and each of these language > *can* be indexed separately with no problem at all, and therefore a > dedicated analyser could be use. So I was wondering whether my approach > could be the right one of if it was over complex, and could introduce > some problem I could not see... (My approach being: one index per > language) My suggestion would be to create one index for all languages with each document having a 'lang' attribute. Lucene is quite scalable right? So this should not be an issue. During search, you can either default to turn on the 'lang' attribute condition or default to off, depending on what your users want most often. But it will be very easy to search multiple language documents. > I don't know if the developpers of lucene would agree, but from what > I've been browsing on the ML archives, those multiple language issues > seems to arrise quite often in the mailing list, and maybe some articles > like "best practices", "do's and don'ts" or "Lucene Architecture in > multiple language environement", might be really nice to see :) If some > of you have the time and the experience to write them I'll be really > thankful! :) What keywords do you use to search? Somehow, I cannot find any discussion about multiple language on the ML archive. I even did Google! :-) Or maybe I was giving the keywords in the wrong language? :-) --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]