There are two problems: a) Currently there is no such analyzer (I have the problem, too, I would also like to autodetect the language from a text like M$ Word does and switch the analyzers). b) If such an autodetect analyzer exists, you will have a problem on the searching side, because you should almost always use the same anylyzer on the indexing and search side. The problem is that search queries are normally very short and autodetection is hardly possible. If somebody now enters something parsed by the query parser using this auto-language-analyzer, the detection may fail and the wrongly-stemmed analyzer tokens will hit no results.
Uwe ----- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -----Original Message----- > From: weidong sun [mailto:lmcw...@gmail.com] > Sent: Thursday, May 14, 2009 5:19 PM > To: java-user@lucene.apache.org > Subject: Re: Question wrt Lucene analyzer for different language > > Thanks for the suprising quick response. :-) > > What I mean "correctly" here is that the specific analyzer can tokenize a > text mixed with English and that sepcfic langauge, for example, "12345 > ????" > or "????Text???" (where '?' is a character of that specific language and > "12345" and "Text" is english character) to have "12345" and "Text" > treated > as token and indexed as well. > BTW, I don't see a needs for stemming it by far since the information our > project encountered is just user's profile info. > > For the perticular ChineseAnalyzer, Can it do that? > > > On Thu, May 14, 2009 at 10:37 AM, Erick Erickson > <erickerick...@gmail.com>wrote: > > > No. What is "correctly"? Are you stemming? in which case using thesame > > analyzer on different languages will not work. > > > > This topic have been discussed on the user list frequently, so if you > > searched > > that archive (see: http://wiki.apache.org/lucene- > java/MailingListArchives) > > you'd find a wealth of information quickly... > > > > Best > > Erick > > > > On Thu, May 14, 2009 at 10:11 AM, weidong sun <lmcw...@gmail.com> wrote: > > > > > Hello, > > > > > > I am a newbie in Lucene world. I might ask some obvious question which > > > unfortunately I don't know the answer. Please help me 'grow'. > > > > > > We have a project intend to use Lucene search engine for search some > > user's > > > info stored our system. The user info might not be in English even it > > will > > > be stored in UTF-8 encoding. > > > > > > My question is, if I use one particular Lucene analyzer for a language > > > other > > > than English (e.g. ChineseAnalyzer or ArabicAnalyzer), can it still > able > > to > > > handle it correctly if user info is mixed with English character/word? > > > > > > Really appreciated with any answers. > > > > > > :-) > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org