Your example with Arabic wouldn't work reliably either - there are several other languages that use the Arabic script (Persian for example).

You could also try to pick out characters that are unique to a particular language - for example, Ą or Ł only occur in Polish (as far as I know...). Of course, you have no guarantee that a Polish-language query will actually contain any of those characters - so this method would only work as a supplement to another method.
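
As a rough illustration of the marker-character idea applied to the Arabic-script case above, here is a minimal Python sketch. The letter set is my own assumption: four letters that Persian adds to the standard Arabic alphabet. Other languages such as Urdu use them as well, so a hit only suggests "probably not Arabic", and a miss proves nothing. The function name is made up for the example.

```python
# Letters used in Persian that are not part of the standard Arabic alphabet:
# peh (U+067E), tcheh (U+0686), jeh (U+0698), gaf (U+06AF).
# Assumption: this hand-picked set is illustrative, not exhaustive, and
# these letters also appear in other languages such as Urdu.
PERSIAN_MARKERS = {"\u067e", "\u0686", "\u0698", "\u06af"}


def has_persian_marker(text):
    """Return True if the text contains at least one marker letter."""
    return any(ch in PERSIAN_MARKERS for ch in text)
```

A query without any marker letter could still be Persian, so this is only usable as a supplement to some other method.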

And don't forget that some words are written the same in several different languages.

This is the sort of problem that the end user can solve much better than the software can.

-MB


On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote:

Can you not provide the user with an option list to specify their input
language?

Language identification can be a pretty tricky field. There are some tricks
you can do with Unicode to identify language, e.g., \u0600 - \u06FF contains
the Arabic characters, so if your input contains lots of chars within this
range, you can guess that the input is Arabic, for example.
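
A minimal Python sketch of that range check, assuming a simple "share of letters inside the block" rule. The function names and the 0.5 threshold are arbitrary choices for illustration. Note that this detects the Arabic script, not the Arabic language.

```python
def arabic_script_ratio(text):
    """Fraction of the letters in text that fall in U+0600 - U+06FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return in_block / len(letters)


def looks_like_arabic_script(text, threshold=0.5):
    """Guess that the input is in Arabic script if enough letters match.

    Assumption: the 0.5 threshold is arbitrary and would need tuning.
    """
    return arabic_script_ratio(text) >= threshold
```

Digits, spaces and punctuation are ignored, so a short Arabic-script query with a few Latin letters mixed in can still pass the threshold.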


The problem comes with differentiating between the languages that use a Latin
alphabet. Again, there are multiple approaches, although the only one I know
of that worked pretty well for identifying European languages was to build a
model based on character bigrams (that is, sequences of two letters) [1]
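
To make the bigram idea concrete, here is a toy Python sketch. It is not necessarily the method described in [1]: it simply counts character bigrams per language and picks the language whose add-one-smoothed counts give the input the highest log-probability. The training sentences and all names are made up for the example, and a usable model would need far more training text.

```python
import math
from collections import Counter

# Assumption: tiny made-up training samples, far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the other thing with this that",
    "de": "der schnelle braune fuchs springt hinter den faulen hund "
          "und dann noch das andere ding",
}


def bigrams(text):
    """Return all overlapping two-character sequences, lowercased."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(samples):
    """Build one bigram frequency table per language."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def classify(text, models):
    """Return the language whose bigram model scores the text highest."""
    vocab_size = len(set().union(*models.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen bigrams do not give log(0).
        score = sum(
            math.log((counts[bg] + 1) / (total + vocab_size))
            for bg in bigrams(text)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


MODELS = train(SAMPLES)
```

With these samples, `classify("the other thing", MODELS)` picks `"en"` and `classify("der schnelle hund", MODELS)` picks `"de"`. Very short queries give the model little to work with, so the guess gets less reliable as the input shrinks.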


At the end of the day, Lucene cannot help you in choosing the correct language
as it doesn't know, and so it'll be up to you to add the necessary logic to
tell Lucene which Analyzers to utilise. :(


Andy

[1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C. Bigram and
trigram models for language identification and classification. In: Evett, L &
Rose, T (editors), Computational Linguistics for Speech and Handwriting
Recognition, AISB'94 Workshop, University of Leeds/AISB. 1994.


On Monday 11 Apr 2005 01:21, Eric Chow wrote:
Hello,

If I don't know the language of the input terms, how can I use
a different analyzer to search them?

For example, the input box accepts UTF-8 search text, they can be
anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
can I search any of them or all of them with Lucene?

Any example, please?


Best Regards, Eric

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
