Maybe you can use PerFieldAnalyzerWrapper.
(I never used this)

Ernesto.

Eric Chow escribió:

But how about one document contains more than two different languages ??


Eric

On Apr 12, 2005 12:13 AM, Andy Roberts <[EMAIL PROTECTED]> wrote:


On Monday 11 Apr 2005 14:55, Mike Baranczak wrote:


Your example with Arabic wouldn't work reliably either - there are
several other languages that use the Arabic script (Persian for
example).


Good point. Although you could try a simple approach to test for the
additional characters that exist in Persian but not in Arabic. Although, this
again is not fool-proof. A letter-model approach would be better but is
rather time consuming.



This is the sort of problem that the end user can solve much better
than the software can.



I completely agree, which is why I originally suggested prompting the user for
this info. It may be the case that for the majority of queries, English is
the usual language. And it is probably more feasible to do a test to
determine whether the query English or not (still very tricky, mind). If not,
then prompt the user to specify their input language because otherwise,
results will be poor.

Andy Roberts



-MB

On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote:


Can you not provide the user with a option list to specify their input
language?

Language identification can be a pretty tricky field. There are some
tricks
you can do with unicode to identify language, e.g., \u0600 - \u06FF
contains
the Arabic characters, so if you're input contains lots of chars
within this
range, you can guess that the input is Arabic, for example.

The problem comes with differentiating between the languages that use
a Latin
alphabet. Again, there are multiple approaches, although the only one
I know
of that worked pretty well for identifying European languages was to
build a
model based on character bigrams (that is, sequences of two letters)
[1]

At the end of the day, Lucene cannot help you in choosing the correct
language
as it doesn't know, and so it'll be up to you to add the necessary
logic to
tell Lucene which Analyzers to utilise. :(

Andy

[1] Churcher, G E; Hayes, J; Hughes, J S; Johnson, S; Souter, C.
Bigram and
trigram models for language identification and classification in:
Evett, L &
Rose,T (editors) Computational Linguistics for Speech and Handwriting
Recognition AISB'94 Workshop University of Leeds/AISB. 1994.

On Monday 11 Apr 2005 01:21, Eric Chow wrote:


Hello,

If I don't know the language of the input terms, how can I use
different analyzer to search it ?

For example, the input box accepts UTF-8 search text, they can be
anything, such as Chinese, Japanese, English, Russian, Deuch, etc. How
can search any of them or all of them with Lucene?

Any example, please?


Best Regards, Eric

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]





-- Ernesto De Santis - Colaborativa.net Córdoba 1147 Piso 6 Oficinas 3 y 4 (S2000AWO) Rosario, SF, Argentina.




--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to