Re: Language identification ??

Mathieu Lecarme Fri, 14 Mar 2008 07:49:25 -0700

Itamar Syn-Hershko wrote:

For what it's worth, I did something similar in my BidiAnalyzer so I can
index both Hebrew/Semitic texts and English/Latin words without switching
analyzers, giving each the proper treatment. I did it simply by testing the
first char and looking at its numeric value: if it falls between Hebrew
Aleph and Tav then it's Hebrew, else it's Latin. I wonder how you would spot
a French word in an English text, for instance (aren't there parallel words?)


Itamar.
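
The first-character test described in the quoted message might look roughly like the sketch below. This is not the BidiAnalyzer code itself: the function name script_of and the two constants are made up for illustration, and the range is assumed to be the Unicode block from U+05D0 (Alef) to U+05EA (Tav).

```python
# Hypothetical sketch of a first-character script test.
# Assumes the Hebrew letters occupy U+05D0 (Alef) through U+05EA (Tav).
HEBREW_ALEF = 0x05D0
HEBREW_TAV = 0x05EA


def script_of(token: str) -> str:
    """Classify a token as 'hebrew' or 'latin' from its first character only."""
    if not token:
        raise ValueError("empty token")
    first = ord(token[0])
    # Anything outside the Hebrew letter range is treated as Latin,
    # mirroring the "else it's Latin" rule in the message.
    return "hebrew" if HEBREW_ALEF <= first <= HEBREW_TAV else "latin"


if __name__ == "__main__":
    for word in ["שלום", "hello"]:
        print(word, script_of(word))
```

Looking only at the first character keeps the check cheap, but a token that starts with a digit or punctuation will fall through to "latin" under this rule.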

By comparing n-gram statistics.

Finding a foreign word in a sentence is very difficult: many words are very similar, and some are "faux amis" (false friends), with the same spelling but different meanings in each language. Querying in mixed languages seems a bit tricky. Mixing alphabets is more common (and easier to handle).


M.
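
To make "comparing n-gram statistics" concrete, here is a minimal sketch that builds character trigram profiles and picks the language whose profile is closest by cosine similarity. Everything in it is an assumption for illustration: the function names, the choice of trigrams and cosine similarity, and above all the two tiny sample sentences, which stand in for the large per-language corpora a real identifier would need.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count character trigrams, with words lowercased and space-padded."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def guess_language(text: str, profiles: dict) -> str:
    """Return the language whose profile is most similar to the text."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


# Toy training data (an assumption): far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "french": "le renard brun saute par dessus le chien paresseux et puis le chien dort au soleil",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for phrase in ["the dog sleeps", "le chien dort"]:
        print(phrase, "->", guess_language(phrase, PROFILES))
```

A whole sentence yields enough trigrams for the profiles to separate. A single word yields only a handful, which is one way to see why spotting one foreign word inside a sentence is so much harder than identifying the language of a full text.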
