To complicate it further, the text that needs language identification is
short, in most cases a single sentence like "I like Pepsi".
Can anything be done for this?

On Fri, Mar 14, 2008 at 8:18 PM, Mathieu Lecarme <[EMAIL PROTECTED]>
wrote:

> Itamar Syn-Hershko wrote:
> > For what it's worth, I did something similar in my BidiAnalyzer so I
> > can index both Hebrew/Semitic texts and English/Latin words without
> > switching analyzers, giving each the proper treatment. I did it simply
> > by testing the first char and looking at its numeric value: if it
> > falls between Hebrew Aleph and Taph then it's Hebrew, else it's Latin.
> > I wonder how you would spot a French word in an English text, for
> > instance (aren't there parallel words?)
> >
> > Itamar.
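The first-character test described above can be sketched roughly as follows. This is not the BidiAnalyzer code, just a minimal Python illustration; the names `script_of` and `tag_tokens` are made up for the example, and the range assumes the Unicode Hebrew letters Aleph (U+05D0) through Tav (U+05EA).

```python
# Hebrew letters occupy U+05D0 (Aleph) through U+05EA (Tav) in Unicode.
HEBREW_FIRST = "\u05D0"
HEBREW_LAST = "\u05EA"


def script_of(token):
    """Classify a token as 'hebrew' or 'latin' from its first character only."""
    if not token:
        raise ValueError("empty token")
    first = token[0]
    # Anything outside the Hebrew letter range is treated as Latin,
    # mirroring the "else it's Latin" rule from the message above.
    return "hebrew" if HEBREW_FIRST <= first <= HEBREW_LAST else "latin"


def tag_tokens(text):
    """Split on whitespace and pair each token with its guessed script."""
    return [(token, script_of(token)) for token in text.split()]
```

Because only the first character is inspected, digits, punctuation and any other script all end up in the "latin" bucket; that is fine for routing tokens between two treatments but says nothing about which Latin-script language a word belongs to.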
> By comparing n-gram statistics.
> Finding a foreign word in a sentence is very difficult: many words are
> very similar, and some are "faux amis" (false friends), with the same
> spelling but a different meaning in each language.
> Querying in mixed languages seems a bit tricky. Mixing alphabets
> is more common (and easier to handle).
>
> M.
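One possible reading of "comparing n-gram statistics" is to build a character trigram profile per language and pick the profile closest to the input. The sketch below assumes cosine similarity as the comparison and uses two tiny made-up sample sentences as training data; `ngram_profile`, `cosine`, `guess_language`, `SAMPLES` and `PROFILES` are all hypothetical names, and a real setup would need far larger corpora.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Count character n-grams of the lower-cased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data, invented for this example only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and I like to drink water with the others",
    "fr": "le renard brun saute par dessus le chien paresseux "
          "et je voudrais boire de l'eau avec les autres",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

With these toy profiles, `guess_language("I like the dog", PROFILES)` picks `"en"` and `guess_language("je voudrais de l'eau", PROFILES)` picks `"fr"`. A short sentence yields only a handful of trigrams, so the scores rest on little evidence and the guess becomes less reliable as the input shrinks.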
