to complicate it further ... the text for which language identification has to be done is small, in most cases a short sentence like " I like Pepsi ". Can something be done for this ?
On Fri, Mar 14, 2008 at 8:18 PM, Mathieu Lecarme <[EMAIL PROTECTED]> wrote: > Itamar Syn-Hershko a écrit : > > For what it worths, I did something similar in my BidiAnalyzer so I can > > index both Hebrew/Semitic texts and English/Latin words without > switching > > analyzers, giving each the proper treatment. I did it simply by testing > the > > first char and looking at its numeric value - so it falls between Hebrew > > Aleph and Taph then its Hebrew, else its Latin. I wonder how you would > spot > > a French word in an English text for instance (aren't there parallel > words?) > > > > Itamar. > With ngram statistic compare. > Finding foreign word in a sentence is very difficult, many words are > very similar, and some are "faux amis" : same differents means in each > language. > Querying in mixing language seems to be a bit vicious. Mixing alphabet > is more common (and easier to handle). > > M. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >