On 11/25/2013 06:10 PM, Jeremy Whiting wrote:
> Shivam,
>
> This is a very interesting project. Could you go into a bit of detail
> about the technical aspect of this? Another developer is working on
> using libtextcat to detect language and change the language of the kde
> text-to-speech system (Jovie) based on the detected language. It
> sounds to me like there could be some overlap between what he's doing
> and what you are doing also.
>
> thanks,
> Jeremy
>
> On Mon, Nov 25, 2013 at 3:53 PM, Christoph Feck <christ...@maxiom.de> wrote:
>> On Monday 25 November 2013 23:32:15 Shivam Makkar wrote:
>>> [...]
>>> So, I request you to upload as many articles as you can in various
>>> languages (or at least one in your native language) so that it can
>>> be detected by the algorithm.
>> Many NLP researchers simply use Wikipedia text. Regarding topic
>> coverage, peer-reviewed grammar and spelling, you will have a hard
>> time to beat it. You can find the raw XML as .bz2 downloads at the
>> Wikipedia sites. Stripping the XML/Wiki formatting away and leaving
>> only the text is a simple task for any Perl script coder.
>>
>> Christoph Feck (kdepepo)
I agree it would be nice to see some details about the
logic/implementation.

My understanding is that statistical analysis works well if you have a
good amount of generic text. If you have a chat window with 3 lines of
4-5 words each, not necessarily in full sentences, with slang,
abbreviations etc., it would be hard to detect the language properly
unless you have some other tricks up your sleeve. Also, a chat window
may get more text over time, but chances are you have already switched
to that language by then, and a per-window/per-application memory-based
layout will work here as well (if not better). There are of course some
shortcuts you could use: characters or character sequences used only in
some languages etc.
I think there were several projects that tried this before, so it would
be good to analyze what they achieved and why/if they failed.
Also there's an issue with tabs (or other internal separators): it's
easy to catch when the user switches windows, but not as easy when tabs
are switched, as they are implemented in different ways by different
toolkits. E.g. in Firefox the user can have dozens of tabs that use
different languages...
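
As a rough illustration of the "characters used only in some languages"
shortcut, here is a minimal Python sketch. It assumes only two candidate
languages (Ukrainian and Russian) and uses the letters that occur in one
alphabet but not the other; the function name and the table are
hypothetical, and other Cyrillic languages share some of these letters,
so a real table would need more care.

```python
# Letters present in one alphabet but not the other.
# Assumption: only these two languages are candidates.
UNIQUE_LETTERS = {
    "uk": set("\u0491\u0454\u0456\u0457"),  # ґ є і ї
    "ru": set("\u0451\u044a\u044b\u044d"),  # ё ъ ы э
}


def guess_by_unique_letters(text):
    """Return a language code if exactly one language's marker
    letters occur in the text, otherwise None (undecided)."""
    letters = set(text.lower())
    hits = [lang for lang, marks in UNIQUE_LETTERS.items()
            if letters & marks]
    return hits[0] if len(hits) == 1 else None
```

A short chat line with no marker letters simply comes back as None,
which is the case where the statistical approach would have to take
over.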

IMHO the trick to this (and my guess is that's where other attempts fell
short) is to make it very reliable: if the guessing module switches to
the right keyboard only 70-80% of the time, users would most probably
prefer manual switching. You also need this to work (reliably) with as
many languages as possible, e.g. most of the bug reports I see for
keyboard layouts in KDE are from users of languages that are not
in the list :)
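
One way to favour reliability over coverage is to switch only when the
detector is clearly confident and otherwise leave the layout alone. The
Python sketch below assumes the detector returns a score between 0 and 1
per language; the function name, the score format and both thresholds
are made up for illustration and are not taken from any existing
module.

```python
def should_switch(scores, min_score=0.9, min_margin=0.2):
    """Return the language to switch to, or None to keep the
    current layout. `scores` maps language code -> score in [0, 1]."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Switch only if the top guess is strong AND well ahead of the
    # second guess; otherwise do nothing and let the user decide.
    if best >= min_score and best - runner_up >= min_margin:
        return best_lang
    return None
```

With thresholds like these the module would do nothing in many cases,
but a wrong switch should annoy users more than a missed one.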

Having said that, it would be really nice to have this feature in KDE.

Andriy

P.S. The LanguageTool project is currently considering using
frequency dictionaries for spelling suggestions, although there's no
dictionary ready yet
(http://sourceforge.net/mailarchive/message.php?msg_id=31677788)
P.P.S. Shivam, no Ukrainian in the list? It would be hard to get a
positive review from the keyboard module maintainer ;)
