On Fri, Feb 11, 2011 at 10:42:49AM +0100, Samuel Thibault wrote:
> Andreas Tille wrote on Fri 11 Feb 2011 10:19:07 +0100:
> > PS: I assume that a spell checker can be configured that way that it
> > can distinguish between writing an English text with some / several
> > mistakes and a text with say 50% error rate which is probably not
> > understandable anyway.
>
> Mmm, I think we've already had users that have even 50% error rate,
> simply because they mispell things. Yes, not everybody has even a basic
> knowledge level in english, but they still can provide useful input to a
> mailing list.

What limit to put on the error rate might be a topic for further
investigation, but I'm quite positive that there are reasonable
algorithms to detect what language a text is in, or rather to detect
whether a text attempts to be written in a certain language (which is
probably easier than guessing the language). Whether it is worth doing
some stats on the mailing list archive about this depends on whether we
ultimately want this language detection method for a SPAM filter or not.

My guess is that you will find a ratio of misspelled words / total
number of words which is a clear sign of non-English text, then some
intermediate area to which postings like the ones you are worried about
belong, and then there are the postings which are obviously trying hard
to be written in English.

I'd like to get rid of the clearly non-English texts. I have the
impression that we have been getting more and more of these for some
time, and I assume that Bayesian filters are not (yet) trained well
enough to detect these as SPAM. So we need to find some other means.

Kind regards

       Andreas.

-- 
http://fam-tille.de


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110211104413.gb2...@an3as.eu
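
For illustration only, the misspelled-words / total-words idea from the
message above could be sketched roughly as follows. Nothing here comes
from an existing list filter: the function names, the tiny word list and
both thresholds (0.6 and 0.9) are hypothetical placeholders, and real
cut-offs would have to come from stats on the mailing list archive.

```python
import re

# Runs of Unicode letters; digits, underscores and punctuation split words.
WORD_RE = re.compile(r"[^\W\d_]+")

# Hypothetical stand-in for a real English word list (e.g. a system
# dictionary file); far too small for actual use.
TINY_WORDLIST = frozenset(
    {"the", "list", "is", "a", "good", "this", "message", "not", "in"}
)


def misspelling_ratio(text, wordlist):
    """Return unknown words / total words, or None if there are no words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return None
    unknown = sum(1 for w in words if w not in wordlist)
    return unknown / len(words)


def classify(text, wordlist, attempt_max=0.6, non_english_min=0.9):
    """Sort a posting into one of the three zones guessed at in the mail.

    attempt_max and non_english_min are assumed values, not measured ones.
    """
    ratio = misspelling_ratio(text, wordlist)
    if ratio is None:
        return "empty"
    if ratio >= non_english_min:
        return "non-english"       # candidate for the SPAM filter
    if ratio > attempt_max:
        return "intermediate"      # the postings Samuel is worried about
    return "attempts-english"      # mistakes, but clearly trying


if __name__ == "__main__":
    for sample in ("teh list is a good lisst",
                   "teh lisst iz a gud list",
                   "это сообщение не на английском"):
        print(classify(sample, TINY_WORDLIST), "<-", sample)
```

Keeping the intermediate zone separate means only postings above the
upper threshold would ever be treated as non-English, so a poster with a
high error rate who is still attempting English is not dropped.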