On 2/2/2013 8:58 PM, John Hardin wrote:
That's the difficult part.
It's easy to look for specific strings in the body, or specific things
like the ratio of text to whitespace or text to images, but trying to
*interpret* the text to do something like detect which language it is in
is a *hard* problem. Even more so if you want to detect that the message
body is in more than one language, and determine the ratios.
The closest we can come today is to look at the character set of the
message and try to guess from that whether the *entire* message is in a
"foreign" language. This runs into problems where the character set of
the message supports multiple languages, like UTF-8 or some of the
character sets used by Windows.
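
As a rough illustration of why a Unicode message body is harder to judge than a charset header, here is a minimal Python sketch that estimates the share of each script among the letters of a text. It detects scripts, not languages (English and French both count as LATIN), and the function name script_ratios is made up for this example; it is not part of SpamAssassin.

```python
import unicodedata
from collections import Counter

def script_ratios(text):
    """Return the share of alphabetic characters per rough script label."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, whitespace, punctuation
        # Assumption: the first word of the Unicode character name is a
        # usable script label, e.g. "LATIN", "CYRILLIC", "HEBREW", "CJK".
        name = unicodedata.name(ch, "UNKNOWN")
        counts[name.split()[0]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}
```

A body that is half Latin and half Cyrillic letters would come back with both labels at 0.5, which is the kind of ratio a custom rule or plugin could then threshold.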
Do you have Bayes enabled? If so, are you training these messages as
spam? If you are doing this, then they should eventually hit BAYES_99
and if there are any other spammy characteristics that would probably be
enough to detect them.
If you would upload a few of these spams to someplace like pastebin and
point us at them then we will be able to do better than just guess and
make general suggestions.
Yes, I do understand that it's hard.
I have worked a bit with Perl, so I might be able to write something that
does this if it doesn't exist already.
I will try to explain even more.
The problem is that I receive mail containing an example of spam content
that did not originate in email, and I just need to categorize it as spam.
This is not what SA was built for, but it gives very good results in
general.
This is a specific case.
I have an active system that someone wrote in C# that scans the characters
and so on. The problem is that it is in C#, and that it is an active check
that crawls the site and parses it, rather than a RESTful system that
triggers the checks when needed.
This is an example of the content:
http://www.fpaste.org/yFOC/
It can even be a CMS post that someone received and wants to categorize
as spam.
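
Since the content does not arrive as email, one possible approach is to wrap it in a minimal message before handing it to a mail filter. The sketch below uses only Python's standard email package. The function name wrap_as_message and the placeholder addresses are assumptions for illustration, and how the resulting bytes are passed on (for example, on standard input to a scanner) is left out.

```python
from email.message import EmailMessage

def wrap_as_message(body, subject="(no subject)",
                    sender="unknown@example.invalid"):
    """Wrap arbitrary text (e.g. a CMS post) in a minimal email message."""
    msg = EmailMessage()
    # Placeholder headers: a real deployment would choose its own values.
    msg["From"] = sender
    msg["To"] = "scanner@example.invalid"
    msg["Subject"] = subject
    # set_content() picks a charset and transfer encoding for the text,
    # so non-ASCII bodies are encoded as UTF-8.
    msg.set_content(body)
    return msg.as_bytes()
```

Header-based rules will have little to work with on a message built this way, so the body rules and Bayes would carry most of the weight.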
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
Thanks,
--
Eliezer Croitoru
http://www1.ngtech.co.il
IT consulting for Nonprofit organizations
eliezer <at> ngtech.co.il