On Wed, 31 Dec 2014, Martin Gregorie wrote:
During last night I received a phishing message with a new (to me
anyway) form of obfuscation which can only be used inside HTML body text
using us-ascii encoding. The obfuscation was apparently aimed at SA and
similar scanners because it's not obvious to anybody reading the message:
every 'o' (0x6f) in the text is replaced by ο (Greek small omicron, U+03BF).
My Perl-fu isn't good enough to encode this in a regex - can anybody
help?
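
For the case described above, a minimal sketch of the matching logic, written in Python rather than Perl and not as a SpamAssassin rule. It assumes the scanner might see either the raw HTML character reference or the decoded character, so it accepts both. The names O_FORMS, fuzzy and find_obfuscated, and the sample word "account", are hypothetical and chosen only for illustration.

```python
import re

# Forms the letter 'o' might take: the literal letter, the Greek omicron
# character itself, or an HTML reference to it (decimal, hex, or named).
# Which of these a scanner actually sees is an assumption of this sketch.
O_FORMS = r"(?:o|\u03bf|&#0*959;|&#x0*3bf;|&omicron;)"


def fuzzy(word: str) -> re.Pattern:
    """Build a pattern for `word` where each 'o' may be any O_FORMS variant."""
    parts = [O_FORMS if ch.lower() == "o" else re.escape(ch) for ch in word]
    return re.compile("".join(parts), re.IGNORECASE)


def find_obfuscated(word: str, text: str) -> list[str]:
    """Return matches of `word` that are not the plain, unobfuscated spelling."""
    return [
        m.group(0)
        for m in fuzzy(word).finditer(text)
        if m.group(0).lower() != word.lower()
    ]


if __name__ == "__main__":
    sample = "Please verify your acc&#959;unt, not your account."
    print(find_obfuscated("account", sample))
```

Filtering out the plain spelling matters because the pattern alone also matches the ordinary word, and a rule should only fire on the disguised form.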
Take a look at 25_replace.cf (esp. tags C and E), and the various FUZZY_*
rules. It's not feasible to do broadly, but specific commonly-obfuscated
words and short phrases can be targeted, and that could help Bayes
recognize such messages as spammy more quickly.
I've been extending 25_replace.cf as I see more different types of
obfuscation like this, but it's a bit hard to keep up. Given a list of
Unicode code points that look like specific Latin letters, it should not
be hard to automatically generate the tag subrules for obfuscation for
all the encodings.
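
A rough sketch of what such a generator might look like, in Python. The LOOKALIKES table is a hypothetical, hand-picked sample and not a vetted list, and the set of encodings covered (raw UTF-8 bytes plus decimal and hex HTML references) is an assumption. The output is a plain regex alternation per letter; turning it into actual 25_replace.cf entries is left out.

```python
import re

# Hypothetical lookalike table: Latin letter -> code points resembling it.
# Illustrative entries only.
LOOKALIKES = {
    "o": [0x03BF, 0x043E],  # Greek omicron, Cyrillic o
    "a": [0x0430],          # Cyrillic a
    "e": [0x0435],          # Cyrillic ie
}


def encodings_of(cp: int) -> list[str]:
    """Regex fragments for one code point: UTF-8 bytes and HTML references."""
    utf8 = "".join(f"\\x{b:02x}" for b in chr(cp).encode("utf-8"))
    return [utf8, f"&#0*{cp};", f"&#x0*{cp:x};"]


def tag_pattern(letter: str, cps: list[int]) -> str:
    """Alternation matching the plain letter or any encoded lookalike."""
    alts = [letter]
    for cp in cps:
        alts.extend(encodings_of(cp))
    return "(?:" + "|".join(alts) + ")"


def build_tags(table: dict[str, list[int]]) -> dict[str, str]:
    """Map an upper-case tag name to the pattern for that letter."""
    return {k.upper(): tag_pattern(k, v) for k, v in sorted(table.items())}


if __name__ == "__main__":
    for name, pattern in build_tags(LOOKALIKES).items():
        print(name, pattern)
    # The generated pattern can be compiled as a bytes regex and tried
    # against UTF-8 encoded text.
    o_tag = re.compile(build_tags(LOOKALIKES)["O"].encode("ascii"))
    print(bool(o_tag.search("acc\u03bfunt".encode("utf-8"))))
```

Keeping the table separate from the encoding logic means a new lookalike is a one-line addition, and a new encoding is added once for every letter.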
Is there such a list anywhere already that could be leveraged? I know we
were discussing Unicode normalization of body text at one point; is there
anything there we could use?
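
One caveat on normalization, sketched in Python with the standard unicodedata module: the standard normalization forms fold compatibility variants such as fullwidth letters back to ASCII, but they leave cross-script lookalikes such as Greek omicron alone. So normalization by itself would probably not catch this obfuscation, and a separate lookalike table (for example, something along the lines of the Unicode confusables data) would still be needed. The helper name nfkc is hypothetical.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC (compatibility) normalization."""
    return unicodedata.normalize("NFKC", text)


fullwidth_o = "\uff4f"  # FULLWIDTH LATIN SMALL LETTER O
greek_o = "\u03bf"      # GREEK SMALL LETTER OMICRON

if __name__ == "__main__":
    # Fullwidth 'o' has a compatibility mapping to ASCII 'o'.
    print(nfkc(fullwidth_o) == "o")
    # Greek omicron is a distinct letter with no such mapping.
    print(nfkc(greek_o) == "o")
```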
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
It is not the business of government to make men virtuous or
religious, or to preserve the fool from the consequences of his own
folly. -- Henry George
-----------------------------------------------------------------------
944 days since the first successful private support mission to ISS (SpaceX)